---
language:
- en
license: apache-2.0
tags:
- tokenizer
- unigram
- minipile
- concept-encoder
- chatml
- morphology
---

# Custom Unigram Tokenizer for Minipile (65k Vocab)

This is a Unigram tokenizer (SentencePiece-style) trained on the [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) dataset.

**Language**: English (en). *Note: While the Unigram algorithm handles Unicode characters, the vocabulary is optimized for English text, code, and common technical terms found in Minipile.*

It was developed for the **[MrCogito](https://github.com/ksopyla/MrCogito)** project, which explores novel transformer architectures such as the **Concept Encoder**.

## Training Details

- **Dataset**: [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)
- **Sample Size**: 50,000 documents
- **Preprocessing**: Documents were truncated to a maximum length of 4096 characters to keep training stable while preserving local context (code blocks, LaTeX, paragraphs).
- **Algorithm**: Unigram (SentencePiece)
- **Vocab Size**: 65,536
- **Normalization**: NFKC (Cased)
- **Pre-tokenization**: Metaspace

A sketch of this training setup is shown after the feature list below.

## Motivation for Concept Encoders

This tokenizer is specifically optimized for **Concept Encoder** and **Concept Decoder** architectures (e.g., Perceiver IO, latent transformers).

### Why Unigram for Concept Encoding?

Concept Encoders work by compressing a sequence of tokens $T$ into a smaller set of abstract latent vectors (concepts) $C$. The efficiency of this compression depends heavily on the input quality:

* **The Problem with BPE (BERT/GPT)**: BPE is a greedy compression algorithm. It often splits words into arbitrary frequent chunks (e.g., `unbe` + `liev` + `able`). A Concept Encoder must waste model capacity "repairing" these arbitrary splits to understand the word before it can even begin extracting the higher-level concept.
* **The Unigram Advantage**: The Unigram algorithm is probabilistic and tends to preserve linguistically meaningful morphological units (e.g., `un` + `believ` + `able`). This acts as a "soft pre-compression", feeding the encoder units that already carry semantic weight.

This allows the Concept Encoder to focus its limited latent capacity on *semantic aggregation* rather than *morphological repair* (a toy encoder sketch follows the feature list below).

### Why Minipile?

While XLNet uses the Unigram algorithm, its vocabulary (circa 2019) lacks modern terms. By training a fresh Unigram model on **Minipile** (2023), we combine the superior morphological segmentation of Unigram with the vocabulary coverage of modern LLMs (code, Python, ChatML, technical terms).

## Features

- **Algorithm**: Unigram (SentencePiece).
- **Vocab Size**: 65,536 tokens.
- **Normalization**: NFKC (Cased).
- **Pre-tokenization**: Metaspace (reversible).
- **Chat Support**: Includes standard ChatML special tokens (`<|im_start|>`, `<|im_end|>`) and a pre-configured chat template.
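The snippet below is a minimal, hypothetical sketch of how a tokenizer with this configuration can be trained with the Hugging Face `tokenizers` library. It is not the exact training script: the helper `iter_texts` is made up, and the special-token list assumes the token strings from the "Special Tokens" section.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Pipeline from "Training Details": Unigram model, NFKC normalization,
# reversible Metaspace pre-tokenization/decoding.
tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = UnigramTrainer(
    vocab_size=65536,
    unk_token="<unk>",
    special_tokens=["<pad>", "<unk>", "<cls>", "<sep>", "<mask>",
                    "<|im_start|>", "<|im_end|>"],  # assumed token strings
)

dataset = load_dataset("JeanKaddour/minipile", split="train")

def iter_texts(num_docs=50_000, max_chars=4096):
    # 50k-document sample, each truncated to 4096 characters.
    for doc in dataset.select(range(num_docs)):
        yield doc["text"][:max_chars]

tokenizer.train_from_iterator(iter_texts(), trainer=trainer)
tokenizer.save("minipile-unigram-65k.json")
```

To make the $T \to C$ compression concrete, here is a toy Perceiver-style encoder in PyTorch: a fixed set of learned latent vectors cross-attends over the token embeddings, so a length-$T$ token sequence is summarized by a handful of concept vectors. This illustrates the general idea only; `ConceptEncoder`, `d_model`, and `num_concepts` are illustrative names, not the MrCogito implementation.

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Compress (batch, T, d_model) token embeddings into
    (batch, num_concepts, d_model) latent 'concept' vectors."""

    def __init__(self, d_model=512, num_concepts=64, num_heads=8):
        super().__init__()
        # Learned latent queries: one row per concept slot.
        self.concepts = nn.Parameter(torch.randn(num_concepts, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_embeddings):
        batch = token_embeddings.size(0)
        queries = self.concepts.unsqueeze(0).expand(batch, -1, -1)
        # Each concept query attends over all token embeddings.
        compressed, _ = self.cross_attn(queries, token_embeddings, token_embeddings)
        return self.norm(compressed)

token_embeddings = torch.randn(2, 128, 512)    # T = 128 tokens per sequence
concepts = ConceptEncoder()(token_embeddings)  # C = 64 concept vectors
print(concepts.shape)                          # torch.Size([2, 64, 512])
```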
## Usage

```python
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ksopyla/minipile-unigram-65k-50k")

# Basic encoding
text = "Hello, world! This is a test."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['▁Hello', ',', '▁world', '!', '▁This', '▁is', '▁a', '▁test', '.']

# Chat template
messages = [
    {"role": "user", "content": "What is the Concept Encoder?"},
    {"role": "assistant", "content": "It is a novel transformer architecture..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Output: <|im_start|>user\nWhat is the Concept Encoder?<|im_end|>\n<|im_start|>assistant\n...
```

## Special Tokens

- PAD: `<pad>`
- UNK: `<unk>`
- CLS: `<cls>`
- SEP: `<sep>`
- MASK: `<mask>`
- Chat: `<|im_start|>`, `<|im_end|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|endoftext|>`
- Unused: `<|unused0|>` ... `<|unused99|>` (100 reserved tokens)
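A quick way to sanity-check this inventory (assuming the token strings listed above) is to map each token to its vocabulary id; `convert_tokens_to_ids` falls back to the unknown-token id for anything not actually in the vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ksopyla/minipile-unigram-65k-50k")

# Print the vocabulary id assigned to each special token.
for token in ["<pad>", "<unk>", "<cls>", "<sep>", "<mask>",
              "<|im_start|>", "<|im_end|>", "<|endoftext|>", "<|unused0|>"]:
    print(f"{token!r} -> id {tokenizer.convert_tokens_to_ids(token)}")
```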