---
language:
- en
license: apache-2.0
tags:
- tokenizer
- unigram
- minipile
- concept-encoder
- chatml
- morphology
---

# Custom Unigram Tokenizer for Minipile (65k Vocab)

This is a Unigram tokenizer (SentencePiece-style) trained on the [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) dataset.

**Language**: English (en). *Note: While the Unigram algorithm handles Unicode characters, the vocabulary is optimized for English text, code, and common technical terms found in Minipile.*

It was developed for the **[MrCogito](https://github.com/ksopyla/MrCogito)** project, which explores novel transformer architectures such as the **Concept Encoder**.

## Training Details

- **Dataset**: [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)
- **Sample Size**: 50,000 documents
- **Preprocessing**: Documents were truncated to a maximum length of 4096 characters to keep training stable while preserving local context (code blocks, LaTeX, paragraphs).
- **Algorithm**: Unigram (SentencePiece)
- **Vocab Size**: 65,536
- **Normalization**: NFKC (Cased)
- **Pre-tokenization**: Metaspace

A sketch of this training setup is shown after the feature list below.

## Motivation for Concept Encoders

This tokenizer is specifically optimized for **Concept Encoder** and **Concept Decoder** architectures (e.g., Perceiver IO, latent transformers).

### Why Unigram for Concept Encoding?

Concept Encoders work by compressing a sequence of tokens $T$ into a smaller set of abstract latent vectors (concepts) $C$. The efficiency of this compression depends heavily on the input quality:

* **The Problem with BPE (BERT/GPT)**: BPE is a greedy compression algorithm. It often splits words into arbitrary frequent chunks (e.g., `unbe` + `liev` + `able`). A Concept Encoder must waste model capacity "repairing" these arbitrary splits to understand the word before it can even begin extracting the higher-level concept.
* **The Unigram Advantage**: The Unigram algorithm is probabilistic and tends to preserve linguistically meaningful morphological units (e.g., `un` + `believ` + `able`). This acts as a "soft pre-compression", feeding the encoder units that already carry semantic weight.

This allows the Concept Encoder to focus its limited latent capacity on *semantic aggregation* rather than *morphological repair* (a toy encoder sketch follows the feature list below).

### Why Minipile?

While XLNet uses the Unigram algorithm, its vocabulary (circa 2019) lacks modern terms. By training a fresh Unigram model on **Minipile** (2023), we combine the superior morphological segmentation of Unigram with the vocabulary coverage of modern LLMs (code, Python, ChatML, technical terms).

## Features

- **Algorithm**: Unigram (SentencePiece).
- **Vocab Size**: 65,536 tokens.
- **Normalization**: NFKC (Cased).
- **Pre-tokenization**: Metaspace (reversible).
- **Chat Support**: Includes standard ChatML special tokens (`<|im_start|>`, `<|im_end|>`) and a pre-configured chat template.
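The snippet below is a minimal, hypothetical sketch of how a tokenizer with this configuration can be trained with the Hugging Face `tokenizers` library. It is not the exact training script: the helper `iter_texts` is made up, and the special-token list assumes the token strings from the "Special Tokens" section.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Pipeline from "Training Details": Unigram model, NFKC normalization,
# reversible Metaspace pre-tokenization/decoding.
tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = UnigramTrainer(
    vocab_size=65536,
    unk_token="<unk>",
    special_tokens=["<pad>", "<unk>", "<cls>", "<sep>", "<mask>",
                    "<|im_start|>", "<|im_end|>"],  # assumed token strings
)

dataset = load_dataset("JeanKaddour/minipile", split="train")

def iter_texts(num_docs=50_000, max_chars=4096):
    # 50k-document sample, each truncated to 4096 characters.
    for doc in dataset.select(range(num_docs)):
        yield doc["text"][:max_chars]

tokenizer.train_from_iterator(iter_texts(), trainer=trainer)
tokenizer.save("minipile-unigram-65k.json")
```

To make the $T \to C$ compression concrete, here is a toy Perceiver-style encoder in PyTorch: a fixed set of learned latent vectors cross-attends over the token embeddings, so a length-$T$ token sequence is summarized by a handful of concept vectors. This illustrates the general idea only; `ConceptEncoder`, `d_model`, and `num_concepts` are illustrative names, not the MrCogito implementation.

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Compress (batch, T, d_model) token embeddings into
    (batch, num_concepts, d_model) latent 'concept' vectors."""

    def __init__(self, d_model=512, num_concepts=64, num_heads=8):
        super().__init__()
        # Learned latent queries: one row per concept slot.
        self.concepts = nn.Parameter(torch.randn(num_concepts, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_embeddings):
        batch = token_embeddings.size(0)
        queries = self.concepts.unsqueeze(0).expand(batch, -1, -1)
        # Each concept query attends over all token embeddings.
        compressed, _ = self.cross_attn(queries, token_embeddings, token_embeddings)
        return self.norm(compressed)

token_embeddings = torch.randn(2, 128, 512)    # T = 128 tokens per sequence
concepts = ConceptEncoder()(token_embeddings)  # C = 64 concept vectors
print(concepts.shape)                          # torch.Size([2, 64, 512])
```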
## Usage

```python
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ksopyla/minipile-unigram-65k-50k")

# Basic encoding
text = "Hello, world! This is a test."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['▁Hello', ',', '▁world', '!', '▁This', '▁is', '▁a', '▁test', '.']

# Chat template
messages = [
    {"role": "user", "content": "What is the Concept Encoder?"},
    {"role": "assistant", "content": "It is a novel transformer architecture..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Output: <|im_start|>user\nWhat is the Concept Encoder?<|im_end|>\n<|im_start|>assistant\n...
```

## Special Tokens

- PAD: `<pad>`
- UNK: `<unk>`
- CLS: `<cls>`
- SEP: `<sep>`
- MASK: `<mask>`
- Chat: `<|im_start|>`, `<|im_end|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|endoftext|>`
- Unused: `<|unused0|>` ... `<|unused99|>` (100 reserved tokens)
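A quick way to sanity-check this inventory (assuming the token strings listed above) is to map each token to its vocabulary id; `convert_tokens_to_ids` falls back to the unknown-token id for anything not actually in the vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ksopyla/minipile-unigram-65k-50k")

# Print the vocabulary id assigned to each special token.
for token in ["<pad>", "<unk>", "<cls>", "<sep>", "<mask>",
              "<|im_start|>", "<|im_end|>", "<|endoftext|>", "<|unused0|>"]:
    print(f"{token!r} -> id {tokenizer.convert_tokens_to_ids(token)}")
```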