# QiTianTokenizer-XLarge

QiTianTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer, not tied to any specific model,
and fully compatible with the πŸ€— Transformers ecosystem.


## ✨ Overview

| Property | Value |
| --- | --- |
| Name | QiTianTokenizer-XLarge |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Byte-level BPE |
| Vocabulary Size | 128,000 tokens |
| Fast Implementation | βœ… Available (`QiTianTokenizerFast`) |
| Framework | πŸ€— transformers |
| License | Apache 2.0 |

## 🧩 QiTian Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
| --- | --- | --- | --- |
| QiTianTokenizer-Tiny | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| QiTianTokenizer-Base | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | Recommended for general use |
| QiTianTokenizer-Medium | 64k | Broad enough to capture fine-grained linguistic diversity while keeping model complexity reasonable. | Multilingual and high-quality general-purpose models |
| QiTianTokenizer-Large | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| QiTianTokenizer-XLarge | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |

All variants share consistent token definitions, special tokens, and compatible configurations.


βš™οΈ Usage

You can load this tokenizer directly with AutoTokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

# Example
text = "δ½ ε₯½οΌŒQiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])

### βž• Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

# Tokenize a batch; padding=True pads to the longest sequence in the batch
texts = ["Hello, δΈ–η•ŒοΌ", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
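What `padding=True` does, schematically: shorter sequences are extended with the pad token id, and a matching attention mask marks real versus padded positions. This is a generic illustration with made-up token ids, not output from QiTianTokenizer:

```python
# Schematic of dynamic padding: pad every sequence to the batch maximum
# and build an attention mask (1 = real token, 0 = padding).
def pad_batch(sequences: list[list[int]], pad_id: int):
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 9, 2], [7]], pad_id=0)
print(ids)   # [[5, 9, 2], [7, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask is what lets a model ignore the `<|pad|>` positions during training and inference.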

## πŸ’¬ Chat Template (`apply_chat_template`)

For chat-style data, you can format a list of messages using `apply_chat_template`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "δ½ ε₯½οΌŒδ»‹η»δΈ€δΈ‹ QiTianTokenizer。"},
]

# Render the conversation as a prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(text)

# If you need token ids directly:
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_dict=True,  # return a dict so inputs["input_ids"] is available
    return_tensors="pt",
)
print(inputs["input_ids"])
```

### Parameters

- `add_generation_prompt`
  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
  - `False`: do not append a generation prompt (useful for evaluating full dialogues).
- `enable_thinking`
  - `True`: wrap the assistant part in a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference uses it.
  - `False`: keep plain assistant content without the thinking wrapper.
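To make the two flags concrete, here is a hypothetical renderer built from this repository's special tokens (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|eot|>`, `<|begin_of_think|>`). The authoritative template ships with the tokenizer itself, so treat this as an illustration of the flags' behavior, not the exact output format:

```python
# Hypothetical chat-template renderer; the real template is defined by the
# tokenizer config. It only illustrates add_generation_prompt / enable_thinking.
def render_chat(messages, add_generation_prompt=False, enable_thinking=False):
    parts = []
    for m in messages:
        # Each message is wrapped in its role token and closed with end-of-turn.
        parts.append(f"<|{m['role']}|>{m['content']}<|eot|>")
    if add_generation_prompt:
        # Prime the model to answer as the assistant...
        parts.append("<|assistant|>")
        if enable_thinking:
            # ...optionally opening an internal reasoning span first.
            parts.append("<|begin_of_think|>")
    return "".join(parts)

msgs = [{"role": "user", "content": "Hi"}]
print(render_chat(msgs, add_generation_prompt=True))
# <|user|>Hi<|eot|><|assistant|>
```

With `add_generation_prompt=False` the string ends at the last `<|eot|>`, which is the form you want when scoring or evaluating complete dialogues.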

## πŸ“¦ Files Included

| File | Description |
| --- | --- |
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `tokenizer.py` | Tokenizer implementation |

πŸ” Special Tokens

Token Purpose
<|bos|> Beginning of sequence
<|eos|> End of sequence
<|eot|> End of turn (marks message boundary)
<|pad|> Padding token for batch alignment
<|mask|> Masked token for MLM-style objectives
<|system|> Defines system or meta-instruction context
<|user|> Marks user message boundary in conversational data
<|assistant|> Marks assistant message boundary
<|begin_of_think|> Begin internal reasoning span
<|end_of_think|> End internal reasoning span

## πŸ”– License

This tokenizer and vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute them under the same license terms.


## πŸ“š Citation

If you use QiTianTokenizer in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```