# QiTianTokenizer-XLarge

QiTianTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer, not tied to any specific model,
and fully compatible with the πŸ€— Transformers ecosystem.


## ✨ Overview

| Property | Value |
| --- | --- |
| Name | QiTianTokenizer-XLarge |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Byte-level BPE |
| Vocabulary Size | 128,000 tokens |
| Fast Implementation | βœ… Available (`QiTianTokenizerFast`) |
| Framework | πŸ€— transformers |
| License | Apache 2.0 |

## 🧩 QiTian Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
| --- | --- | --- | --- |
| QiTianTokenizer-Tiny | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| QiTianTokenizer-Base | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | Recommended for general use |
| QiTianTokenizer-Medium | 64k | Broad enough to capture fine-grained linguistic diversity while keeping model complexity reasonable. | Multilingual and high-quality general-purpose models |
| QiTianTokenizer-Large | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| QiTianTokenizer-XLarge | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |

All variants share consistent token definitions, special tokens, and compatible configurations.


βš™οΈ Usage

You can load this tokenizer directly with AutoTokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

# Example
text = "δ½ ε₯½οΌŒQiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])

### βž• Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

# Tokenize a batch; padding=True pads to the longest sequence in the batch
texts = ["Hello, δΈ–η•ŒοΌ", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
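What `padding=True` does, schematically: shorter sequences are extended with the pad token id, and a matching attention mask marks real versus padded positions. This is a generic illustration with made-up token ids, not output from QiTianTokenizer:

```python
# Schematic of dynamic padding: pad every sequence to the batch maximum
# and build an attention mask (1 = real token, 0 = padding).
def pad_batch(sequences: list[list[int]], pad_id: int):
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 9, 2], [7]], pad_id=0)
print(ids)   # [[5, 9, 2], [7, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask is what lets a model ignore the `<|pad|>` positions during training and inference.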

## πŸ’¬ Chat Template (`apply_chat_template`)

For chat-style data, you can format a list of messages using `apply_chat_template`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "δ½ ε₯½οΌŒδ»‹η»δΈ€δΈ‹ QiTianTokenizer。"},
]

# Render the conversation as a prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(text)

# If you need token ids directly:
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_dict=True,  # return a dict so inputs["input_ids"] is available
    return_tensors="pt",
)
print(inputs["input_ids"])
```

### Parameters

- `add_generation_prompt`
  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
  - `False`: do not append a generation prompt (useful for evaluating full dialogues).
- `enable_thinking`
  - `True`: wrap the assistant part in a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference uses it.
  - `False`: keep plain assistant content without the thinking wrapper.
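To make the two flags concrete, here is a hypothetical renderer built from this repository's special tokens (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|eot|>`, `<|begin_of_think|>`). The authoritative template ships with the tokenizer itself, so treat this as an illustration of the flags' behavior, not the exact output format:

```python
# Hypothetical chat-template renderer; the real template is defined by the
# tokenizer config. It only illustrates add_generation_prompt / enable_thinking.
def render_chat(messages, add_generation_prompt=False, enable_thinking=False):
    parts = []
    for m in messages:
        # Each message is wrapped in its role token and closed with end-of-turn.
        parts.append(f"<|{m['role']}|>{m['content']}<|eot|>")
    if add_generation_prompt:
        # Prime the model to answer as the assistant...
        parts.append("<|assistant|>")
        if enable_thinking:
            # ...optionally opening an internal reasoning span first.
            parts.append("<|begin_of_think|>")
    return "".join(parts)

msgs = [{"role": "user", "content": "Hi"}]
print(render_chat(msgs, add_generation_prompt=True))
# <|user|>Hi<|eot|><|assistant|>
```

With `add_generation_prompt=False` the string ends at the last `<|eot|>`, which is the form you want when scoring or evaluating complete dialogues.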

## πŸ“¦ Files Included

| File | Description |
| --- | --- |
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `tokenizer.py` | Tokenizer implementation |

πŸ” Special Tokens

Token Purpose
<|bos|> Beginning of sequence
<|eos|> End of sequence
<|eot|> End of turn (marks message boundary)
<|pad|> Padding token for batch alignment
<|mask|> Masked token for MLM-style objectives
<|system|> Defines system or meta-instruction context
<|user|> Marks user message boundary in conversational data
<|assistant|> Marks assistant message boundary
<|begin_of_think|> Begin internal reasoning span
<|end_of_think|> End internal reasoning span

## πŸ”– License

This tokenizer and vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute them under the same license terms.


## πŸ“š Citation

If you use QiTianTokenizer in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```