Apertus Tokenizer Extension
This repo packages the raw tokenizer arms used for the Apertus extension comparison work.
Layout:
fresh/*: fresh-discovery tokenizers trained from the two corpus mixescontinuous/*: true append-only continuous-BPE extensions starting from Apertus
Included arms:
fresh/glossapi_only_50kfresh/glossapi_plus_hplt_70_30_50kcontinuous/continuous_glossapi_only_156672continuous/continuous_glossapi_plus_hplt_70_30_156672
Each arm directory is directly loadable with transformers.AutoTokenizer.from_pretrained(...).
Continuous arms also include replication_check.json and front_end_contract_check.json.
The comparable merged cutoff grid is handled separately; this repo contains the four raw tokenizer arms only.
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support