Apertus Tokenizer Extension

This repo packages the raw tokenizer arms used for the Apertus extension comparison work.

Layout:

  • fresh/*: fresh-discovery tokenizers trained from the two corpus mixes
  • continuous/*: true append-only continuous-BPE extensions starting from Apertus

Included arms:

  • fresh/glossapi_only_50k
  • fresh/glossapi_plus_hplt_70_30_50k
  • continuous/continuous_glossapi_only_156672
  • continuous/continuous_glossapi_plus_hplt_70_30_156672

Each arm directory is directly loadable with transformers.AutoTokenizer.from_pretrained(...). Continuous arms also include replication_check.json and front_end_contract_check.json.

The comparable merged cutoff grid is handled separately; this repo contains the four raw tokenizer arms only.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support