I can’t seem to find information anywhere about the exact type of tokenization used by the models xlm-roberta-base and huggyllama/llama-7b. I know they use SentencePiece, but they could be SentencePiece BPE or SentencePiece Unigram. Do they fall into either of those two categories, or do they use some other SentencePiece-style algorithm that isn’t one of the two?
Please help!
hmm…?
They do fall cleanly into the two main SentencePiece categories.
xlm-roberta-base (XLM-R) uses SentencePiece Unigram (Unigram LM). huggyllama/llama-7b (LLaMA 7B family) uses SentencePiece BPE (Byte-Pair Encoding), with byte fallback behavior in common implementations. (Hugging Face)
No “third unnamed SentencePiece-y algorithm” is needed to explain either model.
1) Background: “SentencePiece” is a library, not one algorithm
SentencePiece is a tokenizer toolkit that can train and apply different subword model types. In particular, it implements:
- BPE (merge-based)
- Unigram LM (probabilistic vocab + best segmentation)
The SentencePiece project states it “implements subword units” including BPE and unigram language model. (GitHub)
The official options list shows --model_type can be unigram, bpe, word, or char. (GitHub)
So the correct question is exactly what you asked: “is this SentencePiece model Unigram or BPE?”
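To make the "library, not one algorithm" point concrete, here is a minimal training sketch. It is only an illustration, not how XLM-R or LLaMA were trained: corpus.txt, the vocab size, and the model_prefix are placeholders.

```python
# Minimal sketch (assumptions: a local corpus.txt exists; sizes/names are placeholders).
# The same SentencePiece trainer produces either algorithm, selected by model_type.
import sentencepiece as spm

for mt in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",         # placeholder training corpus
        model_prefix=f"demo_{mt}",  # writes demo_<mt>.model / demo_<mt>.vocab
        vocab_size=8000,
        model_type=mt,              # same switch as --model_type on the CLI
    )
```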
2) The two algorithms in plain English
SentencePiece Unigram (Unigram LM)
Think “pick the best segmentation from a fixed vocabulary.”
- Training builds a vocabulary and assigns each token a probability.
- Tokenization finds the most likely sequence of tokens that composes the input (often via a Viterbi-style dynamic program).
- Practical effect: it can choose among multiple valid segmentations by likelihood.
SentencePiece explicitly names this “Unigram language model” and exposes it as --model_type=unigram. (GitHub)
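A hedged sketch of that "choose among segmentations" property, using the XLM-R SentencePiece file discussed later in this answer (the printed pieces are illustrative, not asserted; assumes huggingface_hub and sentencepiece are installed):

```python
# Hedged sketch: a Unigram model scores alternative segmentations, so SentencePiece
# can return an n-best list (this only works for Unigram models, not BPE).
from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download("FacebookAI/xlm-roberta-base", "sentencepiece.bpe.model")
sp = spm.SentencePieceProcessor(model_file=path)
print(sp.encode("unbelievable", out_type=str))       # best (Viterbi) segmentation
print(sp.nbest_encode_as_pieces("unbelievable", 3))  # top-3 alternative segmentations
```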
SentencePiece BPE
Think “start with characters then repeatedly merge the most useful pairs.”
- Training starts from smaller symbols and learns merge rules.
- Tokenization applies those merges (greedily, in practice).
SentencePiece exposes this as --model_type=bpe. (GitHub)
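As a toy illustration of "merge the most useful pairs" (this is not SentencePiece's actual implementation, and the tiny corpus is made up), one BPE training round just counts adjacent symbol pairs and merges the most frequent one:

```python
# Toy sketch of one BPE training step (illustrative only, not SentencePiece code):
# count adjacent symbol pairs across the corpus and merge the most frequent pair.
from collections import Counter

word_freqs = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}
pair_counts = Counter()
for symbols, freq in word_freqs.items():
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

best_pair = max(pair_counts, key=pair_counts.get)
print("merge rule learned:", best_pair)  # e.g. ('l', 'o'); repeat until the vocab size is reached
```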
Important nuance for LLaMA-style BPE
Many LLaMA-family tokenizers are described as BPE with byte fallback. That means “if a Unicode character is out of vocabulary coverage, fall back to encoding it as bytes so you do not emit <unk>.” Hugging Face’s LLaMA tokenizer docs call out ByteFallback explicitly. (Hugging Face)
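A hedged way to see byte fallback in practice (assumes the huggyllama/llama-7b tokenizer.model is downloadable and sentencepiece is installed; the exact pieces depend on the vocabulary and are not asserted):

```python
# Hedged sketch: an out-of-vocabulary character decomposes into <0xNN> byte pieces
# instead of <unk>. Output is illustrative only.
from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download("huggyllama/llama-7b", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=path)
print(sp.encode("🦙", out_type=str))  # expect byte pieces such as ['▁', '<0xF0>', '<0x9F>', ...]
```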
3) XLM-R (xlm-roberta-base) tokenization type
The XLM-R paper is explicit:
- “We apply subword tokenization directly on raw text data using SentencePiece … with a unigram language model.”
- It also notes a 250K vocabulary size for XLM-R (a key design choice for multilingual coverage).
So XLM-R is not “SentencePiece maybe-BPE.” It is SentencePiece Unigram.
Why it still confuses people
Two common sources of confusion:
- Misleading filenames / artifacts: people run into files named like sentencepiece.bpe.model for XLM-R checkpoints and assume “must be BPE,” then hit odd training or loading issues. This exact confusion shows up in community threads. (PyTorch Forums)
- “Fast tokenizer” documentation blur: Hugging Face’s XLM-R docs describe the slow tokenizer as SentencePiece-based, and the fast tokenizer section uses wording like “Based on BPE,” which can be read as “the model is BPE.” That wording is a known source of misunderstanding because the underlying SentencePiece model for XLM-R is Unigram per the paper. (Hugging Face)
Bottom line: treat the paper and/or the SentencePiece model’s internal model_type as authoritative.
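If you also want a quick cross-check from the Hugging Face side, here is a hedged sketch. It assumes transformers is installed (plus sentencepiece/protobuf if a slow-to-fast conversion is needed) and that the converted fast tokenizer mirrors the underlying algorithm; it is a secondary signal, not a substitute for reading the proto.

```python
# Hedged cross-check: the fast tokenizer's backend model class usually mirrors the
# underlying SentencePiece algorithm. Expected output (Unigram / BPE) is not asserted.
from transformers import AutoTokenizer

for repo in ("FacebookAI/xlm-roberta-base", "huggyllama/llama-7b"):
    tok = AutoTokenizer.from_pretrained(repo)  # fast tokenizer by default
    print(repo, "->", type(tok.backend_tokenizer.model).__name__)
```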
4) LLaMA 7B (huggyllama/llama-7b) tokenization type
For LLaMA-family models, the standard tokenizer is SentencePiece BPE.
Two strong anchors:
- Hugging Face model docs (implementation-facing):
  - “The tokenizer is a byte-pair encoding model based on SentencePiece.” (Hugging Face)
  - The tokenizer class documentation also states: “Based on byte-level Byte-Pair-Encoding. This uses notably ByteFallback and no normalization.” (Hugging Face)
- LLaMA paper (training-facing):
  - The LLaMA paper states they tokenize with the BPE algorithm, using the SentencePiece implementation. (arXiv)
So for your specific model ID (huggyllama/llama-7b): it is in the LLaMA 1 lineage and uses the same tokenizer design, i.e. SentencePiece BPE, with byte fallback behavior commonly referenced in implementations. (Hugging Face)
5) The most reliable way to verify: read model_type from the .model itself
If you want an answer that is immune to:
- confusing filenames,
- wrapper docs,
- “fast vs slow” implementation details,
then inspect the SentencePiece ModelProto inside the tokenizer model file.
SentencePiece stores an enum with UNIGRAM = 1 and BPE = 2 in sentencepiece_model.proto. (GitHub)
And the options doc confirms the conceptual mapping (model_type: unigram, bpe, word, char). (GitHub)
Practical approach (Python)
Conceptually:
- download the tokenizer model file (often sentencepiece.bpe.model or tokenizer.model)
- parse it as a SentencePiece ModelProto
- read trainer_spec.model_type
You can do this with the protobuf definition and a small script. The key is: the proto field is the truth, regardless of filenames. (GitHub)
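A minimal sketch of those three steps (a fuller, more defensive verifier script appears later in this thread; assumes huggingface_hub, sentencepiece, and protobuf are installed):

```python
# Minimal sketch: read trainer_spec.model_type straight from the serialized ModelProto.
from pathlib import Path
from huggingface_hub import hf_hub_download
import sentencepiece.sentencepiece_model_pb2 as sp_pb2

path = Path(hf_hub_download("FacebookAI/xlm-roberta-base", "sentencepiece.bpe.model"))
proto = sp_pb2.ModelProto()
proto.ParseFromString(path.read_bytes())
print(proto.trainer_spec.model_type)  # 1 = UNIGRAM, 2 = BPE (per sentencepiece_model.proto)
```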
6) What “another SentencePiece-y algorithm” could mean (and why it does not apply here)
SentencePiece can also operate in:
- word mode
- char mode
But those are explicitly named, and neither XLM-R nor LLaMA are trained that way in the canonical descriptions. (GitHub)
So your choices for these two models really are the two you listed:
- XLM-R: Unigram
- LLaMA: BPE
7) Curated high-quality links (primary + practical)
Primary sources
- XLM-R paper PDF (states SentencePiece + unigram LM; 250K vocab): https://arxiv.org/pdf/1911.02116
- SentencePiece repo (states it implements BPE + unigram LM): https://github.com/google/sentencepiece (GitHub)
- SentencePiece options (model_type=unigram|bpe|word|char): https://github.com/google/sentencepiece/blob/master/doc/options.md (GitHub)
- SentencePiece experiments doc (explicitly contrasts --model_type=bpe vs --model_type=unigram): https://github.com/google/sentencepiece/blob/master/doc/experiments.md (GitHub)
- SentencePiece proto enum (UNIGRAM, BPE): https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto (GitHub)
LLaMA implementation-facing docs
- Hugging Face LLaMA docs (states SentencePiece-based BPE; mentions ByteFallback in tokenizer docs): https://huggingface.co/docs/transformers/en/model_doc/llama (Hugging Face)
“Why is XLM-R’s file called sentencepiece.bpe.model?” confusion thread
- PyTorch forum thread showing the exact naming confusion and related training pitfalls: https://discuss.pytorch.org/t/train-a-tokenizer-like-xlm-roberta-tokenizer/155641 (PyTorch Forums)
Bullet summary
- XLM-R (xlm-roberta-base): SentencePiece Unigram (Unigram LM). Paper states it directly.
- LLaMA 7B (huggyllama/llama-7b): SentencePiece BPE, commonly with byte fallback behavior. HF docs state BPE and ByteFallback; paper states BPE + SentencePiece. (Hugging Face)
- If you want “cannot be argued with” verification, parse the SentencePiece model file and read trainer_spec.model_type from the proto. (GitHub)
ok thank you, but do you know this for sure, or did you just prompt ChatGPT for it? Thanks in advance!
This is information I asked ChatGPT for, based on the information I had on hand and the search results I provided. So it’s somewhere in between…? I didn’t try it myself.
ok thank you, I need to dig a bit deeper though as this is quite important for me.
I tested the actual model’s behavior with code for now. If you need to dig deeper for some reason, I think it’s better to contact the author directly rather than using forums…
FacebookAI/xlm-roberta-base uses SentencePiece Unigram LM (not BPE). huggyllama/llama-7b uses SentencePiece BPE, with byte fallback enabled (and “no normalization”/identity behavior in practice).
There is no “third, unnamed SentencePiece-y algorithm” involved here—these two models fall cleanly into Unigram vs BPE as defined by SentencePiece itself.
Why this is definitive
A SentencePiece tokenizer is stored as a serialized protobuf (*.model). That file contains:
- trainer_spec.model_type with an enum: UNIGRAM = 1, BPE = 2 (also WORD = 3, CHAR = 4) (GitHub)
- trainer_spec.byte_fallback (whether unknown UTF-8 characters can be decomposed into bytes) (GitHub)
- normalizer_spec.name (e.g., nmt_nfkc, identity) (GitHub)
- piece types such as NORMAL, CONTROL, and BYTE (the BYTE pieces appear when byte_fallback is true) (GitHub)
So reading trainer_spec.model_type from the .model file is the ground-truth way to distinguish SentencePiece Unigram vs SentencePiece BPE. (GitHub)
Evidence for XLM-R (xlm-roberta-base) → Unigram
Paper evidence
The XLM-R paper explicitly says they tokenize raw text using SentencePiece with a unigram language model, and they use a 250K vocabulary.
What your verifier run shows
Your run printed:
- trainer_spec.model_type: 1 -> unigram
- vocab_size: 250000
- byte_fallback: False
- normalizer_spec.name: 'nmt_nfkc'
Those fields map directly to the SentencePiece schema (Unigram = 1). (GitHub)
Note on the confusing filename: the file is named sentencepiece.bpe.model in that repo, but filenames are not authoritative; the embedded trainer_spec.model_type is. (GitHub)
Evidence for LLaMA (huggyllama/llama-7b) → BPE + byte fallback
Paper evidence
The LLaMA paper states they tokenize using the byte-pair encoding (BPE) algorithm, using SentencePiece, and they “fallback to bytes to decompose unknown UTF-8 characters.”
HF docs evidence (implementation-facing)
Hugging Face’s LLaMA tokenizer docs describe it as Byte-Pair-Encoding and note it uses ByteFallback and no normalization. (Hugging Face)
Older HF docs also explicitly say: “The LLaMA tokenizer is a BPE model based on sentencepiece.” (Hugging Face)
What your verifier run shows
Your run printed:
- trainer_spec.model_type: 2 -> bpe
- trainer_spec.byte_fallback: True
- piece.type distribution included BYTE: 256
This matches the SentencePiece schema: BPE = 2, and byte_fallback causes BYTE pieces (typically 256 of them, <0x00> … <0xFF>) to appear. (GitHub)
Context: what “SentencePiece Unigram” vs “SentencePiece BPE” means
SentencePiece supports both BPE and Unigram LM tokenization. (GitHub)
- Unigram LM: chooses a segmentation by optimizing a probabilistic model over a vocabulary.
- BPE: uses merge rules learned during training to build subword units.
For these two models:
- XLM-R: Unigram + large 250K vocab + NFKC-style normalization (nmt_nfkc).
- LLaMA 7B: BPE + byte fallback + “identity”/no-normalization behavior.
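To make those differences tangible, here is a hedged side-by-side sketch (requires transformers, plus sentencepiece/protobuf for any slow-to-fast conversion; the printed tokens depend on each vocabulary and are not asserted):

```python
# Hedged sketch: tokenize the same string with both tokenizers to see normalization
# and byte-fallback differences. Output is illustrative only.
from transformers import AutoTokenizer

text = "ﬁancée 🦙"  # ligature + accent + emoji
for repo in ("FacebookAI/xlm-roberta-base", "huggyllama/llama-7b"):
    tok = AutoTokenizer.from_pretrained(repo)
    print(repo, tok.tokenize(text))
    # Expectation (not asserted): XLM-R's nmt_nfkc normalization folds the ligature and
    # may map the emoji to <unk> if it is not in the 250K vocab; LLaMA keeps the text
    # as-is (identity normalization) and byte-falls-back on the emoji.
```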
Bottom line
- XLM-R (xlm-roberta-base): SentencePiece Unigram LM (model_type=1).
- LLaMA (huggyllama/llama-7b): SentencePiece BPE (model_type=2) with byte fallback.
#!/usr/bin/env python3
"""
SentencePiece model-type verifier (Unigram vs BPE) for Hugging Face repos.
CPU/GPU safety:
- Does NOT load any model weights (no torch import), only downloads the tokenizer *.model (typically KB–MB).
Deps:
pip install -U "huggingface_hub>=0.20" "sentencepiece>=0.1.99" "protobuf>=4.21"
Reference URLs:
- SentencePiece ModelProto schema (defines trainer_spec.model_type enum UNIGRAM=1, BPE=2, etc.):
https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto
- hf_hub_download docs (download a single file from a repo):
https://huggingface.co/docs/huggingface_hub/en/package_reference/file_download
"""
from __future__ import annotations
from collections import Counter
from pathlib import Path
from huggingface_hub import hf_hub_download, list_repo_files
from huggingface_hub.utils import EntryNotFoundError, GatedRepoError, RepositoryNotFoundError
# sentencepiece ships the protobuf bindings used to parse *.model into ModelProto
try:
import sentencepiece.sentencepiece_model_pb2 as sp_pb2 # type: ignore
except Exception:
# fallback (some environments expose it differently)
from sentencepiece import sentencepiece_model_pb2 as sp_pb2 # type: ignore
# Model types per sentencepiece_model.proto
MODEL_TYPE = {1: "unigram", 2: "bpe", 3: "word", 4: "char"}
# Piece types per sentencepiece_model.proto
PIECE_TYPE = {1: "NORMAL", 2: "UNKNOWN", 3: "CONTROL", 4: "USER_DEFINED", 5: "UNUSED", 6: "BYTE"}
def download_spm(repo_id: str, revision: str | None = None) -> Path:
"""
Download the SentencePiece *.model file from a repo (tries common filenames first,
otherwise finds any *.model in repo listing). Returns local cached path.
"""
candidates = (
"tokenizer.model",
"sentencepiece.model",
"spiece.model",
"sentencepiece.bpe.model",
)
for name in candidates:
try:
return Path(hf_hub_download(repo_id=repo_id, filename=name, revision=revision))
except EntryNotFoundError:
pass
# Fallback: list files and pick a *.model
files = list_repo_files(repo_id=repo_id, revision=revision)
model_files = [f for f in files if f.endswith(".model")]
if not model_files:
raise FileNotFoundError(f"No *.model tokenizer file found in repo {repo_id}")
# Prefer likely names if present
for pref in candidates:
if pref in model_files:
return Path(hf_hub_download(repo_id=repo_id, filename=pref, revision=revision))
# Otherwise just take the first *.model
return Path(hf_hub_download(repo_id=repo_id, filename=model_files[0], revision=revision))
def inspect_spm(model_path: Path) -> dict:
"""Parse ModelProto and return a compact summary."""
mp = sp_pb2.ModelProto()
mp.ParseFromString(model_path.read_bytes())
model_type_id = int(getattr(mp.trainer_spec, "model_type", 0))
model_type_name = MODEL_TYPE.get(model_type_id, f"unknown({model_type_id})")
byte_fallback = bool(getattr(mp.trainer_spec, "byte_fallback", False))
vocab_size = len(mp.pieces)
# Count piece.type values (NORMAL, CONTROL, BYTE, ...)
type_counts = Counter(int(p.type) for p in mp.pieces)
type_counts_named = {PIECE_TYPE.get(k, str(k)): v for k, v in sorted(type_counts.items())}
normalizer_name = getattr(mp.normalizer_spec, "name", "")
return {
"model_path": str(model_path),
"model_type_id": model_type_id,
"model_type": model_type_name,
"byte_fallback": byte_fallback,
"vocab_size": vocab_size,
"piece_type_counts": type_counts_named,
"normalizer": normalizer_name,
}
def verify_repo(repo_id: str, revision: str | None = None) -> None:
"""Download tokenizer *.model, parse it, and print key fields."""
print(f"Repo: {repo_id}")
try:
model_path = download_spm(repo_id, revision=revision)
info = inspect_spm(model_path)
print(f" tokenizer file: {info['model_path']}")
print(f" trainer_spec.model_type: {info['model_type_id']} -> {info['model_type']}")
print(f" trainer_spec.byte_fallback: {info['byte_fallback']}")
print(f" vocab_size (len(pieces)): {info['vocab_size']}")
print(f" normalizer_spec.name: {info['normalizer']!r}")
print(f" piece.type distribution: {info['piece_type_counts']}")
except GatedRepoError:
print(" ERROR: repo is gated. Set HF_TOKEN or accept the repo terms on Hugging Face.")
except RepositoryNotFoundError:
print(" ERROR: repo not found (or private without access).")
except Exception as e:
print(f" ERROR: {type(e).__name__}: {e}")
print()
def main() -> None:
# Public examples discussed:
verify_repo("FacebookAI/xlm-roberta-base")
verify_repo("huggyllama/llama-7b")
# Optional: verify any local SentencePiece model file:
# local = Path("/path/to/tokenizer.model")
# print(inspect_spm(local))
if __name__ == "__main__":
main()
"""
Repo: FacebookAI/xlm-roberta-base
tokenizer file: /root/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/sentencepiece.bpe.model
trainer_spec.model_type: 1 -> unigram
trainer_spec.byte_fallback: False
vocab_size (len(pieces)): 250000
normalizer_spec.name: 'nmt_nfkc'
piece.type distribution: {'NORMAL': 249997, 'UNKNOWN': 1, 'CONTROL': 2}
Repo: huggyllama/llama-7b
tokenizer file: /root/.cache/huggingface/hub/models--huggyllama--llama-7b/snapshots/4782ad278652c7c71b72204d462d6d01eaaf7549/tokenizer.model
trainer_spec.model_type: 2 -> bpe
trainer_spec.byte_fallback: True
vocab_size (len(pieces)): 32000
normalizer_spec.name: 'identity'
piece.type distribution: {'NORMAL': 31741, 'UNKNOWN': 1, 'CONTROL': 2, 'BYTE': 256}
"""