XLM-R vs llama-7b tokenization

I can’t seem to find anywhere the exact type of tokenization used by the models xlm-roberta-base and huggyllama/llama-7b. I know they use SentencePiece, but that could mean SentencePiece BPE or SentencePiece Unigram. Do they fall into either of those two categories, or do they use some other SentencePiece-y algorithm that can’t be named as one of those?
Please help!


hmm…?


They do fall cleanly into the two main SentencePiece categories.

  • xlm-roberta-base (XLM-R) uses SentencePiece Unigram (Unigram LM).
  • huggyllama/llama-7b (LLaMA 7B family) uses SentencePiece BPE (Byte-Pair Encoding), with byte fallback behavior in common implementations. (Hugging Face)

No “third unnamed SentencePiece-y algorithm” is needed to explain either model.


1) Background: “SentencePiece” is a library, not one algorithm

SentencePiece is a tokenizer toolkit that can train and apply different subword model types. In particular, it implements:

  • BPE (merge-based)
  • Unigram LM (probabilistic vocab + best segmentation)

The SentencePiece project states it “implements subword units” including BPE and unigram language model. (GitHub)
The official options list shows --model_type can be unigram, bpe, word, or char. (GitHub)
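For concreteness, here is a small sketch (not from the thread) of how the model type is chosen at training time via SentencePiece's Python API; the file names are hypothetical:

import sentencepiece as spm

# Trains a SentencePiece model; model_type selects the algorithm
# ("unigram", "bpe", "word", or "char").
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # hypothetical training corpus
    model_prefix="my_sp",     # writes my_sp.model / my_sp.vocab
    vocab_size=8000,
    model_type="unigram",
)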

So the correct question is exactly what you asked: “is this SentencePiece model Unigram or BPE?”


2) The two algorithms in plain English

SentencePiece Unigram (Unigram LM)

Think “pick the best segmentation from a fixed vocabulary.”

  • Training builds a vocabulary and assigns each token a probability.
  • Tokenization finds the most likely sequence of tokens that composes the input (often via a Viterbi-style dynamic program).
  • Practical effect: it can choose among multiple valid segmentations by likelihood.

SentencePiece explicitly names this “Unigram language model” and exposes it as --model_type=unigram. (GitHub)
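As a quick illustration of “choose among multiple segmentations by likelihood,” here is a sketch assuming you have a local copy of a Unigram .model file (e.g., XLM-R’s sentencepiece.bpe.model downloaded from the repo); n-best decoding like this only makes sense for Unigram models:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
# Print the 3 most likely segmentations of the same input string.
for pieces in sp.NBestEncodeAsPieces("tokenization", 3):
    print(pieces)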

SentencePiece BPE

Think “start with characters then repeatedly merge the most useful pairs.”

  • Training starts from smaller symbols and learns merge rules.
  • Tokenization applies those merges (greedily, in practice).

SentencePiece exposes this as --model_type=bpe. (GitHub)

Important nuance for LLaMA-style BPE

Many LLaMA-family tokenizers are described as BPE with byte fallback. That means “if a Unicode character is out of vocabulary coverage, fall back to encoding it as bytes so you do not emit <unk>.” Hugging Face’s LLaMA tokenizer docs call out ByteFallback explicitly. (Hugging Face)
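A sketch of what ByteFallback looks like in practice (requires transformers + sentencepiece; the exact byte pieces depend on the input and the vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)
# A character that is not covered by the 32K vocab is decomposed into <0xNN>
# byte pieces instead of being mapped to <unk>.
print(tok.tokenize("🦙"))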


3) XLM-R (xlm-roberta-base) tokenization type

The XLM-R paper is explicit:

  • “We apply subword tokenization directly on raw text data using SentencePiece … with a unigram language model.”
  • It also notes a 250K vocabulary size for XLM-R (a key design choice for multilingual coverage).

So XLM-R is not “SentencePiece maybe-BPE.” It is SentencePiece Unigram.

Why it still confuses people

Two common sources of confusion:

  1. Misleading filenames / artifacts
    People run into files named like sentencepiece.bpe.model for XLM-R checkpoints and assume “must be BPE,” then hit odd training or loading issues. This exact confusion shows up in community threads. (PyTorch Forums)

  2. “Fast tokenizer” documentation blur
    Hugging Face’s XLM-R docs describe the slow tokenizer as SentencePiece-based, and the fast tokenizer section uses wording like “Based on BPE,” which can be read as “the model is BPE.” That wording is a known source of misunderstanding because the underlying SentencePiece model for XLM-R is Unigram per the paper. (Hugging Face)

Bottom line: treat the paper and/or the SentencePiece model’s internal model_type as authoritative.


4) LLaMA 7B (huggyllama/llama-7b) tokenization type

For LLaMA-family models, the standard tokenizer is SentencePiece BPE.

Two strong anchors:

  1. Hugging Face model docs (implementation-facing)
  • “The tokenizer is a byte-pair encoding model based on SentencePiece.” (Hugging Face)
  • The tokenizer class documentation also states: “Based on byte-level Byte-Pair-Encoding. This uses notably ByteFallback and no normalization.” (Hugging Face)
  2. LLaMA paper (training-facing)
  • The LLaMA paper states they tokenize with the BPE algorithm, using the SentencePiece implementation. (arXiv)

So for your specific model ID (huggyllama/llama-7b): it is in the LLaMA 1 lineage and uses the same tokenizer design, SentencePiece BPE (with the byte fallback behavior commonly referenced in implementations). (Hugging Face)


5) The most reliable way to verify: read model_type from the .model itself

If you want an answer that is immune to:

  • confusing filenames,
  • wrapper docs,
  • “fast vs slow” implementation details,

then inspect the SentencePiece ModelProto inside the tokenizer model file.

SentencePiece stores a model_type enum in sentencepiece_model.proto (GitHub):

  • UNIGRAM = 1
  • BPE = 2

And the options doc confirms the conceptual mapping (model_type: unigram, bpe, word, char). (GitHub)

Practical approach (Python)

Conceptually:

  1. download the tokenizer model file (often sentencepiece.bpe.model or tokenizer.model)
  2. parse it as a SentencePiece ModelProto
  3. read trainer_spec.model_type

You can do this with the protobuf definition and a small script. The key is: the proto field is the truth, regardless of filenames. (GitHub)
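A minimal sketch, assuming you have already downloaded the .model file locally (the filename "tokenizer.model" here is just a placeholder):

import sentencepiece.sentencepiece_model_pb2 as sp_pb2

proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

# Enum values per sentencepiece_model.proto: 1 = UNIGRAM, 2 = BPE, 3 = WORD, 4 = CHAR
print(proto.trainer_spec.model_type)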


6) What “another SentencePiece-y algorithm” could mean (and why it does not apply here)

SentencePiece can also operate in:

  • word mode
  • char mode

But those are explicitly named, and neither XLM-R nor LLaMA are trained that way in the canonical descriptions. (GitHub)

So your choices for these two models really are the two you listed:

  • XLM-R: Unigram
  • LLaMA: BPE

7) Curated high-quality links (primary + practical)

  • Primary sources: the XLM-R and LLaMA papers
  • LLaMA implementation-facing docs (Hugging Face)
  • “Why is XLM-R’s file called sentencepiece.bpe.model?” confusion thread (PyTorch Forums)


Bullet summary

  • XLM-R (xlm-roberta-base): SentencePiece Unigram (Unigram LM). Paper states it directly.
  • LLaMA 7B (huggyllama/llama-7b): SentencePiece BPE, commonly with byte fallback behavior. HF docs state BPE and ByteFallback; paper states BPE + SentencePiece. (Hugging Face)
  • If you want “cannot be argued with” verification, parse the SentencePiece model file and read trainer_spec.model_type from the proto. (GitHub)

ok thank you, but do you know this for sure, or did you just get it from ChatGPT? Thanks in advance!


This is information I asked ChatGPT for, based on the information I had on hand and the search results I provided. So it’s somewhere in between…? I didn’t try it myself.

ok thank you, I need to dig a bit deeper though as this is quite important for me.


I tested the actual model’s behavior with code for now. If you need to dig deeper for some reason, I think it’s better to contact the author directly rather than using forums…


  • FacebookAI/xlm-roberta-base uses SentencePiece Unigram LM (not BPE).
  • huggyllama/llama-7b uses SentencePiece BPE, with byte fallback enabled (and “no normalization/identity” behavior in practice).

There is no “third, unnamed SentencePiece-y algorithm” involved here—these two models fall cleanly into Unigram vs BPE as defined by SentencePiece itself.


Why this is definitive

A SentencePiece tokenizer is stored as a serialized protobuf (*.model). That file contains:

  • trainer_spec.model_type with an enum:
    • UNIGRAM = 1
    • BPE = 2
    • (also WORD = 3, CHAR = 4) (GitHub)
  • trainer_spec.byte_fallback (whether unknown UTF-8 characters can be decomposed into bytes) (GitHub)
  • normalizer_spec.name (e.g., nmt_nfkc, identity) (GitHub)
  • piece types such as NORMAL, CONTROL, and BYTE (the BYTE pieces appear when byte_fallback is true) (GitHub)

So reading trainer_spec.model_type from the .model file is the ground-truth way to distinguish SentencePiece Unigram vs SentencePiece BPE. (GitHub)


Evidence for XLM-R (xlm-roberta-base) → Unigram

Paper evidence

The XLM-R paper explicitly says they tokenize raw text using SentencePiece with a unigram language model, and they use a 250K vocabulary.

What the verifier run shows

The run (script and output are at the end of this post) printed:

  • trainer_spec.model_type: 1 -> unigram
  • vocab_size: 250000
  • byte_fallback: False
  • normalizer_spec.name: 'nmt_nfkc'

Those fields map directly to the SentencePiece schema (Unigram = 1). (GitHub)

Note on the confusing filename: the file is named sentencepiece.bpe.model in that repo, but filenames are not authoritative; the embedded trainer_spec.model_type is. (GitHub)


Evidence for LLaMA (huggyllama/llama-7b) → BPE + byte fallback

Paper evidence

The LLaMA paper states they tokenize using the byte-pair encoding (BPE) algorithm, using SentencePiece, and they “fallback to bytes to decompose unknown UTF-8 characters.”

HF docs evidence (implementation-facing)

Hugging Face’s LLaMA tokenizer docs describe it as Byte-Pair-Encoding and note it uses ByteFallback and no normalization. (Hugging Face)
Older HF docs also explicitly say: “The LLaMA tokenizer is a BPE model based on sentencepiece.” (Hugging Face)

What the verifier run shows

The run printed:

  • trainer_spec.model_type: 2 -> bpe
  • trainer_spec.byte_fallback: True
  • piece.type distribution included BYTE: 256

This matches the SentencePiece schema: BPE = 2, and byte_fallback causes BYTE pieces (typically 256 of them, <0x00> … <0xFF>) to appear. (GitHub)


Context: what “SentencePiece Unigram” vs “SentencePiece BPE” means

SentencePiece supports both BPE and Unigram LM tokenization. (GitHub)

  • Unigram LM: chooses a segmentation by optimizing a probabilistic model over a vocabulary.
  • BPE: uses merge rules learned during training to build subword units.

For these two models:

  • XLM-R: Unigram + large 250K vocab + NFKC-style normalization (nmt_nfkc).
  • LLaMA 7B: BPE + byte fallback + “identity/no normalization” behavior.
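If you want a quick behavioral check on top of the proto inspection, a sketch like this (requires transformers, sentencepiece, and network access; the example string is arbitrary) shows how differently the two tokenizers segment the same text:

from transformers import AutoTokenizer

for repo in ("FacebookAI/xlm-roberta-base", "huggyllama/llama-7b"):
    tok = AutoTokenizer.from_pretrained(repo)
    # Same input, different subword algorithms and vocabularies.
    print(repo, tok.tokenize("tokenization algorithms"))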

Bottom line

  • XLM-R (xlm-roberta-base): SentencePiece Unigram LM (model_type=1).
  • LLaMA (huggyllama/llama-7b): SentencePiece BPE (model_type=2) with byte fallback.

#!/usr/bin/env python3
"""
SentencePiece model-type verifier (Unigram vs BPE) for Hugging Face repos.

CPU/GPU safety:
- Does NOT load any model weights (no torch import), only downloads the tokenizer *.model (typically KB–MB).
Deps:
  pip install -U "huggingface_hub>=0.20" "sentencepiece>=0.1.99" "protobuf>=4.21"

Reference URLs:
- SentencePiece ModelProto schema (defines trainer_spec.model_type enum UNIGRAM=1, BPE=2, etc.):
  https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto
- hf_hub_download docs (download a single file from a repo):
  https://huggingface.co/docs/huggingface_hub/en/package_reference/file_download
"""

from __future__ import annotations

from collections import Counter
from pathlib import Path

from huggingface_hub import hf_hub_download, list_repo_files
from huggingface_hub.utils import EntryNotFoundError, GatedRepoError, RepositoryNotFoundError

# sentencepiece ships the protobuf bindings used to parse *.model into ModelProto
try:
    import sentencepiece.sentencepiece_model_pb2 as sp_pb2  # type: ignore
except Exception:
    # fallback (some environments expose it differently)
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2  # type: ignore

# Model types per sentencepiece_model.proto
MODEL_TYPE = {1: "unigram", 2: "bpe", 3: "word", 4: "char"}

# Piece types per sentencepiece_model.proto
PIECE_TYPE = {1: "NORMAL", 2: "UNKNOWN", 3: "CONTROL", 4: "USER_DEFINED", 5: "UNUSED", 6: "BYTE"}


def download_spm(repo_id: str, revision: str | None = None) -> Path:
    """
    Download the SentencePiece *.model file from a repo (tries common filenames first,
    otherwise finds any *.model in repo listing). Returns local cached path.
    """
    candidates = (
        "tokenizer.model",
        "sentencepiece.model",
        "spiece.model",
        "sentencepiece.bpe.model",
    )

    for name in candidates:
        try:
            return Path(hf_hub_download(repo_id=repo_id, filename=name, revision=revision))
        except EntryNotFoundError:
            pass

    # Fallback: list files and pick a *.model
    files = list_repo_files(repo_id=repo_id, revision=revision)
    model_files = [f for f in files if f.endswith(".model")]
    if not model_files:
        raise FileNotFoundError(f"No *.model tokenizer file found in repo {repo_id}")

    # Prefer likely names if present
    for pref in candidates:
        if pref in model_files:
            return Path(hf_hub_download(repo_id=repo_id, filename=pref, revision=revision))

    # Otherwise just take the first *.model
    return Path(hf_hub_download(repo_id=repo_id, filename=model_files[0], revision=revision))


def inspect_spm(model_path: Path) -> dict:
    """Parse ModelProto and return a compact summary."""
    mp = sp_pb2.ModelProto()
    mp.ParseFromString(model_path.read_bytes())

    model_type_id = int(getattr(mp.trainer_spec, "model_type", 0))
    model_type_name = MODEL_TYPE.get(model_type_id, f"unknown({model_type_id})")

    byte_fallback = bool(getattr(mp.trainer_spec, "byte_fallback", False))
    vocab_size = len(mp.pieces)

    # Count piece.type values (NORMAL, CONTROL, BYTE, ...)
    type_counts = Counter(int(p.type) for p in mp.pieces)
    type_counts_named = {PIECE_TYPE.get(k, str(k)): v for k, v in sorted(type_counts.items())}

    normalizer_name = getattr(mp.normalizer_spec, "name", "")

    return {
        "model_path": str(model_path),
        "model_type_id": model_type_id,
        "model_type": model_type_name,
        "byte_fallback": byte_fallback,
        "vocab_size": vocab_size,
        "piece_type_counts": type_counts_named,
        "normalizer": normalizer_name,
    }


def verify_repo(repo_id: str, revision: str | None = None) -> None:
    """Download tokenizer *.model, parse it, and print key fields."""
    print(f"Repo: {repo_id}")
    try:
        model_path = download_spm(repo_id, revision=revision)
        info = inspect_spm(model_path)

        print(f"  tokenizer file: {info['model_path']}")
        print(f"  trainer_spec.model_type: {info['model_type_id']} -> {info['model_type']}")
        print(f"  trainer_spec.byte_fallback: {info['byte_fallback']}")
        print(f"  vocab_size (len(pieces)): {info['vocab_size']}")
        print(f"  normalizer_spec.name: {info['normalizer']!r}")
        print(f"  piece.type distribution: {info['piece_type_counts']}")
    except GatedRepoError:
        print("  ERROR: repo is gated. Set HF_TOKEN or accept the repo terms on Hugging Face.")
    except RepositoryNotFoundError:
        print("  ERROR: repo not found (or private without access).")
    except Exception as e:
        print(f"  ERROR: {type(e).__name__}: {e}")
    print()


def main() -> None:
    # Public examples discussed:
    verify_repo("FacebookAI/xlm-roberta-base")
    verify_repo("huggyllama/llama-7b")

    # Optional: verify any local SentencePiece model file:
    # local = Path("/path/to/tokenizer.model")
    # print(inspect_spm(local))


if __name__ == "__main__":
    main()

"""
Repo: FacebookAI/xlm-roberta-base
  tokenizer file: /root/.cache/huggingface/hub/models--FacebookAI--xlm-roberta-base/snapshots/e73636d4f797dec63c3081bb6ed5c7b0bb3f2089/sentencepiece.bpe.model
  trainer_spec.model_type: 1 -> unigram
  trainer_spec.byte_fallback: False
  vocab_size (len(pieces)): 250000
  normalizer_spec.name: 'nmt_nfkc'
  piece.type distribution: {'NORMAL': 249997, 'UNKNOWN': 1, 'CONTROL': 2}

Repo: huggyllama/llama-7b
  tokenizer file: /root/.cache/huggingface/hub/models--huggyllama--llama-7b/snapshots/4782ad278652c7c71b72204d462d6d01eaaf7549/tokenizer.model
  trainer_spec.model_type: 2 -> bpe
  trainer_spec.byte_fallback: True
  vocab_size (len(pieces)): 32000
  normalizer_spec.name: 'identity'
  piece.type distribution: {'NORMAL': 31741, 'UNKNOWN': 1, 'CONTROL': 2, 'BYTE': 256}
"""