Logo of University of Zurich and Lia Rumantscha

Romansh NLLB-200 1.3B CT2 Model Card

Romansh NLLB is a multilingual NMT model finetuned from the NLLB-200-Distilled 1.3B model (https://huggingface.co/facebook/nllb-200-distilled-1.3B). It translates between German and six Romansh varieties, which makes it the first model to generate fluent translations in each of those individual Romansh varieties.

The finetuning data consist of monolingual data in all Romansh varieties, which we back-translated into German using Gemini 2.5 Flash, and a smaller amount of natively parallel data. A more detailed overview of the data is given in the Training Data section below.

The model has been extended with six language tokens, one for each Romansh variety. The exact usage of the model is described below.

How to use

This section shows how to translate a sentence from German into Romansh.

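Before using the model, make sure that ctranslate2 is installed. This library allows for efficient inference, and the model has already been converted to the ctranslate2 format. The example below also imports huggingface_hub and transformers, so these need to be installed as well.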

Installation:

pip install ctranslate2

Further information on ctranslate2 can be found under this link: https://github.com/OpenNMT/CTranslate2
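
A quick way to confirm that the packages imported by the usage example are present is to look them up before running anything. The following is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of any of these libraries.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names imported by the usage example in this card.
REQUIRED = ["ctranslate2", "huggingface_hub", "transformers"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```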


Follow the steps below to translate a sentence.

  1. Load the model and tokenizer from the repository.
import ctranslate2
from huggingface_hub import snapshot_download
from transformers import NllbTokenizer

# Download the model from the HF hub. The model is already converted to CT2 format
model_path = snapshot_download(repo_id="ZurichNLP/romansh-nllb-1.3b-ct2")

translator = ctranslate2.Translator(model_path, device="cpu")
tokenizer = NllbTokenizer.from_pretrained("ZurichNLP/romansh-nllb-1.3b-ct2")
  2. Prepare the model inputs.
source_lang_code = "deu_Latn" # German
target_lang_code = "rm-puter" # Puter

source_sentence = "Gestern hat es geschneit."

tokenizer.set_src_lang_special_tokens(source_lang_code)

source_token_ids = tokenizer.encode(source_sentence, add_special_tokens=True)
source_tokens = tokenizer.convert_ids_to_tokens(source_token_ids)
  3. Run the model.
results = translator.translate_batch(
    [source_tokens],
    target_prefix=[[target_lang_code]],
    beam_size=4,
)
  4. Postprocess the translation.
result = results[0]
target_tokens = result.hypotheses[0]
translation = tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens), skip_special_tokens=True)
translation
# Her ho que navieu.
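
The four steps above can be folded into one reusable function that handles several sentences at once. This is a sketch rather than an official API: the function name `translate_sentences` is hypothetical, and it only uses the tokenizer and translator calls that already appear in the steps above. It expects the `translator` and `tokenizer` objects created in step 1.

```python
from typing import List


def translate_sentences(translator, tokenizer, sentences: List[str],
                        source_lang_code: str, target_lang_code: str,
                        beam_size: int = 4) -> List[str]:
    """Translate a batch of sentences with the same calls as the steps above."""
    # Tell the tokenizer which language tag to attach to the source text.
    tokenizer.set_src_lang_special_tokens(source_lang_code)

    # CTranslate2 works on token strings rather than token ids.
    batch = [
        tokenizer.convert_ids_to_tokens(
            tokenizer.encode(sentence, add_special_tokens=True)
        )
        for sentence in sentences
    ]

    # One target prefix per input sentence selects the output language.
    results = translator.translate_batch(
        batch,
        target_prefix=[[target_lang_code]] * len(batch),
        beam_size=beam_size,
    )

    # Keep the best hypothesis and drop special tokens such as the language tag.
    return [
        tokenizer.decode(
            tokenizer.convert_tokens_to_ids(result.hypotheses[0]),
            skip_special_tokens=True,
        )
        for result in results
    ]
```

With the objects from step 1, a call such as `translate_sentences(translator, tokenizer, ["Gestern hat es geschneit."], "deu_Latn", "rm-puter")` mirrors the single-sentence example. Swapping the two language codes should give the Romansh to German direction, assuming the tokenizer accepts the Romansh codes as source tags.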

The six varieties and their respective language codes are listed in the table below.

| Romansh variety | Language code |
|---|---|
| Rumantsch Grischun | rm-rumgr |
| Sursilvan | rm-sursilv |
| Sutsilvan | rm-sutsilv |
| Surmiran | rm-surmiran |
| Puter | rm-puter |
| Vallader | rm-vallader |
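
When the target variety is chosen at runtime, it can help to keep the codes from the table in one place and fail early on a misspelled name. A minimal sketch follows; the names `ROMANSH_LANG_CODES`, `GERMAN_LANG_CODE` and `lang_code` are hypothetical, while the codes themselves are taken from the table and the usage example.

```python
# Language codes as listed in the table above.
ROMANSH_LANG_CODES = {
    "Rumantsch Grischun": "rm-rumgr",
    "Sursilvan": "rm-sursilv",
    "Sutsilvan": "rm-sutsilv",
    "Surmiran": "rm-surmiran",
    "Puter": "rm-puter",
    "Vallader": "rm-vallader",
}

# German code as used in the usage example.
GERMAN_LANG_CODE = "deu_Latn"


def lang_code(variety: str) -> str:
    """Return the language code for a Romansh variety, or raise ValueError."""
    try:
        return ROMANSH_LANG_CODES[variety]
    except KeyError:
        known = ", ".join(sorted(ROMANSH_LANG_CODES))
        raise ValueError(
            f"Unknown Romansh variety {variety!r}; expected one of: {known}"
        ) from None
```

The returned value can be used directly as `target_lang_code` in the usage example.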

Training Data

Parallel data

| Name | Romansh varieties | URL | RM tokens | DE tokens |
|---|---|---|---|---|
| Dictionary data | Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | https://pledarigrond.ch/rumantschgrischun | 3,298,051 | 6,530,231 |
| Mediomatix | Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | https://huggingface.co/datasets/ZurichNLP/mediomatix | 6,946,947 | - |
| Press releases of Canton Grisons | Rumantsch Grischun | https://www.gr.ch/RM/medias/communicaziuns/MMStaka/Seiten/AktuelleMeldungen.aspx | 1,840,537 | 1,397,196 |
| SwissLawTranslations | Rumantsch Grischun | https://huggingface.co/datasets/joelniklaus/SwissLawTranslations | 607,517 | 494,980 |
| Parallel data contributed by RTR | Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | - | 646,856 | - |
| Storyweaver | Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | - | 48,475 | 9,170 |
| Total | | | 13,926,557 | 8,833,624 |

Monolingual data

| Name | Romansh varieties | URL | Tokens |
|---|---|---|---|
| FineWeb2 | Rumantsch Grischun | https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 | 48,340,896 |
| La Quotidiana (1997–2008, 2021–2025) | Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | https://huggingface.co/datasets/ZurichNLP/quotidiana | 38,993,608 |
| FinePDFs | Rumantsch Grischun | https://huggingface.co/datasets/HuggingFaceFW/finepdfs | 18,856,327 |
| Mediomatix (unaligned) | Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | https://huggingface.co/datasets/ZurichNLP/mediomatix-raw | 4,639,563 |
| FineWiki | Rumantsch Grischun | https://huggingface.co/datasets/HuggingFaceFW/finewiki | 2,827,954 |
| Audio transcriptions by RTR | Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | https://developer.srgssr.ch/en/apis/rtr-linguistic | 1,467,957 |
| Theater plays | Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | https://huggingface.co/datasets/ZurichNLP/romansh_theater_plays | 1,079,943 |
| Municipal documents | Sursilvan, Sutsilvan, Surmiran, Vallader | https://huggingface.co/datasets/ZurichNLP/romansh-municipal-text-corpus | 318,308 |
| Historical Dictionary of Switzerland | Rumantsch Grischun | https://hls-dhs-dss.ch/rm/ | 234,715 |
| Revista digl noss Sulom | Surmiran | - | 187,247 |
| Babulins | Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader | - | 49,444 |
| Total | | | 116,995,962 |
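
To see how the monolingual token budget is distributed across sources, the counts from the table can be turned into shares of the total. This is a small illustrative sketch; the variable names are hypothetical and the numbers are copied from the table above.

```python
# Token counts per monolingual source, copied from the table above.
MONOLINGUAL_TOKENS = {
    "FineWeb2": 48_340_896,
    "La Quotidiana": 38_993_608,
    "FinePDFs": 18_856_327,
    "Mediomatix (unaligned)": 4_639_563,
    "FineWiki": 2_827_954,
    "Audio transcriptions by RTR": 1_467_957,
    "Theater plays": 1_079_943,
    "Municipal documents": 318_308,
    "Historical Dictionary of Switzerland": 234_715,
    "Revista digl noss Sulom": 187_247,
    "Babulins": 49_444,
}

total_tokens = sum(MONOLINGUAL_TOKENS.values())
shares = {name: count / total_tokens for name, count in MONOLINGUAL_TOKENS.items()}

# Print the sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:40s} {share:6.1%}")
print(f"{'Total':40s} {total_tokens:,}")
```

The computed sum matches the total of 116,995,962 tokens reported in the table.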

Evaluation Results

German to Romansh BLEU:

| System | Rumantsch Grischun | Sursilvan | Sutsilvan | Surmiran | Puter | Vallader |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | 40.3 | 28.0 | 10.6 | 16.9 | 22.0 | 27.3 |
| Gemini 3 Flash (preview) | 42.1 | 32.8 | 12.7 | 21.3 | 26.4 | 29.8 |
| Gemini 3 Pro (preview) | 45.3 | 37.1 | 17.3 | 27.1 | 34.7 | 36.1 |
| Romansh NLLB-1.3b-ct2 | 48.4 | 44.5 | 40.5 | 43.0 | 44.9 | 44.6 |
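
The size of the gap to the strongest baseline differs considerably between varieties. The sketch below computes, for each variety, the BLEU margin of Romansh NLLB over the best of the three Gemini systems, using only the German to Romansh scores above. The names `BLEU_DE_TO_RM` and `margin_over_best_baseline` are hypothetical.

```python
from typing import Dict, List

VARIETIES = ["Rumantsch Grischun", "Sursilvan", "Sutsilvan",
             "Surmiran", "Puter", "Vallader"]

# German to Romansh BLEU scores, copied from the table above.
BLEU_DE_TO_RM: Dict[str, List[float]] = {
    "Gemini 2.5 Flash": [40.3, 28.0, 10.6, 16.9, 22.0, 27.3],
    "Gemini 3 Flash (preview)": [42.1, 32.8, 12.7, 21.3, 26.4, 29.8],
    "Gemini 3 Pro (preview)": [45.3, 37.1, 17.3, 27.1, 34.7, 36.1],
    "Romansh NLLB-1.3b-ct2": [48.4, 44.5, 40.5, 43.0, 44.9, 44.6],
}


def margin_over_best_baseline(scores: Dict[str, List[float]],
                              system: str) -> List[float]:
    """Per-column difference between `system` and the best other system."""
    baselines = [row for name, row in scores.items() if name != system]
    best = [max(column) for column in zip(*baselines)]
    return [round(own - other, 1) for own, other in zip(scores[system], best)]


margins = dict(zip(
    VARIETIES,
    margin_over_best_baseline(BLEU_DE_TO_RM, "Romansh NLLB-1.3b-ct2"),
))
for variety, margin in margins.items():
    print(f"{variety:20s} {margin:+.1f} BLEU")
```

On these figures the margin is smallest for Rumantsch Grischun (3.1 BLEU) and largest for Sutsilvan (23.2 BLEU).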

Romansh to German BLEU:

| System | Rumantsch Grischun | Sursilvan | Sutsilvan | Surmiran | Puter | Vallader |
|---|---|---|---|---|---|---|
| Gemini 3 Flash (preview) | 55.2 | 50.1 | 47.4 | 52.2 | 53.0 | 60.7 |
| Gemini 3 Pro (preview) | 55.8 | 50.8 | 48.7 | 52.2 | 53.8 | 60.8 |
| Romansh NLLB-1.3b-ct2 | 49.5 | 45.3 | 43.2 | 46.1 | 48.1 | 53.9 |

Romansh to German COMET:

| System | Rumantsch Grischun | Sursilvan | Sutsilvan | Surmiran | Puter | Vallader |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | 93.7 | 93.1 | 89.8 | 91.7 | 92.3 | 92.7 |
| Gemini 3 Flash (preview) | 93.9 | 94.0 | 92.7 | 93.4 | 92.8 | 93.9 |
| Gemini 3 Pro (preview) | 93.8 | 93.8 | 92.5 | 93.4 | 93.0 | 93.8 |
| Romansh NLLB-1.3b-ct2 | 91.2 | 90.9 | 89.2 | 89.6 | 89.9 | 89.9 |

Acknowledgements

We thank RTR and Fundaziun Patrimoni Cultural RTR for their support. We are grateful to Zachary Hopton, Diana Merkle, Anna Rutkiewicz and Sudehsna Sivakumar for help with data curation, Uniun dals Grischs for contributing dictionary data for Puter and Vallader, and Giuanna Caviezel, Not Soliva and their seminar participants for helpful feedback. We also acknowledge the contribution of the native speakers who participated in the human evaluation study. For this publication, use was made of media data made available via Swissdox@LiRI by the Linguistic Research Infrastructure of the University of Zurich (see https://www.liri.uzh.ch/en/services/swissdox.html for more information).

License

CC-BY-NC-4.0
