Model Card: Tajik POS Tagger – XLM-RoBERTa + LoRA
This model is a fine-tuned version of xlm-roberta-large with LoRA adapters for Tajik part-of-speech tagging. It achieves 61.6% accuracy and 56.3% weighted F1 on the test set.
Model Details
Model Description
- Developed by: Mullosharaf K. Arabov
- Funded by: Kazan Federal University (self-funded research)
- Model type: Transformer-based token classification with LoRA adapters
- Language(s) (NLP): Tajik (Cyrillic)
- License: apache-2.0
- Finetuned from model: xlm-roberta-large
Model Sources [optional]
- Repository: Not yet available
- Paper: To be published
- Demo: Not yet
Uses
Direct Use
This model can be used for automatic part-of-speech tagging of Tajik text. It takes a sentence as input and outputs a sequence of POS tags corresponding to each token.
Example:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("TajikNLPWorld/tajik-pos-xlm-roberta-lora")
tokenizer = AutoTokenizer.from_pretrained("TajikNLPWorld/tajik-pos-xlm-roberta-lora")

text = "Ин ҷумла барои санҷиш аст."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=-1)

# Convert prediction IDs to tag names. Note that the output is per subword
# token and includes special tokens such as <s> and </s>.
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
pred_tags = [id2label[p.item()] for p in predictions[0]]
print(list(zip(tokens, pred_tags)))
```
Downstream Use
The model can be integrated into larger NLP pipelines for Tajik, such as machine translation, information extraction, or syntactic analysis.
Out-of-Scope Use
The model is trained on literary examples from the TajPersParallelLexicalCorpus. It may not perform well on highly domain‑specific texts (e.g., social media, legal documents) without further adaptation.
Bias, Risks, and Limitations
- Data imbalance: The training data has a strong class imbalance (most frequent POS are nouns and adjectives). As a result, rare POS tags (e.g., interjections, conjunctions) have very low F1 scores.
- Out-of-vocabulary: The model may not handle unseen words or subword splits perfectly.
- Limited domain: The model was trained on examples from classical and modern literature; performance on other genres may degrade.
Recommendations
Users should evaluate the model on their own data before deployment. For applications where rare POS tags are critical, consider collecting more annotated examples or applying data augmentation.
How to Get Started with the Model
Use the code snippet above. For batch processing, you can use the Hugging Face pipeline, which also accepts a list of sentences:

```python
from transformers import pipeline

pipe = pipeline("token-classification", model="TajikNLPWorld/tajik-pos-xlm-roberta-lora")
result = pipe("Ин ҷумла барои санҷиш аст.")
print(result)
```
Training Details
Training Data
The model was fine‑tuned on the training split of the TajPersParallelLexicalCorpus (34,529 sentences). The sentences were extracted from the examples field of the dataset. Words not found in the POS dictionary were ignored (label -100) during training.
Training Procedure
Preprocessing
- Sentences were tokenized using the XLM-RoBERTa tokenizer.
- Word‑level labels were aligned to subword tokens; only the first subword of each word receives the label, the rest are ignored (-100).
- Maximum sequence length: 128 tokens.
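The label-alignment rule above can be sketched as follows. This is a minimal illustration, not the exact training code: `word_ids` mimics the list returned by a fast tokenizer's `BatchEncoding.word_ids()` (`None` for special tokens, otherwise the index of the source word).

```python
# Only the first subword of each word keeps the word-level label;
# continuation subwords and special tokens get -100 so the loss ignores them.
IGNORE = -100

def align_labels(word_labels, word_ids):
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:            # special token (<s>, </s>, padding)
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword
            aligned.append(IGNORE)
        prev = wid
    return aligned

# Example: 3 words, with the second word split into two subwords.
print(align_labels([5, 2, 7], [None, 0, 1, 1, 2, None]))
# -> [-100, 5, 2, -100, 7, -100]
```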
Training Hyperparameters
- Batch size: 16
- Learning rate: 2e-5
- Optimizer: AdamW
- Scheduler: Linear (no warmup steps)
- Epochs: 3
- LoRA configuration: rank r=8, lora_alpha=16, target modules: ["query", "value"]
- Training regime: fp32
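The LoRA setup listed above can be expressed as a PEFT configuration. This is a hedged sketch, not the exact training script; `NUM_TAGS` is a placeholder for the size of the POS tagset.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

NUM_TAGS = 17  # placeholder: size of the POS tagset

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=8,                                # LoRA rank
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # self-attention projections
)

base = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=NUM_TAGS
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only adapters + classifier head train
```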
Speeds, Sizes, Times
- Hardware: NVIDIA GPU (CUDA)
- Training time: Approximately 4 hours for 3 epochs on a single GPU; the large base model accounts for most of this time
- Model size: ~2.24 GB (base model + LoRA adapters)
Evaluation
Testing Data, Factors & Metrics
Testing Data
The test set consists of 4,317 sentences held out from the original dataset (10% of the data).
Factors
- Domains: Literary texts (classical and modern Tajik literature)
- Token‑level evaluation: Only non‑ignored tokens (i.e., those with a gold label) are considered.
Metrics
- Accuracy: Percentage of correctly predicted tokens.
- Macro F1: Unweighted average of F1 over all POS tags.
- Weighted F1: Average weighted by the number of true instances per tag.
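The gap between these metrics under class imbalance can be seen with a small pure-Python example (toy labels, not the real tagset): when a tagger over-predicts the majority class, accuracy and weighted F1 stay moderate while macro F1 collapses.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    # Standard per-class F1 from true/false positives and false negatives.
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

y_true = ["NOUN"] * 8 + ["ADJ"] + ["INTJ"]
y_pred = ["NOUN"] * 10  # tagger predicts the majority class everywhere

labels = sorted(set(y_true))
f1s = {c: f1_per_class(y_true, y_pred, c) for c in labels}
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = sum(f1s.values()) / len(labels)                              # unweighted mean
weighted = sum(f1s[c] * support[c] for c in labels) / len(y_true)    # support-weighted

print(round(accuracy, 3), round(macro, 3), round(weighted, 3))
# -> 0.8 0.296 0.711
```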
Results
| Metric | Value |
|---|---|
| Accuracy | 0.616 |
| F1 Macro | 0.098 |
| F1 Weighted | 0.563 |
Per‑class F1 scores are as follows (only tags with non‑zero support in test set):
| Tag (Tajik) | English | F1 |
|---|---|---|
| исм | noun | 0.730 |
| сифат | adjective | 0.451 |
All other tags have zero F1 due to extreme class imbalance (very few test examples). More details can be found in the paper (to be published).
Summary
The model achieves a weighted F1 of 56.3%, a baseline for what is, to our knowledge, the first Tajik POS tagger. The macro F1 is low because rare tags are not correctly predicted.
Model Examination [optional]
We analyzed the most common confusions: nouns and adjectives are frequently misclassified as one another, reflecting the flexible nature of these categories in Tajik.
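A confusion analysis of this kind can be reproduced by counting mismatched (gold, predicted) pairs. A minimal sketch with toy tag sequences:

```python
from collections import Counter

# Toy gold/predicted tag sequences standing in for real evaluation output.
gold = ["NOUN", "ADJ", "NOUN", "ADJ", "NOUN"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ", "ADJ"]

# Count each (gold, predicted) pair where the tagger was wrong.
confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)
print(confusions.most_common())
```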
Environmental Impact
- Hardware Type: NVIDIA T4 or similar (single GPU)
- Hours used: ~4 hours (training) + evaluation
- Cloud Provider: Local / on‑premises
- Compute Region: Not applicable
- Carbon Emitted: Negligible for this small‑scale experiment
Technical Specifications [optional]
Model Architecture and Objective
Base model: xlm-roberta-large (24 layers, 1024 hidden size, 16 heads). LoRA adapters are added to the query and value projection matrices in each self‑attention layer. The classification head is randomly initialized and trained jointly with the adapters.
Compute Infrastructure
Hardware
Single NVIDIA GPU (CUDA 11.8).
Software
- PyTorch 2.0+
- Transformers 4.35+
- PEFT (LoRA)
Citation [optional]
BibTeX:
@misc{tajik_pos_xlmr_2026,
title = {Tajik POS Tagger – XLM-RoBERTa + LoRA},
author = {Arabov, Mullosharaf Kurbonovich},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/TajikNLPWorld/tajik-pos-xlm-roberta-lora}
}
APA:
Arabov, M. K. (2026). Tajik POS Tagger – XLM-RoBERTa + LoRA [Model]. Hugging Face. https://huggingface.co/TajikNLPWorld/tajik-pos-xlm-roberta-lora
Glossary [optional]
- POS tagging: Part‑of‑speech tagging – assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence.
- LoRA: Low‑Rank Adaptation – a parameter‑efficient fine‑tuning method that adds trainable rank‑decomposition matrices to certain layers.
More Information [optional]
The dataset used for training and evaluation is available at TajPersParallelLexicalCorpus. The complete experimental code will be released upon publication.
Model Card Authors [optional]
Mullosharaf K. Arabov
Model Card Contact
For questions or collaborations, please contact TajikNLPWorld or open an issue on the repository.