Model Card: Tajik POS Tagger – XLM-RoBERTa + LoRA
This model is a fine-tuned version of xlm-roberta-large with LoRA adapters for Tajik part-of-speech tagging. It achieves 61.6% accuracy and 56.3% weighted F1 on the test set.
Model Details
Model Description
- Developed by: Mullosharaf K. Arabov
- Funded by: Kazan Federal University (self-funded research)
- Model type: Transformer-based token classification with LoRA adapters
- Language(s) (NLP): Tajik (Cyrillic)
- License: apache-2.0
- Finetuned from model: xlm-roberta-large
Model Sources [optional]
- Repository: Not yet available
- Paper: To be published
- Demo: Not yet
Uses
Direct Use
This model can be used for automatic part-of-speech tagging of Tajik text. It takes a sentence as input and outputs a sequence of POS tags corresponding to each token.
Example:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("TajikNLPWorld/tajik-pos-xlm-roberta-lora")
tokenizer = AutoTokenizer.from_pretrained("TajikNLPWorld/tajik-pos-xlm-roberta-lora")

text = "Ин ҷумла барои санҷиш аст."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=-1)

# Convert prediction IDs to tag names. Note that the output is per subword
# token and includes special tokens such as <s> and </s>.
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
pred_tags = [id2label[p.item()] for p in predictions[0]]
print(list(zip(tokens, pred_tags)))
```
Downstream Use
The model can be integrated into larger NLP pipelines for Tajik, such as machine translation, information extraction, or syntactic analysis.
Out-of-Scope Use
The model is trained on literary examples from the TajPersParallelLexicalCorpus. It may not perform well on highly domain‑specific texts (e.g., social media, legal documents) without further adaptation.
Bias, Risks, and Limitations
- Data imbalance: The training data has a strong class imbalance (most frequent POS are nouns and adjectives). As a result, rare POS tags (e.g., interjections, conjunctions) have very low F1 scores.
- Out-of-vocabulary: The model may not handle unseen words or subword splits perfectly.
- Limited domain: The model was trained on examples from classical and modern literature; performance on other genres may degrade.
Recommendations
Users should evaluate the model on their own data before deployment. For applications where rare POS tags are critical, consider collecting more annotated examples or applying data augmentation.
How to Get Started with the Model
Use the code snippet above. For batch processing, you can use the Hugging Face pipeline, which also accepts a list of sentences:

```python
from transformers import pipeline

pipe = pipeline("token-classification", model="TajikNLPWorld/tajik-pos-xlm-roberta-lora")
result = pipe("Ин ҷумла барои санҷиш аст.")
print(result)
```
Training Details
Training Data
The model was fine‑tuned on the training split of the TajPersParallelLexicalCorpus (34,529 sentences). The sentences were extracted from the examples field of the dataset. Words not found in the POS dictionary were ignored (label -100) during training.
Training Procedure
Preprocessing
- Sentences were tokenized using the XLM-RoBERTa tokenizer.
- Word‑level labels were aligned to subword tokens; only the first subword of each word receives the label, the rest are ignored (-100).
- Maximum sequence length: 128 tokens.
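The label-alignment rule above can be sketched as follows. This is a minimal illustration, not the exact training code: `word_ids` mimics the list returned by a fast tokenizer's `BatchEncoding.word_ids()` (`None` for special tokens, otherwise the index of the source word).

```python
# Only the first subword of each word keeps the word-level label;
# continuation subwords and special tokens get -100 so the loss ignores them.
IGNORE = -100

def align_labels(word_labels, word_ids):
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:            # special token (<s>, </s>, padding)
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword
            aligned.append(IGNORE)
        prev = wid
    return aligned

# Example: 3 words, with the second word split into two subwords.
print(align_labels([5, 2, 7], [None, 0, 1, 1, 2, None]))
# -> [-100, 5, 2, -100, 7, -100]
```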
Training Hyperparameters
- Batch size: 16
- Learning rate: 2e-5
- Optimizer: AdamW
- Scheduler: Linear (no warmup steps)
- Epochs: 3
- LoRA configuration: rank r=8, lora_alpha=16, target modules: ["query", "value"]
- Training regime: fp32
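The LoRA setup listed above can be expressed as a PEFT configuration. This is a hedged sketch, not the exact training script; `NUM_TAGS` is a placeholder for the size of the POS tagset.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

NUM_TAGS = 17  # placeholder: size of the POS tagset

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=8,                                # LoRA rank
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # self-attention projections
)

base = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=NUM_TAGS
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only adapters + classifier head train
```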
Speeds, Sizes, Times
- Hardware: NVIDIA GPU (CUDA)
- Training time: Approximately 4 hours for 3 epochs on a single GPU; the large base model accounts for most of this time
- Model size: ~2.24 GB (base model + LoRA adapters)
Evaluation
Testing Data, Factors & Metrics
Testing Data
The test set consists of 4,317 sentences held out from the original dataset (10% of the data).
Factors
- Domains: Literary texts (classical and modern Tajik literature)
- Token‑level evaluation: Only non‑ignored tokens (i.e., those with a gold label) are considered.
Metrics
- Accuracy: Percentage of correctly predicted tokens.
- Macro F1: Unweighted average of F1 over all POS tags.
- Weighted F1: Average weighted by the number of true instances per tag.
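The gap between these metrics under class imbalance can be seen with a small pure-Python example (toy labels, not the real tagset): when a tagger over-predicts the majority class, accuracy and weighted F1 stay moderate while macro F1 collapses.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    # Standard per-class F1 from true/false positives and false negatives.
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

y_true = ["NOUN"] * 8 + ["ADJ"] + ["INTJ"]
y_pred = ["NOUN"] * 10  # tagger predicts the majority class everywhere

labels = sorted(set(y_true))
f1s = {c: f1_per_class(y_true, y_pred, c) for c in labels}
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = sum(f1s.values()) / len(labels)                              # unweighted mean
weighted = sum(f1s[c] * support[c] for c in labels) / len(y_true)    # support-weighted

print(round(accuracy, 3), round(macro, 3), round(weighted, 3))
# -> 0.8 0.296 0.711
```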
Results
| Metric | Value |
|---|---|
| Accuracy | 0.616 |
| F1 Macro | 0.098 |
| F1 Weighted | 0.563 |
Per‑class F1 scores are as follows (only tags with non‑zero support in test set):
| Tag (Tajik) | English | F1 |
|---|---|---|
| исм | noun | 0.730 |
| сифат | adjective | 0.451 |
All other tags have zero F1 due to extreme class imbalance (very few test examples). More details can be found in the paper (to be published).
Summary
The model achieves a weighted F1 of 56.3%, a baseline for what is, to our knowledge, the first Tajik POS tagger. The macro F1 is low because rare tags are not correctly predicted.
Model Examination [optional]
We analyzed the most common confusions: nouns and adjectives are frequently misclassified as one another, reflecting the flexible nature of these categories in Tajik.
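A confusion analysis of this kind can be reproduced by counting mismatched (gold, predicted) pairs. A minimal sketch with toy tag sequences:

```python
from collections import Counter

# Toy gold/predicted tag sequences standing in for real evaluation output.
gold = ["NOUN", "ADJ", "NOUN", "ADJ", "NOUN"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ", "ADJ"]

# Count each (gold, predicted) pair where the tagger was wrong.
confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)
print(confusions.most_common())
```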
Environmental Impact
- Hardware Type: NVIDIA T4 or similar (single GPU)
- Hours used: ~4 hours (training) + evaluation
- Cloud Provider: Local / on‑premises
- Compute Region: Not applicable
- Carbon Emitted: Negligible for this small‑scale experiment
Technical Specifications [optional]
Model Architecture and Objective
Base model: xlm-roberta-large (24 layers, 1024 hidden size, 16 heads). LoRA adapters are added to the query and value projection matrices in each self‑attention layer. The classification head is randomly initialized and trained jointly with the adapters.
Compute Infrastructure
Hardware
Single NVIDIA GPU (CUDA 11.8).
Software
- PyTorch 2.0+
- Transformers 4.35+
- PEFT (LoRA)
Citation [optional]
BibTeX:
@misc{tajik_pos_xlmr_2026,
title = {Tajik POS Tagger – XLM-RoBERTa + LoRA},
author = {Arabov, Mullosharaf Kurbonovich},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/TajikNLPWorld/tajik-pos-xlm-roberta-lora}
}
APA:
Arabov, M. K. (2026). Tajik POS Tagger – XLM-RoBERTa + LoRA [Model]. Hugging Face. https://huggingface.co/TajikNLPWorld/tajik-pos-xlm-roberta-lora
Glossary [optional]
- POS tagging: Part‑of‑speech tagging – assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence.
- LoRA: Low‑Rank Adaptation – a parameter‑efficient fine‑tuning method that adds trainable rank‑decomposition matrices to certain layers.
More Information [optional]
The dataset used for training and evaluation is available at TajPersParallelLexicalCorpus. The complete experimental code will be released upon publication.
Model Card Authors [optional]
Mullosharaf K. Arabov
Model Card Contact
For questions or collaborations, please contact TajikNLPWorld or open an issue on the repository.