# Sindhi-Qwen2.5-7B v2: The World's Most Advanced Sindhi LLM
*Sindhi Qwen: the world's most advanced language model for the Sindhi language*
A state-of-the-art Sindhi language model built through continued pretraining and instruction tuning of Qwen2.5-7B on the largest Sindhi dataset ever assembled.
## What's New in v2
| Feature | v1 | v2 |
|---|---|---|
| Pretraining Corpus | 200M tokens, 3 sources | 1B+ tokens, 8+ sources |
| Instruction Data | 18 examples | 490,392 conversations |
| Data Sources | 3 (FineWeb-2, Wikipedia, English replay) | 12+ (news, legal, encyclopedia, religious, literary, SFT collections) |
| Replay Languages | English only (7%) | English (7.1%) + Urdu (5%) |
| Deduplication | None | Hash-based exact dedup |
| Unicode Cleaning | None | NFC normalization + OCR noise removal |
| Sequence Length | 2048 | 4096 |
| LoRA Config | r=128, alpha=32 | r=128, alpha=256, rs-lora (see scaling note below) |
| Embedding Training | Same LR as LoRA | Differential LR (10x lower) |
| Evaluation | Manual prompts | Automated benchmark suite |
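A note on the rs-lora entry above: rank-stabilized LoRA changes the adapter scaling factor from α/r to α/√r, which keeps update magnitudes from collapsing at high ranks. Comparing the v1 and v2 configurations at r=128:

$$
\Delta W_{\text{v1, LoRA}} = \frac{\alpha}{r}\,BA = \frac{32}{128}\,BA = 0.25\,BA,
\qquad
\Delta W_{\text{v2, rsLoRA}} = \frac{\alpha}{\sqrt{r}}\,BA = \frac{256}{\sqrt{128}}\,BA \approx 22.6\,BA
$$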
## Resources
| Resource | Link | Size |
|---|---|---|
| v2 Pretraining Corpus | shaikhsalman/sindhi-llm-corpus-v2 | 1B+ tokens |
| v2 Instruction Dataset | shaikhsalman/sindhi-instruct-v2 | 490,392 conversations |
| Original Corpus | shaikhsalman/sindhi-llm-corpus | ~200M tokens |
| Base Model | Qwen/Qwen2.5-7B | 7.6B params |
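The v2 artifacts can be pulled straight from the Hub. A minimal sketch (the `train` split and record layout are assumptions; check each dataset card for the actual schema):

```python
# Load the v2 corpus and instruction data listed in the Resources table.
from datasets import load_dataset

corpus = load_dataset("shaikhsalman/sindhi-llm-corpus-v2", split="train")
instruct = load_dataset("shaikhsalman/sindhi-instruct-v2", split="train")

print(corpus)       # inspect column names before wiring up training
print(instruct[0])  # expected: a ChatML-style "messages" record
```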
## Training Pipeline
```text
Qwen2.5-7B (151K vocab, Arabic script)
                 ↓
[Stage 1: CPT – 1B+ Sindhi tokens + En/Ur replay]
   LoRA r=128, rs-lora, all-linear + embed + lm_head
   Differential LR: 2e-5 main / 2e-6 embeddings
   Sequence length: 4096, packing + FlashAttention-2
                 ↓
      Sindhi-Qwen2.5-7B-v2-cpt
                 ↓
[Stage 2: SFT – 490K instruction examples]
   Same LoRA, LR 5e-5, linear scheduler, 2 epochs
   assistant_only_loss, ChatML format
                 ↓
    Sindhi-Qwen2.5-7B-v2 (Final)
```
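The Stage 1 adapter setup maps onto a PEFT `LoraConfig` roughly as follows. This is a sketch of the configuration named in the diagram, not the exact arguments of the training script; `lora_dropout` is an assumption, since the recipe does not state it:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    use_rslora=True,                              # scale by alpha/sqrt(r)
    target_modules="all-linear",                  # adapter on every linear layer
    modules_to_save=["embed_tokens", "lm_head"],  # trained fully, not via LoRA
    lora_dropout=0.0,                             # assumption: not stated in recipe
    task_type="CAUSAL_LM",
)
```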
## Pretraining Corpus v2 (1B+ tokens)
Assembled from 8+ sources spanning news, literary, religious, legal, encyclopedic, and web text:
| Source | Dataset | Documents | Size |
|---|---|---|---|
| Web Corpus | aakashMeghwar01/sindhi-corpus-505m | 742K | 850 MB |
| Mixed Corpus | jamalimubashirali/sindhi-pretraining-corpus-part1 | 1M+ | 1.77 GB |
| Previous Corpus | shaikhsalman/sindhi-llm-corpus | 293K | 638 MB |
| Religious Texts | arnizamani/Sindhi-texts-big-dataset | — | 310 MB |
| News (AwamiAwaz) | DanishMahdi/Sindhi_News_Corpus_AwamiAwaz | — | 14 MB |
| News (SindhExpress) | DanishMahdi/Sindhi_News_Corpus_SindhExpress | — | 5 MB |
| Encyclopedia | DanishMahdi/Encyclopedia_Sindhiana_text_corpus | — | 8 MB |
| Legal Documents | DanishMahdi/Sindhi_Legal_2 | — | 22 MB |
| English Replay | FineWeb-2 eng_Latn | ~100K | 7.1% of tokens |
| Urdu Replay | FineWeb-2 urd_Arab | ~80K | 5% of tokens |
**Processing:** Unicode NFC normalization, OCR noise removal, URL/email removal, quality filtering, and hash-based exact deduplication.
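A minimal sketch of those processing steps (the regexes and ordering are illustrative, not the exact filters used to build the corpus):

```python
import hashlib
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = URL_RE.sub(" ", text)               # strip URLs
    text = EMAIL_RE.sub(" ", text)             # strip email addresses
    return " ".join(text.split())              # collapse whitespace

def dedup_exact(docs):
    """Hash-based exact deduplication: keep the first copy of each text."""
    seen = set()
    for doc in docs:
        h = hashlib.sha256(clean(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc
```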
## Instruction Dataset v2 (490,392 conversations)
| Source | Conversations | % |
|---|---|---|
| aakashMeghwar01/Sindhi-Intelligence-Core-SFT | 356,814 | 72.8% |
| saillab/alpaca-sindhi-cleaned | 51,991 | 10.6% |
| Telugu-LLM-Labs/sindhi_alpaca_yahma (Sindhi) | 28,908 | 5.9% |
| Telugu-LLM-Labs/sindhi_alpaca_yahma (English) | 28,908 | 5.9% |
| aakashMeghwar01/Sindhi-Intelligence-Core-SFT-v2 | 14,975 | 3.1% |
| Tensoic/GPTeacher-Sindhi | 8,679 | 1.8% |
| aakashMeghwar01/Sindhi-Factual-QA | 99 | <0.1% |
| shaikhsalman/sindhi-instruct-dataset | 18 | <0.1% |
**Format:** ChatML `messages` format, compatible with the Qwen2.5 chat template.
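Concretely, each record carries a `messages` list that the Qwen2.5 tokenizer renders into ChatML markup. A sketch (the Sindhi strings below are illustrative placeholders, not rows from the dataset):

```python
from transformers import AutoTokenizer

example = {
    "messages": [
        # "Where is Sindhi spoken?"
        {"role": "user", "content": "سنڌي ٻولي ڪٿي ڳالهائي ويندي آهي؟"},
        # "Sindhi is spoken mainly in Sindh."
        {"role": "assistant", "content": "سنڌي ٻولي خاص طور سنڌ ۾ ڳالهائي ويندي آهي."},
    ]
}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
print(tokenizer.apply_chat_template(example["messages"], tokenize=False))
# -> <|im_start|>user ... <|im_end|> / <|im_start|>assistant ... <|im_end|>
```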
## Training Recipe
Based on the recipes of Qalb (arXiv:2601.08141, SOTA Urdu LLM) and ChocoLlama (arXiv:2412.07633):
### Stage 1: Continued Pretraining
| Parameter | Value | Source |
|---|---|---|
| Base Model | Qwen/Qwen2.5-7B | Sailor2 |
| LoRA Rank | 128 | Qalb |
| LoRA Alpha | 256 (rs-lora) | LoRA Without Regret |
| Target Modules | all-linear | ChocoLlama |
| modules_to_save | embed_tokens, lm_head | ChocoLlama |
| Main LR | 2e-5 | Qalb |
| Embedding LR | 2e-6 (10× lower; see sketch below) | Qalb |
| Optimizer | AdamW-8bit | Qalb |
| Scheduler | Cosine, 5% warmup | Qalb |
| Sequence Length | 4096 | — |
| Batch Size | 32 (4×8) | — |
| Precision | bfloat16 | β |
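The differential learning rate can be realized with explicit optimizer parameter groups. A sketch, assuming `model` is the PEFT-wrapped base model from Stage 1 (the actual scripts may route this through the Trainer instead):

```python
import bitsandbytes as bnb

embed_params, main_params = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "embed_tokens" in name or "lm_head" in name:
        embed_params.append(param)   # fully trained and fragile -> low LR
    else:
        main_params.append(param)    # LoRA A/B matrices

optimizer = bnb.optim.AdamW8bit([
    {"params": main_params, "lr": 2e-5},   # main LR from the table
    {"params": embed_params, "lr": 2e-6},  # 10x lower for embeddings/head
])
```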
### Stage 2: SFT
| Parameter | Value |
|---|---|
| LR | 5e-5 |
| Scheduler | Linear |
| Epochs | 2 |
| Batch Size | 64 (8×8) |
| Max Length | 2048 |
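In TRL terms, Stage 2 corresponds roughly to the following `SFTConfig`. This is a sketch against recent TRL versions (where `assistant_only_loss` and `max_length` are available); `model` is assumed to be the Stage 1 checkpoint and `train_dataset` the 490K instruction set:

```python
from trl import SFTConfig, SFTTrainer

sft_config = SFTConfig(
    output_dir="sindhi-qwen2.5-7b-v2-sft",
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # 8 x 8 = effective batch size 64
    max_length=2048,
    assistant_only_loss=True,        # compute loss on assistant turns only
    bf16=True,
)

trainer = SFTTrainer(model=model, args=sft_config, train_dataset=train_dataset)
trainer.train()
```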
## Quick Start
### Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model and attach the Sindhi LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "shaikhsalman/Sindhi-Qwen2.5-7B-v2")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# Sindhi prompt: "Who was Shah Abdul Latif Bhittai?"
messages = [{"role": "user", "content": "شاهه عبداللطيف ڀٽائي ڪير هو؟"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p to take effect
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
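For adapter-free deployment (e.g. before exporting to vLLM or GGUF), the LoRA weights can be folded into the base model with standard PEFT calls; this step is optional and not part of the original pipeline:

```python
# Merge the adapter into the base weights and save a standalone model
merged = model.merge_and_unload()
merged.save_pretrained("Sindhi-Qwen2.5-7B-v2-merged")
tokenizer.save_pretrained("Sindhi-Qwen2.5-7B-v2-merged")
```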
### Training
```bash
pip install transformers trl peft datasets bitsandbytes accelerate flash-attn trackio

python scripts/pretrain_sindhi_v2.py     # Stage 1: CPT
python scripts/sft_sindhi_v2.py          # Stage 2: SFT
python scripts/evaluate_sindhi_v2.py     # Evaluation
python scripts/tokenizer_fertility.py    # Tokenizer analysis
```
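For intuition on what `tokenizer_fertility.py` measures: fertility is the average number of tokens per whitespace-separated word, and a high value on Sindhi text means longer sequences and higher cost per sentence. A self-contained sketch (the sample sentence is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# "Sindhi is an ancient and rich language."
text = "سنڌي ٻولي هڪ قديم ۽ شاهوڪار ٻولي آهي."
words = text.split()
tokens = tokenizer.tokenize(text)
print(f"fertility = {len(tokens) / len(words):.2f} tokens/word")
```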
## Hardware
| Stage | GPU | Est. Time | Est. Cost |
|---|---|---|---|
| CPT | A100 80GB | 8-12h | ~$35-50 |
| SFT | A100 80GB | 4-6h | ~$15-25 |
| Eval | T4 16GB | 30min | ~$1 |
## References
- Qalb (arXiv:2601.08141) – SOTA Urdu LLM recipe
- Sailor2 (arXiv:2502.12982) – multilingual CPT on Qwen2.5
- ChocoLlama (arXiv:2412.07633) – LoRA-based language adaptation
- Nemotron-Mini-Hindi (arXiv:2410.14815) – CPT for target-language accuracy
## Acknowledgments
Built on the work of: aakashMeghwar01, DanishMahdi, arnizamani, jamalimubashirali, Tensoic, saillab, Telugu-LLM-Labs, HuggingFaceFW
## Mission
Preserving and advancing Sindhi civilization, literature, poetry, and culture through AI, for current and future generations.
## License
Apache 2.0 (inherited from Qwen2.5-7B)