πŸŒ™ Sindhi-Qwen2.5-7B v2 β€” The World's Most Advanced Sindhi LLM

Ψ³Ω†ΪŒΩŠ ΪͺΩˆΩŠΩ† β€” Ψ³Ω†ΪŒΩŠ Ω»ΩˆΩ„ΩŠΨ‘ΩŽ لاِؑ Ψ―Ω†ΩŠΨ§ جو Ψ³Ϊ€ Ϊ©Ψ§Ω† ΨͺΨ±Ω‚ΩŠ يافΨͺΩ‡ Ω»ΩˆΩ„ΩŠ Ω…Ψ§ΪŠΩ„ ("Sindhi Qwen: the world's most advanced language model for Sindhi")

A state-of-the-art Sindhi language model built through continued pretraining and instruction tuning of Qwen2.5-7B on the largest Sindhi dataset ever assembled.

πŸ†• What's New in v2

| Feature | v1 | v2 |
|---|---|---|
| Pretraining Corpus | 200M tokens, 3 sources | 1B+ tokens, 8+ sources |
| Instruction Data | 18 examples | 490,392 conversations |
| Data Sources | 3 (FineWeb-2, Wikipedia, English replay) | 12+ (news, legal, encyclopedia, religious, literary, SFT collections) |
| Replay Languages | English only (7%) | English (7.1%) + Urdu (5%) |
| Deduplication | None | Hash-based exact dedup |
| Unicode Cleaning | None | NFC normalization + OCR noise removal |
| Sequence Length | 2048 | 4096 |
| LoRA Config | r=128, alpha=32 | r=128, alpha=256, rs-lora |
| Embedding Training | Same LR as LoRA | Differential LR (10x lower) |
| Evaluation | Manual prompts | Automated benchmark suite |

πŸ“Š Resources

| Resource | Link | Size |
|---|---|---|
| v2 Pretraining Corpus | shaikhsalman/sindhi-llm-corpus-v2 | 1B+ tokens |
| v2 Instruction Dataset | shaikhsalman/sindhi-instruct-v2 | 490,392 conversations |
| Original Corpus | shaikhsalman/sindhi-llm-corpus | ~200M tokens |
| Base Model | Qwen/Qwen2.5-7B | 7.6B params |
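
Both v2 datasets load directly with πŸ€— Datasets. A minimal sketch, assuming the default `train` split:

```python
from datasets import load_dataset

# Stream the pretraining corpus so the 1B+ token dump is not downloaded at once
corpus = load_dataset("shaikhsalman/sindhi-llm-corpus-v2", split="train", streaming=True)
print(next(iter(corpus)))

# The instruction dataset is small enough to load eagerly
sft = load_dataset("shaikhsalman/sindhi-instruct-v2", split="train")
print(len(sft), "conversations")
```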

πŸ—οΈ Training Pipeline

```
Qwen2.5-7B (151K vocab, Arabic script)
    ↓
[Stage 1: CPT β€” 1B+ Sindhi tokens + En/Ur replay]
    LoRA r=128, rs-lora, all-linear + embed + lm_head
    Differential LR: 2e-5 main / 2e-6 embeddings
    Sequence length: 4096, packing + FlashAttention-2
    ↓
Sindhi-Qwen2.5-7B-v2-cpt
    ↓
[Stage 2: SFT β€” 490K instruction examples]
    Same LoRA, LR 5e-5, linear scheduler, 2 epochs
    assistant_only_loss, ChatML format
    ↓
Sindhi-Qwen2.5-7B-v2 (Final)
```

πŸ“¦ Pretraining Corpus v2 (1B+ tokens)

Assembled from 8+ sources spanning news, literary, religious, legal, encyclopedic, and general web text:

| Source | Dataset | Documents | Size / Share |
|---|---|---|---|
| Web Corpus | aakashMeghwar01/sindhi-corpus-505m | 742K | 850 MB |
| Mixed Corpus | jamalimubashirali/sindhi-pretraining-corpus-part1 | 1M+ | 1.77 GB |
| Previous Corpus | shaikhsalman/sindhi-llm-corpus | 293K | 638 MB |
| Religious Texts | arnizamani/Sindhi-texts-big-dataset | β€” | 310 MB |
| News (AwamiAwaz) | DanishMahdi/Sindhi_News_Corpus_AwamiAwaz | β€” | 14 MB |
| News (SindhExpress) | DanishMahdi/Sindhi_News_Corpus_SindhExpress | β€” | 5 MB |
| Encyclopedia | DanishMahdi/Encyclopedia_Sindhiana_text_corpus | β€” | 8 MB |
| Legal Documents | DanishMahdi/Sindhi_Legal_2 | β€” | 22 MB |
| English Replay | FineWeb-2 eng_Latn | ~100K | 7.1% of mix |
| Urdu Replay | FineWeb-2 urd_Arab | ~80K | 5% of mix |

Processing: Unicode NFC normalization, OCR noise removal, URL/email removal, quality filtering, hash-based exact deduplication.
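
A minimal sketch of these steps, assuming plain-text records; the regexes and length threshold below are illustrative stand-ins for the actual filters:

```python
import hashlib
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # Unicode NFC normalization
    text = URL_RE.sub(" ", text)                # URL removal
    text = EMAIL_RE.sub(" ", text)              # email removal
    return re.sub(r"\s+", " ", text).strip()    # collapse OCR whitespace noise

seen: set[str] = set()

def keep(text: str) -> bool:
    """Quality filter plus hash-based exact deduplication."""
    if len(text) < 100:                         # illustrative min-length filter
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen:                          # exact duplicate already seen
        return False
    seen.add(digest)
    return True

# usage: cleaned = [t for t in map(clean, raw_docs) if keep(t)]
```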

πŸ’¬ Instruction Dataset v2 (490,392 conversations)

| Source | Conversations | % |
|---|---|---|
| aakashMeghwar01/Sindhi-Intelligence-Core-SFT | 356,814 | 72.8% |
| saillab/alpaca-sindhi-cleaned | 51,991 | 10.6% |
| Telugu-LLM-Labs/sindhi_alpaca_yahma (Sindhi) | 28,908 | 5.9% |
| Telugu-LLM-Labs/sindhi_alpaca_yahma (English) | 28,908 | 5.9% |
| aakashMeghwar01/Sindhi-Intelligence-Core-SFT-v2 | 14,975 | 3.1% |
| Tensoic/GPTeacher-Sindhi | 8,679 | 1.8% |
| aakashMeghwar01/Sindhi-Factual-QA | 99 | <0.1% |
| shaikhsalman/sindhi-instruct-dataset | 18 | <0.1% |

Format: ChatML-style messages lists, compatible with the Qwen2.5 chat template.
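
For reference, here is one conversation in that layout, rendered through the Qwen2.5 tokenizer's ChatML template. The record itself is illustrative, not sampled from the dataset:

```python
from transformers import AutoTokenizer

example = {
    "messages": [
        # "Who was Shah Abdul Latif Bhittai?"
        {"role": "user", "content": "Ψ΄Ψ§Ω‡Ω‡ ΨΉΨ¨Ψ―Ψ§Ω„Ω„Ψ·ΩŠΩ Ϊ€Ω½Ψ§Ψ¦ΩŠ Ϊͺير Ω‡ΩˆΨŸ"},
        # "Shah Abdul Latif Bhittai was a great Sufi poet of Sindh."
        {"role": "assistant", "content": "Ψ΄Ψ§Ω‡Ω‡ ΨΉΨ¨Ψ―Ψ§Ω„Ω„Ψ·ΩŠΩ Ϊ€Ω½Ψ§Ψ¦ΩŠ Ψ³Ω†ΪŒ جو ΨΉΨΈΩŠΩ… Ψ΅ΩˆΩΩŠ شاعر Ω‡Ωˆ."},
    ]
}

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# Renders as ChatML: <|im_start|>user ... <|im_end|> / <|im_start|>assistant ... <|im_end|>
print(tok.apply_chat_template(example["messages"], tokenize=False))
```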

βš™οΈ Training Recipe

Based on the recipes of Qalb (arXiv:2601.08141, state-of-the-art Urdu LLM) and ChocoLlama (arXiv:2412.07633):

Stage 1: Continued Pretraining

| Parameter | Value | Source |
|---|---|---|
| Base Model | Qwen/Qwen2.5-7B | Sailor2 |
| LoRA Rank | 128 | Qalb |
| LoRA Alpha | 256 (rs-lora) | LoRA Without Regret |
| Target Modules | all-linear | ChocoLlama |
| modules_to_save | embed_tokens, lm_head | ChocoLlama |
| Main LR | 2e-5 | Qalb |
| Embedding LR | 2e-6 (10x lower) | Qalb |
| Optimizer | AdamW-8bit | Qalb |
| Scheduler | Cosine, 5% warmup | Qalb |
| Sequence Length | 4096 | β€” |
| Batch Size | 32 (4Γ—8) | β€” |
| Precision | bfloat16 | β€” |
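
A hedged sketch of how this configuration maps onto current peft APIs. With rs-lora, the adapter scaling becomes alpha/√r = 256/√128 β‰ˆ 22.6 rather than alpha/r. The parameter-group split below is one way to realize the differential LR, and plain AdamW stands in for the recipe's AdamW-8bit:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

lora = LoraConfig(
    r=128,
    lora_alpha=256,
    use_rslora=True,                              # scale by alpha/sqrt(r), not alpha/r
    target_modules="all-linear",
    modules_to_save=["embed_tokens", "lm_head"],  # fully trained, not LoRA-adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Differential LR: embeddings/lm_head train 10x slower than the LoRA adapters
embed_params, lora_params = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    (embed_params if ("embed_tokens" in name or "lm_head" in name) else lora_params).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": lora_params, "lr": 2e-5},   # main LR
        {"params": embed_params, "lr": 2e-6},  # embedding LR, 10x lower
    ]
)
```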

Stage 2: SFT

| Parameter | Value |
|---|---|
| LR | 5e-5 |
| Scheduler | Linear |
| Epochs | 2 |
| Batch Size | 64 (8Γ—8) |
| Max Length | 2048 |
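
A hedged sketch of Stage 2 with TRL's SFTTrainer. `assistant_only_loss` and `max_length` follow recent TRL releases (older versions spell the latter `max_seq_length`), and loading the Stage 1 adapter checkpoint is elided:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("shaikhsalman/sindhi-instruct-v2", split="train")

config = SFTConfig(
    output_dir="sindhi-qwen2.5-7b-v2-sft",   # illustrative path
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,           # effective batch size 64
    max_length=2048,
    assistant_only_loss=True,                # mask loss on non-assistant turns
    bf16=True,
)

# In the real pipeline the model would be the Stage 1 CPT checkpoint
trainer = SFTTrainer(model="Qwen/Qwen2.5-7B", args=config, train_dataset=dataset)
trainer.train()
```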

πŸš€ Quick Start

Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the Qwen2.5-7B base and attach the Sindhi v2 LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "shaikhsalman/Sindhi-Qwen2.5-7B-v2")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# "Who was Shah Abdul Latif Bhittai?"
messages = [{"role": "user", "content": "Ψ΄Ψ§Ω‡Ω‡ ΨΉΨ¨Ψ―Ψ§Ω„Ω„Ψ·ΩŠΩ Ϊ€Ω½Ψ§Ψ¦ΩŠ Ϊͺير Ω‡ΩˆΨŸ"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature/top_p to take effect
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
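
To serve the model without a PEFT dependency, the adapter can be folded into the base weights first. A minimal sketch using PEFT's standard merge call; the output path is illustrative:

```python
# Optional: merge the LoRA adapter into the base model for adapter-free deployment
merged = model.merge_and_unload()
merged.save_pretrained("Sindhi-Qwen2.5-7B-v2-merged")
tokenizer.save_pretrained("Sindhi-Qwen2.5-7B-v2-merged")
```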

Training

```bash
pip install transformers trl peft datasets bitsandbytes accelerate flash-attn trackio
```

```bash
python scripts/pretrain_sindhi_v2.py    # Stage 1: CPT
python scripts/sft_sindhi_v2.py         # Stage 2: SFT
python scripts/evaluate_sindhi_v2.py    # Evaluation
python scripts/tokenizer_fertility.py   # Tokenizer analysis
```
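
Tokenizer fertility (average subword tokens per whitespace-separated word) indicates how efficiently Qwen2.5's 151K vocabulary covers Sindhi's Arabic script. A minimal sketch of the idea; the repo's tokenizer_fertility.py may compute it differently:

```python
from transformers import AutoTokenizer

# Measure tokens-per-word on a small Sindhi sample (taken from this card's tagline)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
sample = "Ψ³Ω†ΪŒΩŠ Ω»ΩˆΩ„ΩŠΨ‘ΩŽ لاِؑ Ψ―Ω†ΩŠΨ§ جو Ψ³Ϊ€ Ϊ©Ψ§Ω† ΨͺΨ±Ω‚ΩŠ يافΨͺΩ‡ Ω»ΩˆΩ„ΩŠ Ω…Ψ§ΪŠΩ„"

words = sample.split()
tokens = tok.tokenize(sample)
print(f"fertility = {len(tokens) / len(words):.2f} tokens/word")
```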

πŸ–₯️ Hardware

| Stage | GPU | Est. Time | Est. Cost |
|---|---|---|---|
| CPT | A100 80GB | 8-12 h | ~$35-50 |
| SFT | A100 80GB | 4-6 h | ~$15-25 |
| Eval | T4 16GB | ~30 min | ~$1 |

πŸ“– References

  1. Qalb β€” arXiv:2601.08141 β€” state-of-the-art Urdu LLM recipe
  2. Sailor2 β€” arXiv:2502.12982 β€” multilingual CPT on Qwen2.5
  3. ChocoLlama β€” arXiv:2412.07633 β€” LoRA-based language adaptation
  4. Nemotron-Mini-Hindi β€” arXiv:2410.14815 β€” CPT for target-language accuracy

πŸ™ Acknowledgments

Built on the work of: aakashMeghwar01, DanishMahdi, arnizamani, jamalimubashirali, Tensoic, saillab, Telugu-LLM-Labs, HuggingFaceFW

πŸ“œ Mission

Ψ³Ω†ΪŒΩŠ ΨͺΩ‡Ψ°ΩŠΨ¨ΨŒ ادب، شاعري Ϋ½ ثقافΨͺ کي AI ذريعي Ω…Ψ­ΩΩˆΨΈ ΪͺΨ±Ϊ» Ϋ½ Ψ§Ϊ³Ψͺي وڌائڻ

Preserving Sindhi civilization, literature, poetry, and culture through AI, and carrying them forward for current and future generations.

License

Apache 2.0 (inherited from Qwen/Qwen2.5-7B)
