πŸŒ™ Sindhi-Qwen2.5-7B v2 β€” The World's Most Advanced Sindhi LLM

Ψ³Ω†ΪŒΩŠ ΪͺΩˆΩŠΩ† β€” Ψ³Ω†ΪŒΩŠ Ω»ΩˆΩ„ΩŠΨ‘ΩŽ لاِؑ Ψ―Ω†ΩŠΨ§ جو Ψ³Ϊ€ Ϊ©Ψ§Ω† ΨͺΨ±Ω‚ΩŠ يافΨͺΩ‡ Ω»ΩˆΩ„ΩŠ Ω…Ψ§ΪŠΩ„ ("Sindhi Qwen: the world's most advanced language model for Sindhi")

A state-of-the-art Sindhi language model built through continued pretraining and instruction tuning of Qwen2.5-7B on the largest Sindhi dataset ever assembled.

πŸ†• What's New in v2

| Feature | v1 | v2 |
|---|---|---|
| Pretraining Corpus | 200M tokens, 3 sources | 1B+ tokens, 8+ sources |
| Instruction Data | 18 examples | 490,392 conversations |
| Data Sources | 3 (FineWeb-2, Wikipedia, English replay) | 12+ (news, legal, encyclopedia, religious, literary, SFT collections) |
| Replay Languages | English only (7%) | English (7.1%) + Urdu (5%) |
| Deduplication | None | Hash-based exact dedup |
| Unicode Cleaning | None | NFC normalization + OCR noise removal |
| Sequence Length | 2048 | 4096 |
| LoRA Config | r=128, alpha=32 | r=128, alpha=256, rs-lora |
| Embedding Training | Same LR as LoRA | Differential LR (10x lower) |
| Evaluation | Manual prompts | Automated benchmark suite |

πŸ“Š Resources

| Resource | Link | Size |
|---|---|---|
| v2 Pretraining Corpus | shaikhsalman/sindhi-llm-corpus-v2 | 1B+ tokens |
| v2 Instruction Dataset | shaikhsalman/sindhi-instruct-v2 | 490,392 conversations |
| Original Corpus | shaikhsalman/sindhi-llm-corpus | ~200M tokens |
| Base Model | Qwen/Qwen2.5-7B | 7.6B params |
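
Both v2 datasets load directly with πŸ€— Datasets. A minimal sketch, assuming the default `train` split:

```python
from datasets import load_dataset

# Stream the pretraining corpus so the 1B+ token dump is not downloaded at once
corpus = load_dataset("shaikhsalman/sindhi-llm-corpus-v2", split="train", streaming=True)
print(next(iter(corpus)))

# The instruction dataset is small enough to load eagerly
sft = load_dataset("shaikhsalman/sindhi-instruct-v2", split="train")
print(len(sft), "conversations")
```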

πŸ—οΈ Training Pipeline

```
Qwen2.5-7B (151K vocab, Arabic script)
    ↓
[Stage 1: CPT β€” 1B+ Sindhi tokens + En/Ur replay]
    LoRA r=128, rs-lora, all-linear + embed + lm_head
    Differential LR: 2e-5 main / 2e-6 embeddings
    Sequence length: 4096, packing + FlashAttention-2
    ↓
Sindhi-Qwen2.5-7B-v2-cpt
    ↓
[Stage 2: SFT β€” 490K instruction examples]
    Same LoRA, LR 5e-5, linear scheduler, 2 epochs
    assistant_only_loss, ChatML format
    ↓
Sindhi-Qwen2.5-7B-v2 (Final)
```

πŸ“¦ Pretraining Corpus v2 (1B+ tokens)

Assembled from 8+ sources spanning news, literary, religious, legal, encyclopedic, and general web text:

| Source | Dataset | Documents | Size / Share |
|---|---|---|---|
| Web Corpus | aakashMeghwar01/sindhi-corpus-505m | 742K | 850 MB |
| Mixed Corpus | jamalimubashirali/sindhi-pretraining-corpus-part1 | 1M+ | 1.77 GB |
| Previous Corpus | shaikhsalman/sindhi-llm-corpus | 293K | 638 MB |
| Religious Texts | arnizamani/Sindhi-texts-big-dataset | β€” | 310 MB |
| News (AwamiAwaz) | DanishMahdi/Sindhi_News_Corpus_AwamiAwaz | β€” | 14 MB |
| News (SindhExpress) | DanishMahdi/Sindhi_News_Corpus_SindhExpress | β€” | 5 MB |
| Encyclopedia | DanishMahdi/Encyclopedia_Sindhiana_text_corpus | β€” | 8 MB |
| Legal Documents | DanishMahdi/Sindhi_Legal_2 | β€” | 22 MB |
| English Replay | FineWeb-2 eng_Latn | ~100K | 7.1% of mix |
| Urdu Replay | FineWeb-2 urd_Arab | ~80K | 5% of mix |

Processing: Unicode NFC normalization, OCR noise removal, URL/email removal, quality filtering, hash-based exact deduplication.
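
A minimal sketch of these steps, assuming plain-text records; the regexes and length threshold below are illustrative stand-ins for the actual filters:

```python
import hashlib
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # Unicode NFC normalization
    text = URL_RE.sub(" ", text)                # URL removal
    text = EMAIL_RE.sub(" ", text)              # email removal
    return re.sub(r"\s+", " ", text).strip()    # collapse OCR whitespace noise

seen: set[str] = set()

def keep(text: str) -> bool:
    """Quality filter plus hash-based exact deduplication."""
    if len(text) < 100:                         # illustrative min-length filter
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen:                          # exact duplicate already seen
        return False
    seen.add(digest)
    return True

# usage: cleaned = [t for t in map(clean, raw_docs) if keep(t)]
```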

πŸ’¬ Instruction Dataset v2 (490,392 conversations)

| Source | Conversations | % |
|---|---|---|
| aakashMeghwar01/Sindhi-Intelligence-Core-SFT | 356,814 | 72.8% |
| saillab/alpaca-sindhi-cleaned | 51,991 | 10.6% |
| Telugu-LLM-Labs/sindhi_alpaca_yahma (Sindhi) | 28,908 | 5.9% |
| Telugu-LLM-Labs/sindhi_alpaca_yahma (English) | 28,908 | 5.9% |
| aakashMeghwar01/Sindhi-Intelligence-Core-SFT-v2 | 14,975 | 3.1% |
| Tensoic/GPTeacher-Sindhi | 8,679 | 1.8% |
| aakashMeghwar01/Sindhi-Factual-QA | 99 | <0.1% |
| shaikhsalman/sindhi-instruct-dataset | 18 | <0.1% |

Format: ChatML-style messages lists, compatible with the Qwen2.5 chat template.
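
For reference, here is one conversation in that layout, rendered through the Qwen2.5 tokenizer's ChatML template. The record itself is illustrative, not sampled from the dataset:

```python
from transformers import AutoTokenizer

example = {
    "messages": [
        # "Who was Shah Abdul Latif Bhittai?"
        {"role": "user", "content": "Ψ΄Ψ§Ω‡Ω‡ ΨΉΨ¨Ψ―Ψ§Ω„Ω„Ψ·ΩŠΩ Ϊ€Ω½Ψ§Ψ¦ΩŠ Ϊͺير Ω‡ΩˆΨŸ"},
        # "Shah Abdul Latif Bhittai was a great Sufi poet of Sindh."
        {"role": "assistant", "content": "Ψ΄Ψ§Ω‡Ω‡ ΨΉΨ¨Ψ―Ψ§Ω„Ω„Ψ·ΩŠΩ Ϊ€Ω½Ψ§Ψ¦ΩŠ Ψ³Ω†ΪŒ جو ΨΉΨΈΩŠΩ… Ψ΅ΩˆΩΩŠ شاعر Ω‡Ωˆ."},
    ]
}

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# Renders as ChatML: <|im_start|>user ... <|im_end|> / <|im_start|>assistant ... <|im_end|>
print(tok.apply_chat_template(example["messages"], tokenize=False))
```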

βš™οΈ Training Recipe

Based on the recipes of Qalb (arXiv:2601.08141, state-of-the-art Urdu LLM) and ChocoLlama (arXiv:2412.07633):

Stage 1: Continued Pretraining

| Parameter | Value | Source |
|---|---|---|
| Base Model | Qwen/Qwen2.5-7B | Sailor2 |
| LoRA Rank | 128 | Qalb |
| LoRA Alpha | 256 (rs-lora) | LoRA Without Regret |
| Target Modules | all-linear | ChocoLlama |
| modules_to_save | embed_tokens, lm_head | ChocoLlama |
| Main LR | 2e-5 | Qalb |
| Embedding LR | 2e-6 (10x lower) | Qalb |
| Optimizer | AdamW-8bit | Qalb |
| Scheduler | Cosine, 5% warmup | Qalb |
| Sequence Length | 4096 | β€” |
| Batch Size | 32 (4Γ—8) | β€” |
| Precision | bfloat16 | β€” |
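
A hedged sketch of how this configuration maps onto current peft APIs. With rs-lora, the adapter scaling becomes alpha/√r = 256/√128 β‰ˆ 22.6 rather than alpha/r. The parameter-group split below is one way to realize the differential LR, and plain AdamW stands in for the recipe's AdamW-8bit:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

lora = LoraConfig(
    r=128,
    lora_alpha=256,
    use_rslora=True,                              # scale by alpha/sqrt(r), not alpha/r
    target_modules="all-linear",
    modules_to_save=["embed_tokens", "lm_head"],  # fully trained, not LoRA-adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Differential LR: embeddings/lm_head train 10x slower than the LoRA adapters
embed_params, lora_params = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    (embed_params if ("embed_tokens" in name or "lm_head" in name) else lora_params).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": lora_params, "lr": 2e-5},   # main LR
        {"params": embed_params, "lr": 2e-6},  # embedding LR, 10x lower
    ]
)
```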

Stage 2: SFT

| Parameter | Value |
|---|---|
| LR | 5e-5 |
| Scheduler | Linear |
| Epochs | 2 |
| Batch Size | 64 (8Γ—8) |
| Max Length | 2048 |
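
A hedged sketch of Stage 2 with TRL's SFTTrainer. `assistant_only_loss` and `max_length` follow recent TRL releases (older versions spell the latter `max_seq_length`), and loading the Stage 1 adapter checkpoint is elided:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("shaikhsalman/sindhi-instruct-v2", split="train")

config = SFTConfig(
    output_dir="sindhi-qwen2.5-7b-v2-sft",   # illustrative path
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,           # effective batch size 64
    max_length=2048,
    assistant_only_loss=True,                # mask loss on non-assistant turns
    bf16=True,
)

# In the real pipeline the model would be the Stage 1 CPT checkpoint
trainer = SFTTrainer(model="Qwen/Qwen2.5-7B", args=config, train_dataset=dataset)
trainer.train()
```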

πŸš€ Quick Start

Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the Qwen2.5-7B base and attach the Sindhi v2 LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "shaikhsalman/Sindhi-Qwen2.5-7B-v2")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# "Who was Shah Abdul Latif Bhittai?"
messages = [{"role": "user", "content": "Ψ΄Ψ§Ω‡Ω‡ ΨΉΨ¨Ψ―Ψ§Ω„Ω„Ψ·ΩŠΩ Ϊ€Ω½Ψ§Ψ¦ΩŠ Ϊͺير Ω‡ΩˆΨŸ"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature/top_p to take effect
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
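
To serve the model without a PEFT dependency, the adapter can be folded into the base weights first. A minimal sketch using PEFT's standard merge call; the output path is illustrative:

```python
# Optional: merge the LoRA adapter into the base model for adapter-free deployment
merged = model.merge_and_unload()
merged.save_pretrained("Sindhi-Qwen2.5-7B-v2-merged")
tokenizer.save_pretrained("Sindhi-Qwen2.5-7B-v2-merged")
```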

Training

```bash
pip install transformers trl peft datasets bitsandbytes accelerate flash-attn trackio
```

```bash
python scripts/pretrain_sindhi_v2.py    # Stage 1: CPT
python scripts/sft_sindhi_v2.py         # Stage 2: SFT
python scripts/evaluate_sindhi_v2.py    # Evaluation
python scripts/tokenizer_fertility.py   # Tokenizer analysis
```
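
Tokenizer fertility (average subword tokens per whitespace-separated word) indicates how efficiently Qwen2.5's 151K vocabulary covers Sindhi's Arabic script. A minimal sketch of the idea; the repo's tokenizer_fertility.py may compute it differently:

```python
from transformers import AutoTokenizer

# Measure tokens-per-word on a small Sindhi sample (taken from this card's tagline)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
sample = "Ψ³Ω†ΪŒΩŠ Ω»ΩˆΩ„ΩŠΨ‘ΩŽ لاِؑ Ψ―Ω†ΩŠΨ§ جو Ψ³Ϊ€ Ϊ©Ψ§Ω† ΨͺΨ±Ω‚ΩŠ يافΨͺΩ‡ Ω»ΩˆΩ„ΩŠ Ω…Ψ§ΪŠΩ„"

words = sample.split()
tokens = tok.tokenize(sample)
print(f"fertility = {len(tokens) / len(words):.2f} tokens/word")
```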

πŸ–₯️ Hardware

| Stage | GPU | Est. Time | Est. Cost |
|---|---|---|---|
| CPT | A100 80GB | 8-12 h | ~$35-50 |
| SFT | A100 80GB | 4-6 h | ~$15-25 |
| Eval | T4 16GB | ~30 min | ~$1 |

πŸ“– References

  1. Qalb β€” arXiv:2601.08141 β€” state-of-the-art Urdu LLM recipe
  2. Sailor2 β€” arXiv:2502.12982 β€” multilingual CPT on Qwen2.5
  3. ChocoLlama β€” arXiv:2412.07633 β€” LoRA-based language adaptation
  4. Nemotron-Mini-Hindi β€” arXiv:2410.14815 β€” CPT for target-language accuracy

πŸ™ Acknowledgments

Built on the work of: aakashMeghwar01, DanishMahdi, arnizamani, jamalimubashirali, Tensoic, saillab, Telugu-LLM-Labs, HuggingFaceFW

πŸ“œ Mission

Ψ³Ω†ΪŒΩŠ ΨͺΩ‡Ψ°ΩŠΨ¨ΨŒ ادب، شاعري Ϋ½ ثقافΨͺ کي AI ذريعي Ω…Ψ­ΩΩˆΨΈ ΪͺΨ±Ϊ» Ϋ½ Ψ§Ϊ³Ψͺي وڌائڻ

Preserving Sindhi civilization, literature, poetry, and culture through AI, and carrying them forward for current and future generations.

License

Apache 2.0 (inherited from Qwen/Qwen2.5-7B)
