Model Card for LingoIITGN/COMI-LINGUA-TN
Model Description
This is a fine-tuned version of Llama-3.1-8B-Instruct for Text Normalization (TN) of noisy, informal, code-mixed Hinglish (Hindi-English) text. It takes input in Roman or Devanagari script, with spelling variations, abbreviations, transliteration errors, and other noise, and produces standardized, fluent, and consistent outputs.
The model preserves semantic meaning and naturalness while correcting spelling, script inconsistencies, informal contractions, and code-mixing artifacts. It shows strong performance on the COMI-LINGUA text normalization subset, significantly outperforming zero-shot and one-shot prompting baselines from both open- and closed-weight LLMs.
- Model type: LoRA-adapted Transformer LLM (8B params, ~32M trainable)
- License: apache-2.0
- Finetuned from model: meta-llama/Llama-3.1-8B-Instruct (strong performer on multilingual and code-mixed tasks)
Model Sources
- Paper: COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
- Demo: Integrated in Demo Portal (check for normalization support)
Uses
Preprocessing noisy Hinglish text from social media, chat applications, and other user-generated content for downstream NLP tasks (sentiment analysis, search, summarization).
Normalization in multilingual pipelines, chatbots, and content moderation systems that handle informal Indian-language input.
Example inference prompt:
Normalize the following noisy Hinglish sentence into three standardized formats: Standard English, Romanized Hindi, and Devanagari Hindi.
Input: "kal mujhe offis jaana hai bt traffic bhot bura hga bhai"
Output (Romanized Hindi format shown): "Kal mujhe office jana hai but traffic bhot bura hoga bhai"
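The prompt above can be assembled programmatically before being passed through the model's chat template. A minimal sketch of a prompt builder, assuming the instruction wording shown in this card (the exact template applied at inference time is an assumption, not documented here):

```python
def build_tn_prompt(noisy: str) -> str:
    """Wrap a noisy Hinglish sentence in the normalization instruction.

    The instruction text mirrors the example in this card; how it is
    combined with the model's chat template is left to the caller.
    """
    instruction = (
        "Normalize the following noisy Hinglish sentence into three "
        "standardized formats: Standard English, Romanized Hindi, and "
        "Devanagari Hindi."
    )
    return f'{instruction}\nInput: "{noisy}"\nOutput:'


prompt = build_tn_prompt("kal mujhe offis jaana hai bt traffic bhot bura hga bhai")
```

The resulting string can then be fed to `transformers` generation (e.g. via `tokenizer.apply_chat_template` with a single user message) against the base model with the LoRA adapter loaded.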
Training Details
Training Data
COMI-LINGUA dataset — Text Normalization (TN) subset: sentence-level normalization of noisy, informal, code-mixed Hinglish into standardized forms.
Training Hyperparameters
- Regime: PEFT LoRA (rank=32, alpha=64, dropout=0.1)
- Epochs: 3
- Batch: 4 (accum=8, effective=32)
- LR: 2e-4 (cosine + warmup=0.1)
- Weight decay: 0.01
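The hyperparameters above can be collected into a single training config. A sketch as a plain dictionary, using only the values listed in this card (the mapping to any particular trainer's argument names is an assumption):

```python
# Training hyperparameters from this card, expressed as a config dict.
train_config = {
    "lora_rank": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "epochs": 3,
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
}

# Effective batch size = per-device batch x accumulation steps (4 x 8 = 32),
# matching the "effective=32" figure above.
effective_batch = (
    train_config["per_device_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
```

These values map directly onto `peft.LoraConfig` (`r`, `lora_alpha`, `lora_dropout`) and a standard `transformers` training setup; which projection matrices the LoRA adapters target is not stated in this card.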
Evaluation
Testing Data
COMI-LINGUA TN test set (~5K instances).
Results
Summary: Achieves substantial gains over prompting baselines, with particularly strong performance on Devanagari output due to script standardization. Outperforms zero-shot/one-shot prompting of LLaMA-3.1-8B-Instruct and approaches or exceeds several closed-weight LLMs on noisy Hinglish normalization.
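For a quick sanity check of normalization quality on the test set, a simple whitespace-insensitive exact-match score can be computed. This is an illustrative metric only; the metrics reported in the COMI-LINGUA paper may differ:

```python
def exact_match(preds: list[str], refs: list[str]) -> float:
    """Fraction of predictions equal to the reference after collapsing
    runs of whitespace. Illustrative only; not the paper's metric."""
    assert len(preds) == len(refs), "prediction/reference count mismatch"
    hits = sum(
        " ".join(p.split()) == " ".join(r.split())
        for p, r in zip(preds, refs)
    )
    return hits / len(preds)
```

Script-level metrics (e.g. chrF or edit distance computed separately per output script) would better reflect the Devanagari standardization gains noted above.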
Bias, Risks, and Limitations
This model is a research preview and is subject to ongoing iterative updates; as such, it provides only limited safety measures.
It may occasionally over-normalize culturally specific slang, or fail on highly domain-specific informal terms that are under-represented in the training data.
Model Card Contact
Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in
Citation
If you use this model, please cite the following work:
@inproceedings{sheth-etal-2025-comi,
title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
author = "Sheth, Rajvee and
Beniwal, Himanshu and
Singh, Mayank",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.422/",
pages = "7973--7992",
ISBN = "979-8-89176-335-7",
}