---
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: peft
license: apache-2.0
datasets:
- LingoIITGN/COMI-LINGUA
language:
- hi
- en
tags:
- code-mixing
- Hinglish
- text-normalization
- multilingual-normalization
metrics:
- exact match
- chrF
- BLEU (for reference)
pipeline_tag: text-generation
---

# Model Card for Model ID
### Model Description

This is a fine-tuned version of Llama-3.1-8B-Instruct for **Text Normalization (TN)** on noisy, informal, and code-mixed **Hinglish** (Hindi-English) text. It takes informal, noisy, code-mixed input (Roman/Devanagari scripts, with spelling variations, abbreviations, transliteration errors, etc.) and produces standardized, fluent, and consistent outputs.

The model preserves semantic meaning and naturalness while correcting spelling, script inconsistencies, informal contractions, and code-mixing artifacts. It shows strong performance on the COMI-LINGUA text normalization subset, significantly outperforming zero-shot and one-shot prompting baselines from both open- and closed-weight LLMs.

- **Model type:** LoRA-adapted Transformer LLM (8B params, ~32M trainable)
- **License:** apache-2.0
- **Finetuned from model:** meta-llama/Llama-3.1-8B-Instruct (strong performer on multilingual and code-mixed tasks)

### Model Sources
- **Paper:** [COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing](https://aclanthology.org/2025.findings-emnlp.422.pdf)
- **Demo:** Integrated in [Demo Portal](https://lingo.iitgn.ac.in/comi-lingua/) (check for normalization support)

## Uses
- Preprocessing noisy Hinglish text from social media, chat applications, user-generated content for downstream NLP (sentiment analysis, search, summarization).
- Normalization in multilingual pipelines, chatbots, content moderation systems handling informal Indian language.

- **Example inference prompt:**
```
Normalize the following noisy Hinglish sentence into three standardized formats: Standard English, Romanized Hindi, and Devanagari Hindi.
Input: "kal mujhe offis jaana hai bt traffic bhot bura hga bhai"
Output : "Kal mujhe office jana hai but traffic bhot bura hoga bhai"
```

## Training Details
### Training Data
[COMI-LINGUA Dataset Card](https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA) — Text Normalization (TN) subset: sentence-level normalization of noisy, informal, code-mixed Hinglish to standardized forms.


#### Training Hyperparameters
- **Regime:** PEFT LoRA (rank=32, alpha=64, dropout=0.1)
- **Epochs:** 3
- **Batch:** 4 (accum=8, effective=32)
- **LR:** 2e-4 (cosine + warmup=0.1)
- **Weight decay:** 0.01

## Evaluation
#### Testing Data
COMI-LINGUA TN test set (~5K instances).

### Results
**Summary:** Achieves substantial gains over prompting baselines, with particularly strong performance on Devanagari output due to script standardization. Outperforms zero-shot/one-shot prompting of LLaMA-3.1-8B-Instruct and approaches or exceeds several closed-weight LLMs on noisy Hinglish normalization.

## Bias, Risks, and Limitations
<span style="color:red"> This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.</span>

May occasionally over-normalize culturally specific slang or fail on highly domain-specific informal terms not well-represented in training data.

## Model Card Contact
[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) </br>
Mail at: [lingo@iitgn.ac.in](mailto:lingo@iitgn.ac.in)

## Citation
If you use this model, please cite the following work:
```
@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee and
      Beniwal, Himanshu and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}
```