Model Card for LingoIITGN/COMI-LINGUA-TN

Model Description

This is a fine-tuned version of Llama-3.1-8B-Instruct for Text Normalization (TN) on noisy, informal, code-mixed Hinglish (Hindi-English) text. It accepts input in Roman or Devanagari script containing spelling variations, abbreviations, transliteration errors, and other noise, and produces standardized, fluent, and consistent output.

The model preserves semantic meaning and naturalness while correcting spelling, script inconsistencies, informal contractions, and code-mixing artifacts. It shows strong performance on the COMI-LINGUA text normalization subset, significantly outperforming zero-shot and one-shot prompting baselines from both open- and closed-weight LLMs.

  • Model type: LoRA-adapted Transformer LLM (8B params, ~32M trainable)
  • License: apache-2.0
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct (strong performer on multilingual and code-mixed tasks)

Uses

  • Preprocessing noisy Hinglish text from social media, chat applications, and user-generated content for downstream NLP tasks (sentiment analysis, search, summarization).

  • Normalization in multilingual pipelines, chatbots, and content moderation systems that handle informal Indian-language text.

  • Example inference prompt:

Normalize the following noisy Hinglish sentence into three standardized formats: Standard English, Romanized Hindi, and Devanagari Hindi.
Input: "kal mujhe offis jaana hai bt traffic bhot bura hga bhai"
Output: "Kal mujhe office jana hai but traffic bhot bura hoga bhai"
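The prompt above can be wrapped for single-turn inference as sketched below. This is an illustrative helper, not the authors' inference script: the special tokens follow the standard Llama 3.1 chat template as an assumption, and in practice `tokenizer.apply_chat_template` from Hugging Face Transformers is the safer way to build the string.

```python
# Build the TN inference prompt shown above in Llama-3.1 chat format.
# The special tokens are the standard Llama 3.1 template (assumed here);
# prefer tokenizer.apply_chat_template in real code.

INSTRUCTION = (
    "Normalize the following noisy Hinglish sentence into three "
    "standardized formats: Standard English, Romanized Hindi, and "
    "Devanagari Hindi."
)

def build_prompt(noisy_sentence: str) -> str:
    """Return a single-turn, Llama-3.1-style prompt for TN inference."""
    user_msg = f'{INSTRUCTION}\nInput: "{noisy_sentence}"'
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_prompt("kal mujhe offis jaana hai bt traffic bhot bura hga bhai")
```

The resulting string can be tokenized and passed to the model's `generate` method as usual.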

Training Details

Training Data

COMI-LINGUA dataset, Text Normalization (TN) subset: sentence-level normalization of noisy, informal, code-mixed Hinglish into standardized forms.

Training Hyperparameters

  • Regime: PEFT LoRA (rank=32, alpha=64, dropout=0.1)
  • Epochs: 3
  • Batch: 4 (accum=8, effective=32)
  • LR: 2e-4 (cosine + warmup=0.1)
  • Weight decay: 0.01
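The hyperparameters above can be collected as sketched below. Field names mirror Hugging Face PEFT `LoraConfig` and `TrainingArguments` conventions, but the dicts are an illustrative summary, not the authors' training configuration files.

```python
# Hyperparameters as listed above, gathered into plain dicts.
# Key names follow Hugging Face PEFT / TrainingArguments conventions
# (an assumption); this is a sketch, not the original training script.

lora_config = {
    "r": 32,             # LoRA rank
    "lora_alpha": 64,    # scaling factor (alpha = 2 * rank)
    "lora_dropout": 0.1,
}

training_args = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
}

# Effective batch size = per-device batch * accumulation steps
effective_batch = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)
```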

Evaluation

Testing Data

COMI-LINGUA TN test set (~5K instances).

Results

Summary: Achieves substantial gains over prompting baselines, with particularly strong performance on Devanagari output due to script standardization. Outperforms zero-shot/one-shot prompting of LLaMA-3.1-8B-Instruct and approaches or exceeds several closed-weight LLMs on noisy Hinglish normalization.
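The card does not name its evaluation metrics. As an illustration only, character error rate (CER) against a reference normalization is a common choice for this kind of task; a minimal pure-Python sketch:

```python
# Hypothetical evaluation sketch: character error rate (CER) between a
# model output and a reference normalization. The card does not specify
# the metrics actually used; this is illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Example: model output vs. a hypothetical reference normalization.
score = cer(
    "Kal mujhe office jana hai but traffic bhot bura hoga bhai",
    "Kal mujhe office jaana hai but traffic bahut bura hoga bhai",
)
```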

Bias, Risks, and Limitations

This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.

May occasionally over-normalize culturally specific slang or fail on highly domain-specific informal terms not well-represented in training data.

Model Card Contact

Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in

Citation

If you use this model, please cite the following work:

@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee and
      Beniwal, Himanshu and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}