---
language:
- te
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- telugu
- asr
- speech-recognition
- indian-languages
- ai4bharat
base_model: openai/whisper-small
datasets:
- ai4bharat/Kathbath
metrics:
- wer
- cer
model-index:
- name: vanshnawander/whisper-small-telugu
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Shrutilipi (Telugu)
      type: ai4bharat/Shrutilipi
    metrics:
    - type: wer
      value: 69.7
      name: Word Error Rate
    - type: cer
      value: 28.9
      name: Character Error Rate
---

# vanshnawander/whisper-small-telugu

This is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Telugu automatic speech recognition (ASR).

## Model Description

- **Base Model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
- **Language:** Telugu (te)
- **Task:** Automatic Speech Recognition (transcribe)
- **Training Data:** [ai4bharat/Kathbath](https://huggingface.co/datasets/ai4bharat/Kathbath)
- **Fine-tuning Framework:** Transformers + Custom DALI Pipeline

## Training Details

The model was fine-tuned on the Kathbath Telugu dataset with the following configuration:
- **Epochs:** 3
- **Batch Size:** 16 (effective ~96 with gradient accumulation)
- **Learning Rate:** 1e-5
- **Mixed Precision:** FP16
- **Gradient Checkpointing:** Enabled
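
The relationship between the per-step batch size and the effective batch size listed above can be sketched as follows. This is an illustrative reconstruction, not the actual training script: the single-device setup and the resulting accumulation step count are assumptions, since the card only states "16" and "~96". The dictionary keys mirror argument names of `Seq2SeqTrainingArguments` in Transformers.

```python
# Values stated in this model card.
per_device_batch_size = 16
target_effective_batch = 96

# Assumption: training ran on a single device. With more devices,
# fewer accumulation steps would give the same effective batch size.
num_devices = 1

# Effective batch = per-device batch * devices * accumulation steps.
grad_accum_steps = target_effective_batch // (per_device_batch_size * num_devices)
effective_batch = per_device_batch_size * num_devices * grad_accum_steps

# Hypothetical keyword arguments matching the configuration listed above.
training_kwargs = {
    "per_device_train_batch_size": per_device_batch_size,
    "gradient_accumulation_steps": grad_accum_steps,
    "learning_rate": 1e-5,
    "num_train_epochs": 3,
    "fp16": True,
    "gradient_checkpointing": True,
}

print(f"accumulation steps: {grad_accum_steps}, effective batch: {effective_batch}")
```

Under these assumptions the optimizer would update once every 6 forward/backward passes.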

## Evaluation Results

Evaluated on the Telugu portion of the [Shrutilipi benchmark](https://huggingface.co/datasets/ai4bharat/Shrutilipi), a large-scale ASR dataset for Indian languages.

| Model | WER (%) | CER (%) | Improvement |
|-------|---------|---------|-------------|
| Base (openai/whisper-small) | N/A | N/A | - |
| **This Model** | **69.7** | **28.9** | N/A (no baseline reported) |
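
For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are conventionally defined: edit distance divided by reference length, over words and characters respectively. It is a minimal pure-Python illustration, not the evaluation script used for the numbers above. The text normalization used in that evaluation is not documented here, the helper names are hypothetical, and CER is computed over Unicode code points (including spaces), which is an assumption that matters for Telugu because one visible syllable can span several code points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points, spaces included."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative sentence pair (not taken from Shrutilipi).
reference = "నేను ఇంటికి వెళ్తున్నాను"
hypothesis = "నేను ఇంటికి వెళ్ళాను"

print(f"WER: {100 * wer(reference, hypothesis):.1f}%")
print(f"CER: {100 * cer(reference, hypothesis):.1f}%")
```

One wrong word out of three gives a WER of one third, while the CER is lower because most characters of the wrong word still match. The same pattern is visible in the table, where WER is much higher than CER.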

## Usage

### Basic Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("vanshnawander/whisper-small-telugu")
model = WhisperForConditionalGeneration.from_pretrained("vanshnawander/whisper-small-telugu")

# Load audio as mono and resample to 16 kHz, the rate Whisper expects
audio, sr = librosa.load("audio.wav", sr=16000)

# Convert the waveform to log-mel input features
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

# Generate token IDs, forcing Telugu transcription, then decode to text
generated_ids = model.generate(input_features, language="te", task="transcribe")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(transcription)
```

### Using Pipeline

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="vanshnawander/whisper-small-telugu",
    chunk_length_s=30,
)

result = pipe("audio.wav", generate_kwargs={"language": "te", "task": "transcribe"})
print(result["text"])
```

## Limitations

- Optimized for Telugu speech; may not perform well on other languages
- Best performance on clear audio with minimal background noise
- May struggle with very fast speech or heavy code-mixing

## Citation

If you use this model, please cite:

```bibtex
@misc{vanshnawander_whisper_small_telugu,
  author = {Vansh Nawander},
  title = {vanshnawander/whisper-small-telugu},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/vanshnawander/whisper-small-telugu}
}
```

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the base model
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for the Kathbath and Shrutilipi datasets
- [Hugging Face](https://huggingface.co/) for the transformers library