---
language:
- te
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- telugu
- asr
- speech-recognition
- indian-languages
- ai4bharat
base_model: openai/whisper-small
datasets:
- ai4bharat/Kathbath
metrics:
- wer
- cer
model-index:
- name: vanshnawander/whisper-small-telugu
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Shrutilipi (Telugu)
      type: ai4bharat/Shrutilipi
    metrics:
    - type: wer
      value: 69.7
      name: Word Error Rate
    - type: cer
      value: 28.9
      name: Character Error Rate
---

# vanshnawander/whisper-small-telugu

This is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Telugu automatic speech recognition (ASR).

## Model Description

- **Base Model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
- **Language:** Telugu (te)
- **Task:** Automatic Speech Recognition (transcribe)
- **Training Data:** [ai4bharat/Kathbath](https://huggingface.co/datasets/ai4bharat/Kathbath)
- **Fine-tuning Framework:** Transformers + Custom DALI Pipeline

## Training Details

The model was fine-tuned on the Kathbath Telugu dataset with the following configuration:

- **Epochs:** 3
- **Batch Size:** 16 (effective ~96 with gradient accumulation)
- **Learning Rate:** 1e-5
- **Mixed Precision:** FP16
- **Gradient Checkpointing:** Enabled

## Evaluation Results

Evaluated on the [Shrutilipi benchmark](https://huggingface.co/datasets/ai4bharat/Shrutilipi), a large-scale ASR dataset for Indian languages.

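For readers unfamiliar with the two metrics: WER and CER are edit-distance rates, i.e. the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. This card does not state which tool or text normalization produced the reported scores, so the following is only a minimal pure-Python sketch of the standard definitions. The function names are hypothetical, and it assumes whitespace tokenization for words and Unicode code points (spaces included) for characters; a Telugu grapheme can span several code points, so other tools may count characters differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra token in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction; assumes whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction; counts Unicode code points, spaces included."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (b -> x) and one deletion (d) over 4 reference words.
print(wer("a b c d", "a x c"))  # 0.5
# One substituted character over 4 reference characters.
print(cer("abcd", "abxd"))      # 0.25
```

Multiply by 100 to get percentages like those in the table. Because insertions count as errors, both rates can exceed 100%.
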

| Model | WER | CER | Improvement |
|-------|-----|-----|-------------|
| Base (openai/whisper-small) | N/A | N/A | - |
| **This Model** | **69.7%** | **28.9%** | N/A |

Baseline scores for the base model are not available, so no improvement figure is reported.

## Usage

### Basic Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("vanshnawander/whisper-small-telugu")
model = WhisperForConditionalGeneration.from_pretrained("vanshnawander/whisper-small-telugu")

# Load audio, resampled to the 16 kHz rate Whisper expects
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
generated_ids = model.generate(input_features, language="te", task="transcribe")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Using Pipeline

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="vanshnawander/whisper-small-telugu",
    chunk_length_s=30,  # split long recordings into 30-second chunks
)

result = pipe("audio.wav", generate_kwargs={"language": "te", "task": "transcribe"})
print(result["text"])
```

## Limitations

- Optimized for Telugu speech; may not perform well on other languages
- Best performance on clear audio with minimal background noise
- May struggle with very fast speech or heavy code-mixing

## Citation

If you use this model, please cite:

```bibtex
@misc{vanshnawander_whisper_small_telugu,
  author = {Vansh Nawander},
  title = {vanshnawander/whisper-small-telugu},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/vanshnawander/whisper-small-telugu}
}
```

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the base model
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for the Kathbath and Shrutilipi datasets
- [Hugging Face](https://huggingface.co/) for the transformers library