---
base_model:
- openai/whisper-large-v3
language:
- th
license: mit
arxiv: 2601.13044
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Typhoon Whisper Large v3

**Typhoon Whisper Large v3** is a Thai Automatic Speech Recognition (ASR) model fine-tuned from OpenAI Whisper Large v3. It is the accuracy-focused model in the Typhoon ASR family, trained on a large and diverse collection of Thai audio.

- **Paper:** [Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition](https://huggingface.co/papers/2601.13044)
- **Project Page:** [opentyphoon.ai](https://opentyphoon.ai/model/typhoon-asr-realtime)
- **GitHub:** [scb-10x/typhoon-asr](https://github.com/scb-10x/typhoon-asr)

The model was trained on approximately **10 million training samples** (~11,000 hours) of Thai audio, curated and normalized using the Typhoon data pipeline to ensure consistent handling of Thai numbers, repetition markers, and context-dependent ambiguities.

## Model Overview

- **Architecture:** Whisper Large v3 (32 decoder layers, full model)
- **Language:** Thai
- **Dataset:** ~10M training samples of normalized Thai speech (Gigaspeech2, CommonVoice, Internal Curated Public Media)
- **Task:** Automatic Speech Recognition (ASR)
- **License:** MIT (inherited from OpenAI Whisper)

## Performance

Typhoon Whisper Large v3 achieves state-of-the-art accuracy on the Thai speech recognition benchmarks shown below, which plot accuracy against inference speed.

<img src="https://storage.googleapis.com/typhoon-public/assets/typhoon_asr/thai_asr_pareto_frontier.png" alt="Thai ASR Model Performance - Pareto Frontier comparing accuracy vs inference speed">

> **Note:** Lower CER (Character Error Rate) is better. Results are reported on the Gigaspeech2 (Clean/Academic), TVSpeech (Noisy/In-the-wild), and Google Fleurs (Thai) test sets.
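
For reference, CER is commonly computed as the character-level edit distance between the model output and the reference transcript, divided by the reference length. The sketch below is a minimal pure-Python illustration of that definition. The function names `edit_distance` and `cer` are hypothetical, and the benchmark above may apply additional text normalization that is not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One missing character out of ten reference characters
print(cer("สวัสดีครับ", "สวัสดีคับ"))  # 0.1
```

Python strings are sequences of Unicode code points, so Thai vowel and tone marks each count as one character in this sketch.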

## Usage

You can use this model directly with the Hugging Face `transformers` library.

### Installation

```bash
pip install transformers torch accelerate
```

### Example Code

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

model_id = "scb10x/typhoon-whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

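pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # Whisper's decoder holds at most 448 tokens, prompt tokens included,
    # so max_new_tokens must stay below that limit.
    max_new_tokens=440,
    chunk_length_s=30,  # split long audio into 30-second chunks
    batch_size=16,  # number of chunks decoded per batch
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Thai audio file; setting the language skips auto-detection
result = pipe("path_to_audio.wav", generate_kwargs={"language": "thai"})
print(result["text"])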
```
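
Because the pipeline above is created with `return_timestamps=True`, the result also contains a `chunks` list in which each entry has a `timestamp` tuple and a `text` field. The sketch below shows one way to print those segments. The helper name `format_chunks` and the sample values are hypothetical, used only for illustration.

```python
def format_chunks(result: dict) -> list[str]:
    """Turn pipeline output into '[start -> end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time can be None when the final segment is cut off.
        end_label = f"{end:.2f}" if end is not None else "end"
        lines.append(f"[{start:.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hand-written sample in the same shape as the pipeline output
sample = {
    "text": " สวัสดีครับ ขอบคุณครับ",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " สวัสดีครับ"},
        {"timestamp": (1.5, None), "text": " ขอบคุณครับ"},
    ],
}
for line in format_chunks(sample):
    print(line)
```

With a real transcription, pass the `result` returned by `pipe(...)` instead of `sample`.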

## Training Data

The model was trained on approximately **10 million training samples** (~11,000 hours) of Thai audio data, including:

- **Gigaspeech2**: Clean, academic-style speech
- **CommonVoice**: Crowd-sourced diverse speech samples
- **Internal Curated Public Media**: Proprietary datasets curated by Typhoon Team, SCB 10X

All data was normalized using the Typhoon data pipeline to ensure:
- Consistent handling of Thai numbers
- Proper treatment of repetition markers
- Resolution of context-dependent ambiguities
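
The Typhoon data pipeline itself is not described in this card. As a rough illustration of what a number-related normalization rule can look like, the sketch below converts Thai digits to ASCII digits and collapses whitespace. This is an assumption for illustration only, not the actual Typhoon pipeline, and the name `normalize_digits` is hypothetical. A similar step can be useful when comparing transcripts against references that mix digit scripts.

```python
import re

# Map Thai digits (U+0E50 to U+0E59) to ASCII digits 0-9.
THAI_DIGIT_MAP = {0x0E50 + i: str(i) for i in range(10)}


def normalize_digits(text: str) -> str:
    """Illustrative only: unify digit script and collapse whitespace."""
    text = text.translate(THAI_DIGIT_MAP)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_digits("ปี  ๒๕๖๗"))  # ปี 2567
```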

## Model Architecture

Typhoon Whisper Large v3 is based on **OpenAI Whisper Large v3**, the full-scale model featuring:
- **Full architecture:** 32 decoder layers for maximum representational capacity
- **Accuracy-focused:** Fine-tuned to maximize Thai transcription quality
- **Robust performance:** Handles diverse acoustic conditions and speaking styles

This is the flagship model in the Typhoon ASR family, prioritizing accuracy over inference speed.

## Limitations

- The model is optimized specifically for **Thai language** speech recognition
- Performance may vary on dialects or accents not well-represented in the training data
- Requires more computational resources compared to lighter variants (e.g., Typhoon Whisper Turbo)
- Best suited for offline transcription where accuracy is prioritized over latency

## License

This model is released under the **MIT License**, inherited from OpenAI Whisper.

## Citation

If you use this model in your research or application, please cite our technical report:

[![arXiv](https://img.shields.io/badge/arXiv-2601.13044-b31b1b.svg)](https://arxiv.org/abs/2601.13044)

```bibtex
@misc{warit2026typhoonasr,
      title={Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition}, 
      author={Warit Sirichotedumrong and Adisai Na-Thalang and Potsawee Manakul and Pittawat Taveekitworachai and Sittipong Sripaisarnmongkol and Kunat Pipatanakul},
      year={2026},
      eprint={2601.13044},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.13044}, 
}
```

## Contact

For questions or feedback, please visit [our website](http://opentyphoon.ai) or open an issue on the model's repository.

---

**Developed by:** Typhoon Team, SCB 10X  
**Model Card Version:** 1.0  
**Last Updated:** January 2026