Is Whisper stuck in the worst possible setting for this use case?
Why this setting is hard for “vanilla” Whisper
Code-switching breaks the model’s strongest assumptions
Whisper-style models are trained to produce one coherent transcript from a window of audio. In code-switch speech, the model must decide (often multiple times per second) whether the next token should come from Arabic script or Latin script, while also handling shared phonetics and loanwords. When the evidence is weak (fast speech, noise, accent), the decoder tends to “commit” to one language and then keep sampling from that language’s token distribution, which can spill across the true switch boundary.
Indian-accented speech increases phonetic ambiguity
Accents affect:
- vowel/consonant realizations,
- stress timing,
- coarticulation patterns,
- and segment durations.
For short, noisy messages, these shifts are enough to push the model into “low-evidence” decoding where it starts relying more on its language model prior than the acoustic signal. That’s when you see substitutions, omissions, or fluent but wrong text.
Auto language detection is fragile on short/noisy audio
In faster-whisper, language detection is performed using the first ~30 seconds if you don’t set language=... explicitly. That is a known source of wrong-language outputs if the beginning includes silence/noise or code-switching. (GitHub)
This interacts badly with your setting: once the model “picks” the wrong language early, later chunks are biased.
Repetition/hallucination loops are a known failure mode with silence/gaps
Two community-validated mitigations for “stuck repeating / hallucinating after a gap” are:
- split audio with VAD,
- set condition_on_previous_text=False. (GitHub)
This matters for voice messages because they often have pauses and trailing non-speech.
Practical pipeline changes that usually move the needle
1) Make segmentation your primary quality lever
For short messages, segmentation quality often dominates model size.
Target behavior: feed the decoder 1–8s windows that are “mostly speech”, with small padding and minimal trailing non-speech.
Recommended segmentation recipe
- VAD to find speech islands
- add padding (e.g., 150–300ms)
- add overlap (e.g., 100–250ms) to protect word boundaries
- explicit tail trimming after VAD (energy/RMS-based) to remove long quiet endings that trigger hallucinations
- cap maximum segment length (e.g., 8–12s); long segments increase drift and LID errors
This aligns with common “hallucination fix” guidance: VAD slicing plus disabling conditioning reduces loops when the model can’t find evidence in the current window. (GitHub)
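The recipe above can be sketched end-to-end with a toy RMS-based speech detector standing in for a real VAD (e.g., Silero). The frame size, threshold, padding, and overlap values below are illustrative assumptions, not tuned numbers:

```python
# Toy segmentation sketch: RMS "VAD" + padding + overlap + max-length cap.
# Tail trimming falls out naturally: segments end at the last loud frame
# plus the padding, so long quiet endings never reach the decoder.
SR = 16_000          # samples per second (16 kHz mono)
FRAME = 480          # 30 ms frames
RMS_THRESH = 0.02    # "speech" if frame RMS exceeds this (assumption)

def frame_rms(samples):
    """RMS energy per 30 ms frame."""
    out = []
    for i in range(0, len(samples) - FRAME + 1, FRAME):
        chunk = samples[i:i + FRAME]
        out.append((sum(x * x for x in chunk) / FRAME) ** 0.5)
    return out

def segment(samples, pad_ms=200, overlap_ms=150, max_s=10.0):
    """Return (start, end) sample ranges: padded speech islands,
    merged when they touch, split at max_s with overlap."""
    speech = [r > RMS_THRESH for r in frame_rms(samples)]
    pad = int(SR * pad_ms / 1000)
    ov = int(SR * overlap_ms / 1000)
    max_len = int(SR * max_s)
    islands, start = [], None
    for i, s in enumerate(speech + [False]):   # sentinel closes last island
        if s and start is None:
            start = i * FRAME
        elif not s and start is not None:
            islands.append((max(0, start - pad),
                            min(len(samples), i * FRAME + pad)))
            start = None
    merged = []                                 # merge touching padded ranges
    for a, b in islands:
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    segments = []                               # cap length, keep overlap
    for a, b in merged:
        while b - a > max_len:
            segments.append((a, a + max_len))
            a = a + max_len - ov                # step back to protect words
        segments.append((a, b))
    return segments
```

A real pipeline would swap `frame_rms` for a proper VAD's speech probabilities; the padding/overlap/cap logic stays the same.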
2) Default to “guardrails ON” for production decoding
A simple but important rule: don’t disable the thresholds unless you’re deliberately building a repro.
When thresholds are enabled, Whisper-style decoding has mechanisms to suppress output during no-speech/low-confidence regions (implemented in the reference transcribe logic). (GitHub)
Practical defaults for voice messages
- condition_on_previous_text=False (prevents "carryover text" into gaps) (GitHub)
- keep no_speech_threshold, log_prob_threshold, compression_ratio_threshold enabled (don't set them to None)
- use temperature=0.0 for determinism while tuning; add temperature fallback only if needed
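Put together, these defaults map onto faster-whisper's transcribe parameters roughly like this. The threshold values shown are the library's documented defaults, kept explicit so nobody accidentally sets them to None:

```python
# Production decode defaults for faster-whisper voice-message transcription
# (parameter names from faster_whisper.WhisperModel.transcribe).
DECODE_KWARGS = dict(
    task="transcribe",
    condition_on_previous_text=False,   # prevent carryover into gaps
    temperature=0.0,                    # deterministic while tuning
    no_speech_threshold=0.6,            # guardrails ON (library defaults)
    log_prob_threshold=-1.0,
    compression_ratio_threshold=2.4,
    vad_filter=True,                    # built-in Silero VAD slicing
)

# Usage (assumes faster-whisper is installed and a model is available):
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, info = model.transcribe("voice_message.wav",
#                                   language="ar", **DECODE_KWARGS)
```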
3) Constrain language behavior by design (don’t rely on auto-LID)
Auto-LID being computed on the first ~30s is a known limitation; multiple issues report wrong-language outputs under auto-detection. (GitHub)
There’s also an open request for “limit detection to a subset of languages,” which does not exist as a first-class feature in faster-whisper today. (GitHub)
Workarounds that actually help
Force the language explicitly per decode pass (language="ar" or language="en") instead of relying on auto-detection, then pick the better output per segment. This directly addresses "unstable language ID outputs" without needing new model features.
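The forced-two-pass workaround can be sketched as below (it reappears in the "starting plan" at the end). The scoring heuristics and weights are assumptions; if your backend exposes per-segment log probabilities, add them to the score:

```python
import re

AR = re.compile(r'[\u0600-\u06FF]')   # Arabic block
LAT = re.compile(r'[A-Za-z]')

def repetition_penalty(text: str) -> float:
    """Penalize looping output: 1 - (unique words / words)."""
    words = text.split()
    return 1 - len(set(words)) / len(words) if words else 0.0

def score(text: str, forced_script_re) -> float:
    """Higher is better. A forced-language decode whose output contains
    no characters of that language's script is suspicious."""
    s = 0.0
    if not forced_script_re.search(text):
        s -= 1.0
    s -= 2.0 * repetition_penalty(text)   # weight is an assumption
    return s

def pick_hypothesis(ar_pass: str, en_pass: str) -> str:
    """Choose between the forced-Arabic and forced-English decodes."""
    return ar_pass if score(ar_pass, AR) >= score(en_pass, LAT) else en_pass
```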
4) Post-processing that is code-switch aware
Avoid “English-only cleanup” or “Arabic-only cleanup”; mixed script requires a mixed strategy.
Low-risk post-processing ideas
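For example, a script-aware token normalizer applies each language's cleanup only to tokens in that language's script. The specific rules below (tatweel stripping, whitespace cleanup) are illustrative assumptions:

```python
import re

ARABIC = re.compile(r'[\u0600-\u06FF]')
TATWEEL = '\u0640'   # Arabic elongation character, safe to drop

def clean_token(tok: str) -> str:
    """Apply per-script rules so Arabic cleanup never touches English
    tokens and vice versa."""
    if ARABIC.search(tok):
        return tok.replace(TATWEEL, '')   # Arabic-only rule
    return tok                            # Latin tokens left untouched
    # (English-side rules, e.g. stutter de-duplication, would go here)

def clean_line(line: str) -> str:
    return ' '.join(clean_token(t) for t in line.split())
```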
5) If Whisper still struggles: consider an alternate base model as a reference
Two candidates worth testing as “sanity checks”:
- Meta SeamlessM4T v2: supports Arabic variants (e.g., Modern Standard Arabic, Egyptian, Moroccan) in its published supported-language list, and is explicitly evaluated on ASR tasks. (Hugging Face)
Use case: as a comparison point or fallback for Arabic-heavy segments (not necessarily best at code-switching out of the box).
- NVIDIA Canary v2: strong multilingual ASR for its supported languages, but public materials emphasize European-language coverage; Arabic support is inconsistent across deployments per community reports. (Hugging Face)
Use case: less compelling if Arabic is core.
Existing fine-tuned models you can start from
These are not a perfect match (Arabic↔English + Indian accent), but they’re useful starting points.
Arabic–English code-switch Whisper models
- MohamedRashad/Arabic-Whisper-CodeSwitching-Edition: fine-tuned on an Arabic-English code-switch dataset; explicitly intended for Arabic speech with embedded English words. License shown as GPL-3.0 (often problematic for commercial use). (Hugging Face)
- azeem23/whisper-small-codeswitching-ArabicEnglish: a smaller Whisper variant fine-tuned for Arabic-English code-switching, based on the same dataset. (Hugging Face)
Indian-accent English Whisper model
- Tejveer12/Indian-Accent-English-Whisper-Finetuned: fine-tuned on an Indian-accent English dataset (WillHeld/india_accent_cv); the model repository indicates an MIT license in its metadata/commit history. (Hugging Face)
How to use these in practice
- Use the Indian-accent model as an English-pass decoder for English-dominant segments.
- Use a code-switch model as the Arabic-pass decoder (especially for Arabic-dominant segments with English insertions).
- Or: use these as initialization targets for your own adapter fine-tune (next section).
Fine-tuning Whisper for your exact data (practical recipe)
1) Use adapter-style fine-tuning (LoRA) first
Full fine-tuning of large Whisper checkpoints is expensive and easy to overfit. For accent + code-switch adaptation, LoRA usually gets you most of the gain with lower risk.
The Hugging Face PEFT guide shows an int8 + LoRA training approach for Whisper ASR specifically. (Hugging Face)
Why LoRA helps here
- You’re adapting pronunciation + boundary behavior, not learning a new language.
- You want to preserve general robustness while nudging the model toward your accent and code-switch distribution.
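A minimal sketch of the adapter setup, following the PEFT guide's int8 + LoRA pattern. The rank/alpha values and target modules below are assumptions to tune, not recommendations from the guide:

```python
# Hypothetical LoRA hyperparameters for Whisper adaptation; all values
# are starting-point assumptions.
LORA_HP = dict(
    r=32,                                  # adapter rank
    lora_alpha=64,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
)

# With peft + transformers installed:
# from peft import LoraConfig, get_peft_model
# from transformers import WhisperForConditionalGeneration
# model = WhisperForConditionalGeneration.from_pretrained(
#     "openai/whisper-small", load_in_8bit=True)
# model = get_peft_model(model, LoraConfig(**LORA_HP))
# model.print_trainable_parameters()   # sanity check: tiny trainable share
```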
2) Build a training mix that matches your deployment distribution
Aim for three buckets:
- In-domain: your actual voice messages (even 10–50 hours helps if transcripts are consistent)
- Indian-accent English: augment English segments with accent data (e.g., WillHeld/india_accent_cv) (Hugging Face)
- Arabic–English code-switch: add code-switch examples (e.g., MohamedRashad dataset/models; also consider Mixat for methodology even if dialect differs) (Hugging Face)
If you lack real Arabic↔English code-switch hours, synthetic code-switch generation is an active research direction (phrase-level mixing) and can be used to bootstrap. (isca-archive.org)
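The mixing mechanics of phrase-level synthesis can be illustrated like this. Real systems align and substitute translated phrases; this toy version only splices an English phrase into an Arabic carrier sentence, and the example data is made up:

```python
import random

def mix_phrase(ar_words, en_phrase, rng):
    """Insert an English phrase at a random word boundary inside an
    Arabic word list, producing a synthetic code-switched sentence."""
    cut = rng.randrange(1, len(ar_words))   # never sentence-initial
    return ' '.join(ar_words[:cut] + en_phrase.split() + ar_words[cut:])

rng = random.Random(0)                      # seeded for reproducibility
sent = mix_phrase("أرسل لي الملف غدا".split(), "by email", rng)
```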
3) Keep transcript conventions strict and stable
For code-switch, consistency matters more than perfection:
- keep Arabic in Arabic script and English in Latin script
- avoid random transliterations
- normalize punctuation and casing rules consistently across the dataset
4) Training choices that matter most for your case
- Start from a multilingual checkpoint (e.g., Whisper small/medium/large-v3 depending on budget)
- Use task="transcribe" (not translate)
- Ensure audio is standardized to 16 kHz mono
- Filter or downweight:
  - clips with extremely low SNR,
  - clips with unreliable transcripts,
  - clips with long non-speech tails (or trim them)
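For the 16 kHz mono requirement, the conversion command can be generated per clip; the flags below are standard ffmpeg options and the paths are placeholders:

```python
# Sketch: build the ffmpeg command that normalizes any incoming clip
# to 16 kHz mono WAV before training or decoding.
def ffmpeg_norm_cmd(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ac", "1",        # mono
        "-ar", "16000",    # 16 kHz sample rate
        dst,
    ]

# Run with e.g. subprocess.run(ffmpeg_norm_cmd("msg.ogg", "msg.wav"), check=True)
```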
5) Evaluation: don’t rely on one WER number
Use at least:
- overall WER
- English-only WER on English spans
- Arabic-only WER/CER on Arabic spans
- a “switch-boundary” check (simple proxy): count how often the script flips in the right neighborhood of known switch points (even a heuristic boundary test catches regressions quickly)
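The switch-boundary proxy can be as simple as comparing where the token script flips in the hypothesis versus the reference. The tolerance and scoring below are assumptions:

```python
import re

ARABIC = re.compile(r'[\u0600-\u06FF]')

def flip_points(tokens):
    """Indices where the token script changes (Arabic <-> Latin)."""
    scripts = ['ar' if ARABIC.search(t) else 'en' for t in tokens]
    return [i for i in range(1, len(scripts)) if scripts[i] != scripts[i - 1]]

def boundary_recall(ref: str, hyp: str, tol: int = 1) -> float:
    """Fraction of reference switch points matched by a hypothesis
    switch point within +/- tol token positions."""
    rf, hf = flip_points(ref.split()), flip_points(hyp.split())
    if not rf:
        return 1.0
    hit = sum(any(abs(r - h) <= tol for h in hf) for r in rf)
    return hit / len(rf)
```

Tracking this per release catches "model stopped switching scripts" regressions that overall WER can hide.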
High-quality references to follow end-to-end
- Hugging Face blog: “Fine-Tune Whisper For Multilingual ASR with Transformers” (step-by-step). (Hugging Face)
- PEFT int8 + LoRA ASR guide for Whisper (T4-friendly training approach). (Hugging Face)
- Whisper hallucination mitigation discussion: VAD slicing + condition_on_previous_text=False. (GitHub)
- Code-switch dataset methodology reference: Mixat paper (how they build and analyze code-mixed Arabic/English speech). (ACL Anthology)
- faster-whisper language detection limitations and wrong-language reports. (GitHub)
A concrete “starting plan” for your production pipeline
- Segment aggressively (VAD + pad + overlap + explicit tail trim) before decoding. (GitHub)
- Decode with guardrails on and condition_on_previous_text=False by default for voice messages. (GitHub)
- Two-pass language strategy per segment:
  - run a forced Arabic decode and a forced English decode
  - choose the output by script sanity + repetition penalty (+ a score proxy if available)
- Fallback policy: if the output is suspicious (wrong script, repetition, text in low-energy regions), re-decode with stricter thresholds and/or shorter segments.
- Fine-tune via LoRA using your in-domain audio + Indian-accent English + Arabic-English code-switch data. (Hugging Face)