Whisper for Arabic–English speech with Indian accent

Hi all,

I’m using Whisper to transcribe short audio messages that are quite challenging:

  • Speakers often mix Arabic and English within the same sentence.

  • Many speakers have an Indian accent (both for Arabic and English).

  • Speech is fast, and the recordings are sometimes noisy (background sounds, imperfect mic).

I’ve already tried some straightforward improvements (basic denoising, VAD, tuning decoding parameters, using larger Whisper models), but the transcription quality is still not good enough, especially with Indian accents.

I’m looking for:

  • Practical tips that have worked for you in similar conditions (pre‑processing, decoding settings, post‑processing, etc.).

  • Any existing fine‑tuned Whisper models for Arabic–English code‑switching with Indian accents.

  • Guidance or references on how to fine‑tune Whisper (or a similar ASR model) specifically for this kind of data.

Thanks in advance for any pointers or examples!


Accent and multilingual speech can definitely challenge models. Has anyone tried fine-tuning Whisper variants with mixed English-Arabic Indian-accent data for better accuracy?


This setup might be close to the worst-case scenario for a vanilla Whisper model.


Why this setting is hard for “vanilla” Whisper

Code-switching breaks the model’s strongest assumptions

Whisper-style models are trained to produce one coherent transcript from a window of audio. In code-switched speech, the model must decide (often multiple times per second) whether the next token should come from Arabic script or Latin script, while also handling shared phonetics and loanwords. When the evidence is weak (fast speech, noise, accent), the decoder tends to “commit” to one language and then keep sampling from that language’s token distribution, which can spill across the true switch boundary.

Indian-accented speech increases phonetic ambiguity

Accents affect:

  • vowel/consonant realizations,
  • stress timing,
  • coarticulation patterns,
  • and segment durations.

For short, noisy messages, these shifts are enough to push the model into “low-evidence” decoding where it starts relying more on its language model prior than the acoustic signal. That’s when you see substitutions, omissions, or fluent but wrong text.

Auto language detection is fragile on short/noisy audio

In faster-whisper, language detection is performed using the first ~30 seconds if you don’t set language=... explicitly. That is a known source of wrong-language outputs if the beginning includes silence/noise or code-switching. (GitHub)
This interacts badly with your setting: once the model “picks” the wrong language early, later chunks are biased.

Repetition/hallucination loops are a known failure mode with silence/gaps

Two community-validated mitigations for “stuck repeating / hallucinating after a gap” are:

  • split the audio with VAD,
  • set condition_on_previous_text=False. (GitHub)

This matters for voice messages because they often have pauses and trailing non-speech.

Practical pipeline changes that usually move the needle

1) Make segmentation your primary quality lever

For short messages, segmentation quality often dominates model size.

Target behavior: feed the decoder 1–8s windows that are “mostly speech”, with small padding and minimal trailing non-speech.

Recommended segmentation recipe

  • VAD to find speech islands
  • add padding (e.g., 150–300ms)
  • add overlap (e.g., 100–250ms) to protect word boundaries
  • explicit tail trimming after VAD (energy/RMS-based) to remove long quiet endings that trigger hallucinations
  • cap maximum segment length (e.g., 8–12s); long segments increase drift and LID errors
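A minimal sketch of the padding/overlap/cap logic above, assuming a VAD stage has already produced `(start, end)` speech islands in seconds (the default thresholds are illustrative starting points, not tuned values):

```python
def build_segments(islands, total_dur, pad=0.2, overlap=0.15, max_len=10.0):
    """Turn VAD speech islands into decoder-ready windows.

    islands: sorted list of (start, end) tuples in seconds.
    Adds padding, splits long islands, and backs up by `overlap`
    at each split point to protect word boundaries.
    """
    segments = []
    for start, end in islands:
        # pad, clamped to the audio bounds
        s = max(0.0, start - pad)
        e = min(total_dur, end + pad)
        # split long islands: long segments increase drift and LID errors
        while e - s > max_len:
            segments.append((s, s + max_len))
            s = s + max_len - overlap  # overlap protects word boundaries
        segments.append((s, e))
    return segments

# Example: two islands, the second long enough to be split
segs = build_segments([(0.5, 3.0), (5.0, 17.0)], total_dur=20.0)
```

Tail trimming (energy/RMS-based) would run before this step, shrinking each island's `end` so quiet endings never reach the decoder.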

This aligns with common “hallucination fix” guidance: VAD slicing plus disabling conditioning reduces loops when the model can’t find evidence in the current window. (GitHub)

2) Default to “guardrails ON” for production decoding

A simple but important rule: don’t disable the thresholds unless you’re deliberately building a repro.

When thresholds are enabled, Whisper-style decoding has mechanisms to suppress output during no-speech/low-confidence regions (implemented in the reference transcribe logic). (GitHub)

Practical defaults for voice messages

  • condition_on_previous_text=False (prevents “carryover text” into gaps) (GitHub)
  • keep no_speech_threshold, log_prob_threshold, compression_ratio_threshold enabled (don’t set them to None)
  • use temperature=0.0 for determinism while tuning; add temperature fallback only if needed
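The defaults above, written out as a kwargs dict (a sketch: the parameter names follow faster-whisper's `WhisperModel.transcribe`, and the threshold values shown are the library defaults, kept enabled rather than set to None):

```python
# Guardrail defaults for short voice messages (starting points, not tuned).
DECODE_KWARGS = {
    "language": "ar",                     # force a language; don't rely on auto-LID
    "task": "transcribe",
    "temperature": 0.0,                   # deterministic while tuning
    "condition_on_previous_text": False,  # no carryover text into gaps
    "vad_filter": True,                   # slice out non-speech before decoding
    # Keep the thresholds enabled (never None) so low-confidence /
    # no-speech regions get suppressed:
    "no_speech_threshold": 0.6,
    "log_prob_threshold": -1.0,
    "compression_ratio_threshold": 2.4,
}

# Usage sketch (requires faster-whisper installed):
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, info = model.transcribe("message.ogg", **DECODE_KWARGS)
```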

3) Constrain language behavior by design (don’t rely on auto-LID)

Auto-LID being computed on the first ~30s is a known limitation; multiple issues report wrong-language outputs under auto-detection. (GitHub)
There’s also an open request for “limit detection to a subset of languages,” which does not exist as a first-class feature in faster-whisper today. (GitHub)

Workarounds that actually help

  • If you know it’s always Arabic+English, use a two-pass strategy:

    1. attempt language="ar" decode

    2. attempt language="en" decode

    3. pick the better result using a small heuristic:

      • script sanity (Arabic chars ratio vs Latin ratio),
      • repetition score,
      • average logprob proxy (if available),
      • “text produced in low-energy region” penalty.

This directly addresses “unstable language ID outputs” without needing new model features.
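A toy version of the pick heuristic, using only the script-sanity ratio and a repetition score (the multiplicative scoring and the regex character classes are illustrative assumptions; a real pipeline would also fold in the logprob proxy and low-energy penalty):

```python
import re

ARABIC = re.compile(r"[\u0600-\u06FF]")
LATIN = re.compile(r"[A-Za-z]")

def script_ratio(text, lang):
    """Fraction of letters that match the script expected for the forced language."""
    ar = len(ARABIC.findall(text))
    la = len(LATIN.findall(text))
    total = ar + la
    if total == 0:
        return 0.0
    return (ar if lang == "ar" else la) / total

def repetition_score(text, n=3):
    """Unique word n-gram ratio: 1.0 = no repetition, lower = loopy output."""
    words = text.split()
    if len(words) < n:
        return 1.0
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(grams)) / len(grams)

def pick_transcript(candidates):
    """candidates: list of (lang, text) pairs from forced-language passes."""
    def score(item):
        lang, text = item
        return script_ratio(text, lang) * repetition_score(text)
    return max(candidates, key=score)
```

For example, a forced-English pass that degenerated into a repetition loop scores below a clean forced-Arabic pass, even though its script matches the forced language perfectly.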

4) Post-processing that is code-switch aware

Avoid “English-only cleanup” or “Arabic-only cleanup”; mixed script requires a mixed strategy.

Low-risk post-processing ideas

  • script-aware normalization

    • normalize Arabic punctuation variants (e.g., Arabic comma/Latin comma)
    • normalize tatweel and repeated diacritics only if you see them
  • repetition filters

    • detect repeated bigrams/trigrams over a threshold and either truncate or mark as suspect
  • segment-level confidence flags

    • mark segments suspicious if:

      • very long text produced while energy is low,
      • script doesn’t match forced language pass,
      • high repetition compression-like behavior
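A sketch of the first two ideas (the normalization rules and repetition threshold are examples, not a complete set; whether to unify on Arabic or Latin commas is your convention choice):

```python
import re

TATWEEL = "\u0640"       # Arabic tatweel (kashida) elongation mark
ARABIC_COMMA = "\u060C"  # ،

def normalize_mixed(text):
    """Script-aware cleanup that is safe for mixed Arabic/English text."""
    text = text.replace(TATWEEL, "")          # drop elongation marks
    text = text.replace(ARABIC_COMMA, ",")    # unify comma variants (pick one)
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def is_suspect_repetition(text, n=2, max_repeats=3):
    """Flag text where any word n-gram occurs more than max_repeats times."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False
```

Flagged segments can be truncated, re-decoded, or just surfaced to users as low-confidence rather than silently kept.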

5) If Whisper still struggles: consider an alternate base model as a reference

Two candidates worth testing as “sanity checks”:

  • Meta SeamlessM4T v2: supports Arabic variants (e.g., Modern Standard Arabic, Egyptian, Moroccan) in its published supported language list, and is explicitly evaluated for ASR tasks. (Hugging Face)
    Use case: as a comparison point or fallback for Arabic-heavy segments (not necessarily best at code-switching out of the box).

  • NVIDIA Canary v2: strong multilingual ASR for its supported languages, but public materials emphasize European language coverage; Arabic support is inconsistent across deployments per community reports. (Hugging Face)
    Use case: less compelling if Arabic is core.


Existing fine-tuned models you can start from

These are not a perfect match (Arabic↔English + Indian accent), but they’re useful starting points.

Arabic–English code-switch Whisper models

  • MohamedRashad/Arabic-Whisper-CodeSwitching-Edition
    Fine-tuned on an Arabic-English code-switch dataset; explicitly intended for Arabic speech with embedded English words. License shown as GPL-3.0 (often problematic for commercial use). (Hugging Face)

  • azeem23/whisper-small-codeswitching-ArabicEnglish
    A smaller Whisper variant fine-tuned for Arabic-English code-switching, based on the same dataset. (Hugging Face)

Indian-accent English Whisper model

  • Tejveer12/Indian-Accent-English-Whisper-Finetuned
    Fine-tuned on the Indian-accent English dataset (WillHeld/india_accent_cv). (Hugging Face)
    The model repository indicates an MIT license in its metadata/commit history. (Hugging Face)

How to use these in practice

  • Use the Indian-accent model as an English-pass decoder for English-dominant segments.
  • Use a code-switch model as the Arabic-pass decoder (especially for Arabic-dominant segments with English insertions).
  • Or: use these as initialization targets for your own adapter fine-tune (next section).

Fine-tuning Whisper for your exact data (practical recipe)

1) Use adapter-style fine-tuning (LoRA) first

Full fine-tuning of large Whisper checkpoints is expensive and easy to overfit. For accent + code-switch adaptation, LoRA usually gets you most of the gain with lower risk.

The Hugging Face PEFT guide shows an int8 + LoRA training approach for Whisper ASR specifically. (Hugging Face)

Why LoRA helps here

  • You’re adapting pronunciation + boundary behavior, not learning a new language.
  • You want to preserve general robustness while nudging the model toward your accent and code-switch distribution.
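A minimal LoRA setup sketch in the style of the HF PEFT Whisper guide (the rank/alpha/dropout values and the choice of attention projections as target modules follow that guide; treat them as starting points, and note this requires transformers + peft installed):

```python
# Sketch: adapter-style fine-tuning setup, not a full training script.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora_config = LoraConfig(
    r=32,                                 # adapter rank
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```

The frozen base preserves general robustness; only the low-rank adapters move toward your accent and code-switch distribution.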

2) Build a training mix that matches your deployment distribution

Aim for three buckets:

  1. In-domain: your actual voice messages (even 10–50 hours helps if transcripts are consistent)
  2. Indian-accent English: augment English segments with accent data (e.g., WillHeld/india_accent_cv) (Hugging Face)
  3. Arabic–English code-switch: add code-switch examples (e.g., MohamedRashad dataset/models; also consider Mixat for methodology even if dialect differs) (Hugging Face)

If you lack real Arabic↔English code-switch hours, synthetic code-switch generation is an active research direction (phrase-level mixing) and can be used to bootstrap. (isca-archive.org)
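A toy illustration of phrase-level mixing for bootstrapping (the phrase table here is a made-up stand-in; real pipelines derive swaps from aligned parallel or translated data):

```python
import random

# Hypothetical Arabic -> English phrase table (stand-in for real alignments).
PHRASE_TABLE = {
    "الاجتماع": "the meeting",
    "غدا": "tomorrow",
}

def synth_code_switch(arabic_sentence, phrase_table, p=0.5, rng=random):
    """Swap some Arabic phrases for their English equivalents with probability p."""
    out = []
    for token in arabic_sentence.split():
        if token in phrase_table and rng.random() < p:
            out.append(phrase_table[token])
        else:
            out.append(token)
    return " ".join(out)

rng = random.Random(0)
mixed = synth_code_switch("سنحضر الاجتماع غدا", PHRASE_TABLE, p=1.0, rng=rng)
# with p=1.0, every phrase found in the table is swapped
```

Pair each synthetic text with TTS audio or spliced real audio to get training pairs; quality control matters more than volume here.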

3) Keep transcript conventions strict and stable

For code-switch, consistency matters more than perfection:

  • keep Arabic in Arabic script and English in Latin script
  • avoid random transliterations
  • normalize punctuation and casing rules consistently across the dataset

4) Training choices that matter most for your case

  • Start from a multilingual checkpoint (e.g., Whisper small/medium/large-v3 depending on budget)

  • Use task="transcribe" (not translate)

  • Ensure audio is standardized to 16kHz mono

  • Filter or downweight:

    • clips with extremely low SNR,
    • clips with unreliable transcripts,
    • clips with long non-speech tails (or trim them)

5) Evaluation: don’t rely on one WER number

Use at least:

  • overall WER
  • English-only WER on English spans
  • Arabic-only WER/CER on Arabic spans
  • a “switch-boundary” check (simple proxy): count how often the script flips in the right neighborhood of known switch points (even a heuristic boundary test catches regressions quickly)
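A sketch of the script-span split that enables the per-language metrics and the boundary proxy above (classification here is purely character-class based, which is an approximation; tokens with no letters fall into 'other'):

```python
import re

def script_of(token):
    """Classify a token as 'ar', 'en', or 'other' by dominant character class."""
    ar = len(re.findall(r"[\u0600-\u06FF]", token))
    la = len(re.findall(r"[A-Za-z]", token))
    if ar > la:
        return "ar"
    if la > ar:
        return "en"
    return "other"

def script_spans(text):
    """Group consecutive same-script tokens into (script, words) spans."""
    spans = []
    for token in text.split():
        s = script_of(token)
        if spans and spans[-1][0] == s:
            spans[-1][1].append(token)
        else:
            spans.append((s, [token]))
    return spans

def switch_count(text):
    """Number of script flips -- a cheap switch-boundary proxy."""
    spans = script_spans(text)
    return max(len(spans) - 1, 0)
```

Run WER on the 'en' spans, WER/CER on the 'ar' spans, and compare `switch_count` between reference and hypothesis to catch boundary regressions.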

High-quality references to follow end-to-end

  • Hugging Face blog: “Fine-Tune Whisper For Multilingual ASR with Transformers” (step-by-step). (Hugging Face)
  • PEFT int8 + LoRA ASR guide for Whisper (T4-friendly training approach). (Hugging Face)
  • Whisper hallucination mitigation discussion: VAD slicing + condition_on_previous_text=False. (GitHub)
  • Code-switch dataset methodology reference: Mixat paper (how they build and analyze code-mixed Arabic/English speech). (ACL Anthology)
  • faster-whisper language detection limitations and wrong-language reports. (GitHub)

A concrete “starting plan” for your production pipeline

  1. Segment aggressively (VAD + pad + overlap + explicit tail trim) before decoding. (GitHub)

  2. Decode with guardrails on and condition_on_previous_text=False by default for voice messages. (GitHub)

  3. Two-pass language strategy per segment:

    • run forced Arabic decode, forced English decode
    • choose output by script sanity + repetition penalty (+ score proxy if available)
  4. Fallback policy: if output is suspicious (wrong script, repetition, text in low energy), re-decode with stricter thresholds and/or shorter segment.

  5. Fine-tune via LoRA using your in-domain audio + Indian-accent English + Arabic-English code-switch data. (Hugging Face)
