It barely looks like a bug, but it’s got some pretty peculiar quirks…
What you’re calling (and why “default transcribe” isn’t behaving)
https://router.huggingface.co/hf-inference/models/&lt;model&gt; is the Inference Providers router, and hf-inference is Hugging Face’s serverless provider (“HF Inference”), which Hugging Face notes is the service formerly called the serverless “Inference API”. (Hugging Face)
That matters because there are two different layers:
1) Whisper in Transformers (local / your own endpoint code)
Whisper exposes Whisper-specific generation controls like:
- task: "transcribe" vs "translate"
- language: "en", "english", etc.
These are documented in the Whisper model docs and model cards. (Hugging Face)
2) HF Inference Providers “Automatic Speech Recognition” HTTP schema (router endpoint)
The router endpoint’s public ASR schema does not include Whisper-specific task or language. It only allows:
- inputs (base64 audio)
- parameters.return_timestamps
- parameters.generation_parameters (generic decoding knobs like temperature, do_sample, top_p, num_beams, etc.) (Hugging Face)
So you’re seeing a mismatch: Whisper supports task/language, but this particular serverless ASR interface does not expose them.
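As a rough sketch of the accepted payload shape (field names from the documented ASR schema; the values here are placeholders), note there is simply no slot where task or language could go:

{
  "inputs": "<base64-encoded audio>",
  "parameters": {
    "return_timestamps": false,
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0
    }
  }
}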
Why your attempts failed
1) Headers like Accept-Language, task, language don’t control Whisper here
They are not part of the documented ASR request. The router only documents auth + payload fields; it does not define Whisper controls via headers. (Hugging Face)
2) Your JSON failed because task isn’t a valid ASR parameter
Your payload:
{"parameters": {"task": "transcribe"}}
fails because the ASR schema does not list task (or language) under parameters. (Hugging Face)
This is also consistent with other users asking how to set language/task on the inference API and not finding a supported format. (Hugging Face)
Why you get “translation” and “random languages”
There are two common, separate phenomena that can look like “translate”:
A) The serving backend can effectively behave like task="translate"
Multiple users have reported that the Inference API started returning outputs consistent with translation-to-English behavior, and the local/Transformers workaround was explicitly setting generate_kwargs={"task":"transcribe"}. (Hugging Face)
On serverless HF Inference, you cannot reliably apply that workaround because the router ASR schema doesn’t allow Whisper’s task.
B) Language detection instability (often triggered by input audio / decoding)
If you see output in “random languages”, that is frequently language ID + decoding instability, not “translation mode” (Whisper’s translate mode is specifically “translate to English”).
This gets worse when:
- audio is short / noisy / silence-heavy
- the codec/container decode is brittle (mobile-recorded m4a variants are common culprits)
- decoding is non-deterministic (sampling) rather than greedy
Serverless HF Inference does let you reduce decoding randomness via generation_parameters (see below). (Hugging Face)
What you can and cannot fix on router.huggingface.co/hf-inference
You cannot guarantee “always transcribe” (task control) with this serverless API today
Because the published ASR schema does not accept Whisper’s task/language, there is no supported way to force <|transcribe|> vs <|translate|> or to pin language at the API level. (Hugging Face)
You can improve stability and reduce “random language” outputs
You can:
- switch to the JSON+base64 form (so you can pass generation_parameters)
- make decoding deterministic (do_sample: false, temperature: 0)
- normalize audio (prefer WAV/FLAC 16kHz mono)
Those steps often remove the “random” aspect even if you still can’t force task.
The practical fixes (ranked by “guarantee”)
Fix 1 (only real guarantee): Use a dedicated endpoint / self-host where you can set task="transcribe"
If you need hard guarantees (“always transcribe; never translate; always English”), you need an environment where Whisper’s native controls are exposed—e.g. a dedicated Inference Endpoint (managed deployment) or your own service. Hugging Face positions Inference Endpoints as the “dedicated and autoscaling infrastructure” option. (Hugging Face)
Then you can run Transformers and explicitly do:
generate_kwargs={"task":"transcribe", "language":"english"} (or just task) as documented in the model card/docs. (Hugging Face)
This is exactly the knob people used to fix “it defaults to translate” behavior. (Hugging Face)
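For illustration, here is a minimal Transformers sketch (assumptions: transformers is installed, "sample.wav" is a placeholder audio file, and the checkpoint is the same whisper-large-v3-turbo you are calling through the router) that pins task and language at generation time:

from transformers import pipeline

# Build the ASR pipeline once at startup (local machine or your own endpoint).
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
)

# Force transcription (never translate) and pin the output language.
result = asr(
    "sample.wav",
    generate_kwargs={"task": "transcribe", "language": "english"},
)
print(result["text"])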
Fix 2 (works well if your audio language is fixed): use an English-only checkpoint
If your input is always English, pick a model that is English-only, so it can’t wander into other languages. Whisper has “English-only vs multilingual” variants (the multilingual ones were trained for both ASR and translation). (Hugging Face)
A practical option is Distil-Whisper, whose Hub org checkpoints are stated to “currently only support English” in a discussion. (Hugging Face)
This does not “force transcribe” as a formal API parameter, but it often eliminates “random languages” in practice.
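For example, you would keep the request exactly as in the Dart sketch below and only swap the model segment of the URL, e.g. to distil-whisper/distil-large-v3 or one of the English-only openai/whisper-*.en checkpoints (check the model page first, since not every checkpoint is deployed on the serverless HF Inference provider).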
Fix 3 (mitigation on serverless HF Inference): deterministic decoding + better audio
This won’t give you a formal “transcribe-only” guarantee, but it usually stabilizes outputs.
Important rule from HF docs:
If you want to send parameters, you must send base64 in JSON (inputs), not raw bytes. Raw bytes are only documented when no parameters are provided. (Hugging Face)
Example request body (JSON)
Use deterministic decoding knobs (all part of generation_parameters in the ASR schema): (Hugging Face)
{
  "inputs": "BASE64_AUDIO",
  "parameters": {
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0,
      "top_p": 1,
      "num_beams": 1,
      "max_new_tokens": 448
    }
  }
}
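(For reference, a successful response is JSON with a text field, plus a chunks list when return_timestamps is enabled, e.g. {"text": "..."}.)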
Dart sketch (adapt your code)
- Change Content-Type to application/json
- Base64 the audio bytes
- Put parameters under generation_parameters
// Needs dart:convert and package:http; keep the imports at the top of the
// file and run the request code inside an async function.
import 'dart:convert';
import 'package:http/http.dart' as http;

final uri = Uri.parse(
  "https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3-turbo",
);

final payload = jsonEncode({
  "inputs": base64Encode(audioBytes),
  "parameters": {
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0,
      "top_p": 1,
      "num_beams": 1,
      "max_new_tokens": 448,
    }
  }
});

final req = http.Request("POST", uri)
  ..headers["Authorization"] = "Bearer $hfToken"
  ..headers["Content-Type"] = "application/json"
  ..body = payload;

// Send and decode; a successful response is JSON with a "text" field.
final res = await http.Response.fromStream(await req.send());
final transcript = jsonDecode(res.body)["text"];
Audio tip: try sending a WAV/FLAC test clip first. The official ASR examples use FLAC (sample1.flac). (Hugging Face)
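If you want to prepare such a test clip from a mobile recording yourself, here is a minimal offline sketch (assumptions: Python with pydub installed and ffmpeg available; the file names are placeholders):

from pydub import AudioSegment

# Decode the mobile recording (pydub shells out to ffmpeg for m4a input).
clip = AudioSegment.from_file("recording.m4a")

# Resample to 16 kHz mono, the rate Whisper models expect.
clip = clip.set_frame_rate(16000).set_channels(1)

# Write a plain WAV that you can then base64-encode and send.
clip.export("recording_16k_mono.wav", format="wav")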
How to decide which fix you need (quick diagnosis)
- Is the output always English regardless of input language?
That aligns with “translate-like” behavior reported by others (and fixed locally via task="transcribe"). (Hugging Face)
→ If you need guarantees, use Fix 1.
- Does the output change across repeated calls on the exact same audio?
That points to decoding randomness / unstable input decoding.
→ Apply Fix 3 (deterministic decoding + WAV/FLAC).
- Is your audio always English?
→ Fix 2 is often the simplest way to eliminate “random languages.”
Bottom line for your exact requirement (“always and only transcribe”)
On router.huggingface.co/hf-inference there is no supported task="transcribe" parameter in the published ASR API schema, so you cannot enforce that behavior purely via request options. (Hugging Face)
If you need a true guarantee, the robust solution is a dedicated deployment (Inference Endpoints or self-host) where you can apply Whisper’s task/language controls as documented. (Hugging Face)