Whisper Hugging Face Inference API bug

Please, help!

Although the default Whisper API task is transcription, I receive a translation (and into different random languages!?!). This happens with both whisper-large-v3 and whisper-large-v3-turbo. Why?

final request = http.Request('POST', url); // https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3-turbo
request.headers['Authorization'] = 'Bearer $_hfToken';
request.headers['Content-Type'] = 'audio/m4a';
request.bodyBytes = audioBytes;

*** I tried to add additional headers:

request.headers['Accept-Language'] = 'en,en-US';
request.headers['language'] = 'en';
request.headers['task'] = 'transcribe';

… but it didn’t help.

*** I also tried a JSON payload like:
{
  'inputs': base64Audio,
  'parameters': {
    'task': 'transcribe',
    'language': 'en',
  },
}

… but it returned the error "unexpected keyword argument 'task'"… Yes, 'task' is not among the public HF API's parameters, but it's strange that the standard Whisper parameters aren't available either.

How can I fix this issue? I need to always and only receive a transcription (never a translation into a random language), which is stated to be the default behavior of the Whisper Hugging Face Inference API.


It doesn't really look like a bug, but it does have some pretty peculiar quirks…


What you’re calling (and why “default transcribe” isn’t behaving)

https://router.huggingface.co/hf-inference/models/<model> is the Inference Providers router, and hf-inference is Hugging Face's serverless provider ("HF Inference"), which Hugging Face notes is the service formerly known as the serverless "Inference API". (Hugging Face)

That matters because there are two different layers:

  1. Whisper in Transformers (local / your own endpoint code)
    Whisper exposes Whisper-specific generation controls like:

    • task: "transcribe" vs "translate"
    • language: "en", "english", etc.
      These are documented in the Whisper model docs and model cards. (Hugging Face)
  2. HF Inference Providers “Automatic Speech Recognition” HTTP schema (router endpoint)
    The router endpoint’s public ASR schema does not include Whisper-specific task or language. It only allows:

    • inputs (base64 audio)
    • parameters.return_timestamps
    • parameters.generation_parameters (generic decoding knobs like temperature, do_sample, top_p, num_beams, etc.) (Hugging Face)

So you’re seeing a mismatch: Whisper supports task/language, but this particular serverless ASR interface does not expose them.


Why your attempts failed

1) Headers like Accept-Language, task, language don’t control Whisper here

They are not part of the documented ASR request. The router only documents auth + payload fields; it does not define Whisper controls via headers. (Hugging Face)

2) Your JSON failed because task isn’t a valid ASR parameter

Your payload:

{"parameters": {"task": "transcribe"}}

fails because the ASR schema does not list task (or language) under parameters. (Hugging Face)

This is also consistent with other users asking how to set language/task on the inference API and not finding a supported format. (Hugging Face)


Why you get “translation” and “random languages”

There are two common, separate phenomena that can look like “translate”:

A) The serving backend can effectively behave like task="translate"

Multiple users have reported that the Inference API started returning outputs consistent with translation-to-English behavior, and the local/Transformers workaround was explicitly setting generate_kwargs={"task":"transcribe"}. (Hugging Face)

On serverless HF Inference, you cannot reliably apply that workaround because the router ASR schema doesn’t allow Whisper’s task.

B) Language detection instability (often triggered by input audio / decoding)

If you see output in “random languages”, that is frequently language ID + decoding instability, not “translation mode” (Whisper’s translate mode is specifically “translate to English”).
This gets worse when:

  • audio is short / noisy / silence-heavy
  • the codec/container decode is brittle (mobile-recorded m4a variants are common culprits)
  • decoding is non-deterministic (sampling) rather than greedy

Serverless HF Inference does let you reduce decoding randomness via generation_parameters (see below). (Hugging Face)


What you can and cannot fix on router.huggingface.co/hf-inference

You cannot guarantee “always transcribe” (task control) with this serverless API today

Because the published ASR schema does not accept Whisper’s task/language, there is no supported way to force <|transcribe|> vs <|translate|> or to pin language at the API level. (Hugging Face)

You can improve stability and reduce “random language” outputs

You can:

  • switch to the JSON+base64 form (so you can pass generation_parameters)
  • make decoding deterministic (do_sample: false, temperature: 0)
  • normalize audio (prefer WAV/FLAC 16kHz mono)

Those steps often remove the “random” aspect even if you still can’t force task.


The practical fixes (ranked by “guarantee”)

Fix 1 (only real guarantee): Use a dedicated endpoint / self-host where you can set task="transcribe"

If you need hard guarantees (“always transcribe; never translate; always English”), you need an environment where Whisper’s native controls are exposed—e.g. a dedicated Inference Endpoint (managed deployment) or your own service. Hugging Face positions Inference Endpoints as the “dedicated and autoscaling infrastructure” option. (Hugging Face)

Then you can run Transformers and explicitly do:

  • generate_kwargs={"task":"transcribe", "language":"english"} (or just task) as documented in the model card/docs. (Hugging Face)

This is exactly the knob people used to fix “it defaults to translate” behavior. (Hugging Face)
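
As a concrete illustration, a minimal Transformers sketch for a dedicated endpoint or self-hosted setup might look like this (the audio path is a placeholder, and ffmpeg may be needed to decode non-WAV inputs):

# Minimal sketch: pin task/language via generate_kwargs (assumes transformers + torch are installed)
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
)

result = asr(
    "audio.wav",  # placeholder path
    generate_kwargs={"task": "transcribe", "language": "english"},
)
print(result["text"])

Because you control generate_kwargs here, the task can never silently flip to translate.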

Fix 2 (works well if your audio language is fixed): use an English-only checkpoint

If your input is always English, pick a model that is English-only, so it can’t wander into other languages. Whisper has “English-only vs multilingual” variants (the multilingual ones were trained for both ASR and translation). (Hugging Face)

A practical option is Distil-Whisper; a Hub discussion notes that its checkpoints "currently only support English". (Hugging Face)

This does not “force transcribe” as a formal API parameter, but it often eliminates “random languages” in practice.
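
If you go this route, the only change on your side is the model path in the request URL, for example (a hypothetical choice; check that the checkpoint is actually deployed on the serverless provider before relying on it):

/static-proxy?url=https%3A%2F%2Frouter.huggingface.co%2Fhf-inference%2Fmodels%2Fdistil-whisper%2Fdistil-large-v3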

Fix 3 (mitigation on serverless HF Inference): deterministic decoding + better audio

This won’t give you a formal “transcribe-only” guarantee, but it usually stabilizes outputs.

Important rule from HF docs:
If you want to send parameters, you must send base64 in JSON (inputs), not raw bytes. Raw bytes are only documented when no parameters are provided. (Hugging Face)

Example request body (JSON)

Use deterministic decoding knobs (all part of generation_parameters in the ASR schema): (Hugging Face)

{
  "inputs": "BASE64_AUDIO",
  "parameters": {
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0,
      "top_p": 1,
      "num_beams": 1,
      "max_new_tokens": 448
    }
  }
}

Dart sketch (adapt your code)

  • Change Content-Type to application/json
  • Base64 the audio bytes
  • Put parameters under generation_parameters
// Imports needed for jsonEncode/base64Encode and the http client
import 'dart:convert';
import 'package:http/http.dart' as http;

final uri = Uri.parse(
  "https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3-turbo",
);

final payload = jsonEncode({
  "inputs": base64Encode(audioBytes),
  "parameters": {
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0,
      "top_p": 1,
      "num_beams": 1,
      "max_new_tokens": 448,
    }
  }
});

final req = http.Request("POST", uri)
  ..headers["Authorization"] = "Bearer $hfToken"
  ..headers["Content-Type"] = "application/json"
  ..body = payload;

Audio tip: try sending a WAV/FLAC test clip first. The official ASR examples use FLAC (sample1.flac). (Hugging Face)


How to decide which fix you need (quick diagnosis)

  1. Is the output always English regardless of input language?
    That aligns with “translate-like” behavior reported by others (and fixed locally via task="transcribe"). (Hugging Face)
    → If you need guarantees, use Fix 1.

  2. Does the output change across repeated calls on the exact same audio?
    That points to decoding randomness / unstable input decoding.
    → Apply Fix 3 (deterministic decoding + WAV/FLAC).

  3. Is your audio always English?
    → Fix 2 is often the simplest way to eliminate “random languages.”


Bottom line for your exact requirement (“always and only transcribe”)

On router.huggingface.co/hf-inference there is no supported task="transcribe" parameter in the published ASR API schema, so you cannot enforce that behavior purely via request options. (Hugging Face)

If you need a true guarantee, the robust solution is a dedicated deployment (Inference Endpoints or self-host) where you can apply Whisper’s task/language controls as documented. (Hugging Face)

API request failed with status: 400
File name too long: ‘AAAAGGZ0eXBtcDQyAAAAAGlzb21tcDQyAAADpW1vb3YAAABsbXZoZAAAAAD…’

while the audio file is only 15 seconds long


Oh… There are discrepancies in the documentation's claims… :scream:


What the File name too long: 'AAAAGGZ0eXBtcDQy...' error means

That string (AAAAGGZ0eXBtcDQy...) is the base64 of your .m4a file header (MP4 container). The server is not decoding it as base64 audio. Instead, it is treating the inputs string as a file path and trying to open it as a filename; because it’s thousands of characters long, the OS raises “file name too long”.
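
You can verify this yourself: base64-decoding the start of that "filename" yields the MP4 ftyp header rather than any path. A quick check (standard library only):

import base64

# First characters of the "file name" from the error message:
print(base64.b64decode("AAAAGGZ0eXBtcDQy"))
# b'\x00\x00\x00\x18ftypmp42'  -> the 'ftyp' box of an MP4/M4A container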

This is a known failure mode for ASR requests where a base64 string is accidentally interpreted as a path. (Hugging Face Forums)

So: your audio length (15 seconds) is not the issue. The request is being interpreted incorrectly on the server side.


Why this happens on this endpoint

1) The ASR schema says inputs is a base64 string (but some backends still treat strings as paths)

Hugging Face’s Inference Providers ASR documentation states:

  • inputs: base64-encoded audio string
  • or raw bytes if you don’t send parameters (Hugging Face)

However, some ASR serving wrappers (and older endpoint implementations) handle a string inputs as “path/URL to an audio file” first, and only decode bytes in other branches. When that happens, base64 gets misread as a path → “file name too long”. (Hugging Face Forums)

2) .m4a decoding and content-type handling can be brittle

For binary-audio tasks, HF has historically relied on “content-type guessing” or backend-specific decoding paths; inconsistencies between serverless inference and other deployments are documented as a practical pitfall. (Hugging Face)


Fixes (choose based on whether you must send parameters)

Fix A — Most reliable: send raw audio bytes (no JSON, no parameters)

Per the ASR docs, if you omit parameters, you can send raw bytes directly. (Hugging Face)

Dart (raw bytes)

final url = Uri.parse(
  'https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3-turbo',
);

final req = http.Request('POST', url)
  ..headers['Authorization'] = 'Bearer $_hfToken'
  ..headers['Accept'] = 'application/json'
  // For .m4a in an MP4 container, audio/mp4 is generally safer than audio/m4a.
  ..headers['Content-Type'] = 'audio/mp4'
  ..bodyBytes = audioBytes;

final streamed = await req.send();
final body = await streamed.stream.bytesToString();

if (streamed.statusCode != 200) {
  throw Exception('HTTP ${streamed.statusCode}: $body');
}

Why this works: the server receives bytes and can’t misinterpret them as a filename. (Hugging Face)

Tradeoff: you can’t pass generation_parameters in this “raw bytes” mode (the docs only describe raw bytes when no parameters are provided). (Hugging Face)


Fix B — If you need generation_parameters: pass a URL as inputs (JSON)

If your backend is treating string inputs as a path, then use that intentionally: provide an HTTPS URL to the audio file (ideally a short-lived signed URL), and keep your JSON parameters.

This aligns with HF’s client documentation: ASR inputs can be raw bytes, a local file, or a URL. (Hugging Face)
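
If part of your pipeline runs in Python, the same idea can be expressed with the huggingface_hub client (a sketch; the token and signed URL are placeholders, and it assumes a recent huggingface_hub version that accepts the provider argument):

from huggingface_hub import InferenceClient

client = InferenceClient(provider="hf-inference", token="hf_xxx")  # placeholder token

result = client.automatic_speech_recognition(
    "https://<signed-url>/audio.m4a",       # raw bytes, a local file path, or a URL
    model="openai/whisper-large-v3-turbo",
)
print(result.text)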

JSON body

{
  "inputs": "https://<signed-url>/audio.m4a",
  "parameters": {
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0,
      "top_p": 1,
      "num_beams": 1,
      "max_new_tokens": 448
    }
  }
}

Dart (URL input + parameters)

final payload = jsonEncode({
  "inputs": signedAudioUrl, // https://...
  "parameters": {
    "generation_parameters": {
      "do_sample": false,
      "temperature": 0,
      "top_p": 1,
      "num_beams": 1,
      "max_new_tokens": 448,
    }
  }
});

final req = http.Request('POST', url)
  ..headers['Authorization'] = 'Bearer $_hfToken'
  ..headers['Content-Type'] = 'application/json'
  ..headers['Accept'] = 'application/json'
  ..body = payload;

Why this works: it avoids base64 entirely, and it matches the “string interpreted as path/URL” behavior that is causing your error. (Hugging Face Forums)


Fix C — If you need to force “transcribe” vs “translate” reliably: use a deployment that exposes Whisper’s task/language controls

Whisper supports explicit generation controls:

  • task: "transcribe" or "translate"
  • language: tokens like "en" / "english" (Hugging Face)

But the serverless ASR interface you’re calling is a generic ASR wrapper; even when generation_parameters is supported, Whisper-specific task/language may not be exposed the way Transformers exposes them. (Hugging Face)

If you must guarantee “never translate, always transcribe (and optionally force English)”, the robust approach is to run Whisper behind an endpoint where you control the inference code (so you can set task="transcribe" and language="english" explicitly). (Hugging Face)
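
As an illustration of "controlling the inference code", a custom handler for Inference Endpoints could look roughly like this (a sketch of the standard EndpointHandler interface; how the audio arrives in data depends on your deployment):

# handler.py -- sketch: pin the task so the endpoint can never translate
from transformers import pipeline

class EndpointHandler:
    def __init__(self, path: str = ""):
        # "path" points at the model repository the endpoint was created from
        self.asr = pipeline("automatic-speech-recognition", model=path)

    def __call__(self, data: dict):
        audio = data["inputs"]  # bytes or array, depending on how the request is sent
        return self.asr(
            audio,
            generate_kwargs={"task": "transcribe", "language": "english"},
        )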


Practical stability tips (to reduce “random languages”)

Even after you fix the request shape, “random language” output is often caused by audio decoding / language-ID instability. The highest-impact change is:

  • Convert to WAV PCM, mono, 16 kHz before sending (then use Content-Type: audio/wav); see the example command after this list.
  • If you keep .m4a, use audio/mp4 rather than audio/m4a (some stacks handle it more consistently).
  • Make decoding deterministic (temperature: 0, do_sample: false) — but that requires Fix B (URL input) or an endpoint that accepts parameters with bytes. (Hugging Face)
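
A typical conversion command (assumes ffmpeg is installed; file names are placeholders):

ffmpeg -y -i input.m4a -ac 1 -ar 16000 -c:a pcm_s16le output.wav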

Recommendation for your exact situation

  1. First, switch to Fix A (raw bytes) and confirm transcription works consistently (this isolates request-format issues). (Hugging Face)
  2. If you need deterministic decoding knobs, move to Fix B (URL input) and keep generation_parameters. (Hugging Face)
  3. If you need a hard guarantee on transcribe vs translate, use a setup that exposes Whisper’s task/language controls directly. (Hugging Face)