Why am I facing this Error while running this code

Based on that screenshot alone, I can’t pinpoint the exact cause…
But it might be a case of a Transformers version mismatch:


Plain-language answer

Your code is failing because the Hugging Face Whisper speech-recognition pipeline expected an internal value called num_frames, but that value was missing.

You did not forget to write num_frames yourself.
You are not supposed to pass num_frames manually.

The error is coming from inside the installed transformers package, not from your own print(result) line.

Your code is basically this:

from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3"
)

result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)

print(result)

The important error is:

KeyError: 'num_frames'

That means:

A part of the Whisper ASR pipeline tried to read num_frames, but the processed audio data did not contain that key.


What is happening behind the scenes

The simple-looking line:

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3"
)

builds a full speech-recognition system.

It does not only load the model. It also loads:

  • the Whisper model,
  • the tokenizer,
  • the feature extractor,
  • audio loading logic,
  • audio decoding logic,
  • preprocessing logic,
  • generation logic,
  • postprocessing logic.

The Hugging Face pipeline docs describe pipelines as high-level wrappers around model inference, and the ASR pipeline specifically works with audio files or raw waveforms. The docs also say audio-file input needs FFmpeg support for multiple audio formats. (Hugging Face)

So this call:

result = transcriber(".../mlk.flac")

does many hidden steps:

  1. download/read the .flac file,
  2. decode the audio,
  3. convert the audio into numerical features,
  4. send those features into Whisper,
  5. generate text,
  6. return the transcript.

Your error happens around step 3 or 4, before the final transcript is produced.


What num_frames means

num_frames is internal audio metadata.

A “frame” here means a small processed unit of audio. Whisper does not read the raw .flac file directly. The audio has to be converted into model-ready features first.

The pipeline uses frame-related metadata for things like:

  • audio length,
  • chunking,
  • timestamps,
  • batching,
  • long audio handling,
  • mapping generated text back to time positions.

So when you see:

KeyError: 'num_frames'

you can read it as:

The pipeline expected audio-length bookkeeping information, but the object it received did not include that information.

This usually points to a library/version mismatch or pipeline bug, not a mistake in your visible code.
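To make "frame" concrete, here is a tiny illustration (not the pipeline's actual code): Whisper's published front end processes 16 kHz audio with a 160-sample (10 ms) hop between frames, so a rough frame count is just the sample count divided by the hop length. The constants match the public Whisper defaults; the pipeline's real bookkeeping tracks more than this.

```python
# Illustration only: rough "frames" arithmetic for a Whisper-style
# front end. Constants match the published Whisper defaults
# (16 kHz audio, 10 ms hop); the pipeline's internal bookkeeping
# carries additional metadata beyond a single count.
SAMPLING_RATE = 16_000  # samples per second Whisper expects
HOP_LENGTH = 160        # samples between successive feature frames

def approx_num_frames(num_samples: int) -> int:
    """Approximate number of feature frames for a mono clip."""
    return num_samples // HOP_LENGTH

# A 5-second clip -> 5 * 16000 samples -> about 500 frames.
print(approx_num_frames(5 * SAMPLING_RATE))  # 500
```

The point is only that num_frames is derived metadata about audio length, which is why it should be produced internally rather than passed in by you.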


Why your code is not obviously wrong

1. The model name is valid

This model is real:

"openai/whisper-large-v3"

The official model card says Whisper large-v3 is supported in Hugging Face Transformers. It also shows how to run it with AutoModelForSpeechSeq2Seq, AutoProcessor, and pipeline. (Hugging Face)

So the problem is probably not the model ID.


2. Passing an audio URL is allowed

The ASR pipeline supports a string input that is either:

  • a local audio file path, or
  • a public URL to an audio file.

The current ASR pipeline source says a string can be a filename or public URL, and the file is read at the correct sampling rate using FFmpeg. (GitHub)

So this is valid in principle:

transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")

The URL style using /resolve/main/... is the correct “raw file” style.
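One common mistake worth ruling out: links copied from the Hub's web interface contain /blob/ (an HTML page) instead of /resolve/ (the raw file). A hypothetical one-line helper (the function name is mine, not part of any library) converts between the two:

```python
# Hypothetical helper (not part of transformers): convert a Hub
# "blob" page URL, which serves HTML, into the raw "resolve" URL
# that actually returns the audio bytes.
def to_resolve_url(url: str) -> str:
    return url.replace("/blob/", "/resolve/", 1)

page_url = "https://huggingface.co/datasets/Narsil/asr_dummy/blob/main/mlk.flac"
print(to_resolve_url(page_url))
# https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac
```

Your URL already uses /resolve/, so this is not your bug, but it is a frequent source of confusing decoding errors.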


3. The same error pattern exists online

There is a very similar public Transformers issue where a Whisper ASR pipeline fails with:

KeyError: 'num_frames'

inside the feature-extraction / pipeline code. The issue is labeled as a bug. (GitHub)

There are also related reports where batching fails because some audio examples contain num_frames and others do not. One issue says batch_size=1 worked, but batch_size>1 failed with a key mismatch involving num_frames. (GitHub)

That strongly suggests your case belongs to a known family of Whisper pipeline problems.


Most likely cause

The most likely cause is:

Your installed transformers / audio stack has a mismatch where the ASR pipeline expects num_frames, but the feature extractor path you are hitting does not return it.

This can happen because of:

  • an older transformers version,
  • a very new transformers version with a regression,
  • mixed package versions,
  • a notebook runtime that was upgraded without restart,
  • audio dependencies not matching the current ASR stack,
  • hidden changes in the hosted environment.

In simple terms:

Your code is small, but the environment underneath it is complicated.


Best solution path

Step 1: restart with a clean package setup

Run this first:

pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate

Then restart the runtime/kernel.

This restart is important. Installing new packages while old modules are already imported can leave Python using stale code.

The official Whisper large-v3 model card recommends installing/upgrading transformers, datasets[audio], and accelerate before running the model. (Hugging Face)


Step 2: check your versions

After restarting, run:

import sys
import transformers
import torch

print("python:", sys.version)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())

This tells you what you are actually running.

This matters because the same code may behave differently depending on:

  • Python version,
  • transformers version,
  • torch version,
  • audio decoding dependencies,
  • CPU vs GPU runtime.

Step 3: use the safer official-style code

Instead of the shortest pipeline(...) version, use the more explicit pattern.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"

result = transcriber(audio_url)

print(result["text"])

This matches the official model-card style more closely: load the model, load the processor, then pass the tokenizer and feature extractor into the pipeline explicitly. (Hugging Face)

This is better because the error involves audio feature extraction. Making the feature extractor explicit reduces hidden auto-loading ambiguity.


Why this code is safer

Your original code:

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3"
)

asks Transformers to infer everything automatically.

The safer code:

processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

makes the important pieces visible:

  • the model,
  • the tokenizer,
  • the feature extractor.

That matters because num_frames is related to how the audio is processed before reaching the model.


Step 4: test whether the URL is involved

Your URL is probably not the main problem, but it is easy to test.

Try downloading the file first:

import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
path = "mlk.flac"

response = requests.get(url, timeout=60)
response.raise_for_status()

with open(path, "wb") as f:
    f.write(response.content)

result = transcriber(path)
print(result["text"])

Interpret the result like this:

  • Local file works, URL fails → the issue may be URL reading or remote audio decoding.
  • Local file also fails → the issue is probably the Transformers/Whisper pipeline stack.
  • Both work after the upgrade → the issue was likely a package-version problem.

Step 5: make sure audio dependencies are present

For audio work, you may need a fuller audio stack:

pip install --upgrade soundfile librosa torchcodec

For Linux/Colab-style systems, also check FFmpeg:

ffmpeg -version

If FFmpeg is missing on a Debian/Ubuntu-style system:

sudo apt-get update
sudo apt-get install -y ffmpeg

Hugging Face Datasets audio decoding uses TorchCodec, which uses FFmpeg under the hood. (Hugging Face)
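If you would rather check from inside Python (handy in a notebook), the standard library is enough; this is a sketch, and the exact version string printed depends on your install:

```python
import shutil
import subprocess

# Sketch: verify FFmpeg is reachable before debugging the pipeline.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH - install it first")
else:
    out = subprocess.run(
        [ffmpeg_path, "-version"], capture_output=True, text=True
    )
    print(out.stdout.splitlines()[0])  # first line holds the version
```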


Step 6: do not add extra options yet

First make plain transcription work.

Avoid these at the beginning:

batch_size=8
chunk_length_s=30
return_timestamps=True
return_timestamps="word"
generate_kwargs={...}

Why?

Because num_frames is tied to pipeline bookkeeping for things like batching, timestamps, and chunking. The ASR source shows the pipeline handles chunking, stride, num_frames, timestamps, and postprocessing internally. (GitHub)
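To see what kind of bookkeeping that is, here is a deliberately simplified toy (the function name and numbers are mine; the real logic in transformers differs in detail) that computes overlapping chunk boundaries from a total length and left/right strides:

```python
# Toy sketch of chunk/stride bookkeeping of the sort the ASR
# pipeline performs internally for long audio. Simplified: the
# real implementation in transformers differs in detail.
def chunk_bounds(num_samples, chunk_len, stride_left, stride_right):
    """Return (start, stop) sample ranges with overlapping strides."""
    step = chunk_len - stride_left - stride_right
    bounds = []
    for start in range(0, num_samples, step):
        stop = min(start + chunk_len, num_samples)
        bounds.append((start, stop))
        if stop == num_samples:
            break
    return bounds

print(chunk_bounds(10, 4, 1, 1))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Every one of those boundaries depends on knowing the total length up front, which is exactly the kind of information num_frames carries. If that metadata is missing, chunking and timestamp options are the first things to break.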

Start with:

result = transcriber(audio_url)
print(result["text"])

Then add features one by one.


If upgrading does not fix it

Try one of these controlled paths.

Option A: reinstall cleanly

pip install --upgrade --force-reinstall transformers "datasets[audio]" accelerate

Then restart the runtime.


Option B: install the newest code from GitHub

pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
pip install --upgrade "datasets[audio]" accelerate

Then restart the runtime.

This is useful when the bug has already been fixed in the repository but not yet in the normal pip release.


Option C: pin a known working version

If a specific version works, save it.

For example, after finding a working setup:

pip freeze | grep -E "transformers|torch|datasets|accelerate|torchcodec"

Then put those exact working versions in your notebook or requirements.txt.

Example format:

transformers==...
torch==...
datasets==...
accelerate==...
torchcodec==...

Use the actual versions that worked for you.
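If you prefer collecting the pins from Python instead of the shell (for example inside a notebook), the standard library can produce the same requirements-style lines; the package list below just mirrors the grep above:

```python
from importlib.metadata import PackageNotFoundError, version

# Print requirements.txt-style pins for the packages that matter here.
for pkg in ["transformers", "torch", "datasets", "accelerate", "torchcodec"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"# {pkg} is not installed")
```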


Guides worth opening

1. Whisper large-v3 model card

Use this for the official recommended code pattern for openai/whisper-large-v3. It shows the explicit AutoModelForSpeechSeq2Seq + AutoProcessor + pipeline approach. (Hugging Face)

2. Transformers pipeline docs

Use this to understand what pipeline(...) does and why a simple call can fail inside hidden preprocessing code. (Hugging Face)

3. ASR pipeline source code

Use this only when you want to compare your traceback to the actual internal code. It shows that ASR input can be a local file path, public URL, bytes, raw NumPy array, or dictionary with sampling rate. (GitHub)

4. Datasets audio loading docs

Use this when audio loading or decoding fails. It explains that Datasets audio decoding relies on TorchCodec and FFmpeg. (Hugging Face)

5. Related GitHub issues

Use these to confirm that num_frames errors are a real known problem family, especially around Whisper, batching, and pipeline internals. (GitHub)


What not to do

Do not manually pass num_frames

This is not the solution:

num_frames = ...

The missing value is inside the pipeline’s internal processed-audio object. It is not a parameter you are expected to provide.


Do not edit the installed package first

Avoid editing files inside:

site-packages/transformers/...

For example, changing:

processed.pop("num_frames")

to something else may hide the error, but it may break timestamp or chunking behavior later.
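A short dictionary example shows both the original failure and why the blind patch is risky (the processed dict here is a stand-in for the pipeline's internal object, not real library code):

```python
# Stand-in for the pipeline's internal processed-audio dict,
# missing the key the library code expects.
processed = {"input_features": "..."}

try:
    processed.pop("num_frames")      # what the library line does
except KeyError as err:
    print("KeyError:", err)          # KeyError: 'num_frames'

# The tempting patch: pop with a default. The error disappears,
# but downstream code expecting an int now silently receives None,
# which can surface much later in timestamp or chunking logic.
patched = processed.pop("num_frames", None)
print(patched)  # None
```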

A package upgrade, clean reinstall, or explicit model/processor loading is safer.


Do not start with batching

Do not start with:

result = transcriber(list_of_audio_files, batch_size=8)

First make one audio file work. Related reports show num_frames can be involved in batching failures. (GitHub)


Recommended final code

Use this after upgrading and restarting:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)

print(result["text"])

Bottom line

Your error is best understood like this:

The Whisper ASR pipeline is trying to process audio, but the installed pipeline stack expects an internal num_frames value that is missing. Your code is not obviously wrong; the issue is most likely a transformers / audio dependency / runtime-version mismatch or bug.

Quick checklist

  • Upgrade:
pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate
  • Restart runtime.
  • Use explicit AutoModelForSpeechSeq2Seq + AutoProcessor.
  • Test one audio file first.
  • Avoid batching/timestamps/chunking until the basic call works.
  • Check FFmpeg / audio dependencies if audio decoding fails.
  • Pin the working package versions once it runs.