From the screenshot alone I can’t pinpoint the exact cause, but it looks like a Transformers version mismatch:
Plain-language answer
Your code is failing because the Hugging Face Whisper speech-recognition pipeline expected an internal value called num_frames, but that value was missing.
You did not forget to write num_frames yourself.
You are not supposed to pass num_frames manually.
The error is coming from inside the installed transformers package, not from your own print(result) line.
Your code is basically this:
```python
from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
)

result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)
```
The important error is:
```
KeyError: 'num_frames'
```
That means:
A part of the Whisper ASR pipeline tried to read `num_frames`, but the processed audio data did not contain that key.
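As a toy illustration (this is the shape of the failure, not the actual transformers source), one stage indexes a key that an earlier stage never set:

```python
# Toy sketch of the failure mode, NOT the transformers internals: the
# pipeline passes processed-audio dicts between stages, and one stage
# reads a key that an earlier stage never populated.
processed = {"input_features": [0.1, 0.2, 0.3], "sampling_rate": 16000}

try:
    processed["num_frames"]  # never set upstream
except KeyError as err:
    print(repr(err))  # → KeyError('num_frames')
```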
What is happening behind the scenes
The simple-looking line:
```python
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
)
```
builds a full speech-recognition system.
It does not only load the model. It also loads:
- the Whisper model,
- the tokenizer,
- the feature extractor,
- audio loading logic,
- audio decoding logic,
- preprocessing logic,
- generation logic,
- postprocessing logic.
The Hugging Face pipeline docs describe pipelines as high-level wrappers around model inference, and the ASR pipeline specifically works with audio files or raw waveforms. The docs also say audio-file input needs FFmpeg support for multiple audio formats. (Hugging Face)
So this call:
```python
result = transcriber(".../mlk.flac")
```
does many hidden steps:
- download/read the `.flac` file,
- decode the audio,
- convert the audio into numerical features,
- send those features into Whisper,
- generate text,
- return the transcript.
Your error happens around step 3 or 4, before the final transcript is produced.
What num_frames means
num_frames is internal audio metadata.
A “frame” here means a small processed unit of audio. Whisper does not read the raw .flac file directly. The audio has to be converted into model-ready features first.
The pipeline uses frame-related metadata for things like:
- audio length,
- chunking,
- timestamps,
- batching,
- long audio handling,
- mapping generated text back to time positions.
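As a rough mental model (illustrative constants, not the library's code): Whisper-style features are computed at 16 kHz with a hop of 160 samples per frame, so the frame count scales directly with clip length:

```python
# Rough illustration of what "num_frames" tracks, not transformers code.
# Whisper-style feature extraction resamples audio to 16 kHz and advances
# 160 samples per frame, so one frame covers 10 ms of audio.
SAMPLING_RATE = 16_000
HOP_LENGTH = 160

def approx_num_frames(duration_seconds: float) -> int:
    """Approximate frame count for a clip of the given duration."""
    num_samples = int(duration_seconds * SAMPLING_RATE)
    return num_samples // HOP_LENGTH

print(approx_num_frames(30.0))  # → 3000 (30 s of audio ≈ 3000 frames)
```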
So when you see:
```
KeyError: 'num_frames'
```
you can read it as:
The pipeline expected audio-length bookkeeping information, but the object it received did not include that information.
This usually points to a library/version mismatch or pipeline bug, not a mistake in your visible code.
Why your code is not obviously wrong
1. The model name is valid
This model is real:
"openai/whisper-large-v3"
The official model card says Whisper large-v3 is supported in Hugging Face Transformers. It also shows how to run it with AutoModelForSpeechSeq2Seq, AutoProcessor, and pipeline. (Hugging Face)
So the problem is probably not the model ID.
2. Passing an audio URL is allowed
The ASR pipeline supports a string input that is either:
- a local audio file path, or
- a public URL to an audio file.
The current ASR pipeline source says a string can be a filename or public URL, and the file is read at the correct sampling rate using FFmpeg. (GitHub)
So this is valid in principle:
```python
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
```
The URL style using /resolve/main/... is the correct “raw file” style.
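A quick sanity check for the URL shape (a simple string heuristic, not an official Hub API):

```python
def is_raw_hf_url(url: str) -> bool:
    """Heuristic check: Hub raw-file URLs use /resolve/<revision>/, while
    /blob/<revision>/ URLs return an HTML viewer page instead of the file."""
    return "/resolve/" in url and "/blob/" not in url

print(is_raw_hf_url(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
))  # → True
```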
3. The same error pattern exists online
There is a very similar public Transformers issue where a Whisper ASR pipeline fails with:
```
KeyError: 'num_frames'
```
inside the feature-extraction / pipeline code. The issue is labeled as a bug. (GitHub)
There are also related reports where batching fails because some audio examples contain num_frames and others do not. One issue says batch_size=1 worked, but batch_size>1 failed with a key mismatch involving num_frames. (GitHub)
That strongly suggests your case belongs to a known family of Whisper pipeline problems.
Most likely cause
The most likely cause is:
Your installed `transformers` / audio stack has a mismatch: the ASR pipeline expects `num_frames`, but the feature-extractor path you are hitting does not return it.
This can happen because of:
- an older `transformers` version,
- a very new `transformers` version with a regression,
- mixed package versions,
- a notebook runtime that was upgraded without a restart,
- audio dependencies not matching the current ASR stack,
- hidden changes in the hosted environment.
In simple terms:
Your code is small, but the environment underneath it is complicated.
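One hedged way to spot the upgrade-without-restart case is to compare the version Python has already imported against the version pip sees on disk. A minimal sketch:

```python
from importlib.metadata import version as installed_version

def looks_stale(loaded: str, on_disk: str) -> bool:
    """True when the already-imported module version differs from the
    version recorded on disk -- the classic upgrade-without-restart symptom."""
    return loaded != on_disk

# Hypothetical usage after `import transformers`:
#   looks_stale(transformers.__version__, installed_version("transformers"))
print(looks_stale("4.40.0", "4.44.2"))  # → True, so restart the kernel
```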
Best solution path
Step 1: restart with a clean package setup
Run this first:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
Then restart the runtime/kernel.
This restart is important. Installing new packages while old modules are already imported can leave Python using stale code.
The official Whisper large-v3 model card recommends installing/upgrading transformers, datasets[audio], and accelerate before running the model. (Hugging Face)
Step 2: check your versions
After restarting, run:
```python
import sys
import transformers
import torch

print("python:", sys.version)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
```
This tells you what you are actually running.
This matters because the same code may behave differently depending on:
- Python version,
- `transformers` version,
- `torch` version,
- audio decoding dependencies,
- CPU vs GPU runtime.
Step 3: use the safer official-style code
Instead of the shortest pipeline(...) version, use the more explicit pattern.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
result = transcriber(audio_url)
print(result["text"])
```
This matches the official model-card style more closely: load the model, load the processor, then pass the tokenizer and feature extractor into the pipeline explicitly. (Hugging Face)
This is better because the error involves audio feature extraction. Making the feature extractor explicit reduces hidden auto-loading ambiguity.
Why this code is safer
Your original code:
```python
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
)
```
asks Transformers to infer everything automatically.
The safer code:
```python
processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
```
makes the important pieces visible:
- the model,
- the tokenizer,
- the feature extractor.
That matters because num_frames is related to how the audio is processed before reaching the model.
Step 4: test whether the URL is involved
Your URL is probably not the main problem, but it is easy to test.
Try downloading the file first:
```python
import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
path = "mlk.flac"

response = requests.get(url)
response.raise_for_status()
with open(path, "wb") as f:
    f.write(response.content)

result = transcriber(path)
print(result["text"])
```
Interpret the result like this:
| Result | Meaning |
|---|---|
| Local file works, URL fails | The issue may be URL reading or remote audio decoding. |
| Local file also fails | The issue is probably the Transformers/Whisper pipeline stack. |
| Both work after upgrade | The issue was likely a package-version problem. |
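A further isolation step worth trying (a sketch; `transcriber` is the pipeline built in Step 3): bypass URL fetching and FFmpeg decoding entirely by passing a raw waveform dict, an input form the ASR pipeline accepts.

```python
import numpy as np

# Build a 1-second synthetic waveform so no file or network is involved.
sampling_rate = 16_000
t = np.linspace(0.0, 1.0, sampling_rate, endpoint=False)
waveform = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)  # 440 Hz tone

# The ASR pipeline accepts {"raw": <1-D float array>, "sampling_rate": <int>}.
sample = {"raw": waveform, "sampling_rate": sampling_rate}

# result = transcriber(sample)  # run with the pipeline from Step 3
print(sample["raw"].shape, sample["sampling_rate"])  # → (16000,) 16000
```

If this works while file or URL input fails, the problem sits in audio loading/decoding rather than in the model itself.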
Step 5: make sure audio dependencies are present
For audio work, you may need a fuller audio stack:
```bash
pip install --upgrade soundfile librosa torchcodec
```
For Linux/Colab-style systems, also check FFmpeg:
```bash
ffmpeg -version
```
If FFmpeg is missing on a Debian/Ubuntu-style system:
```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```
Hugging Face Datasets audio decoding uses TorchCodec, which uses FFmpeg under the hood. (Hugging Face)
Step 6: do not add extra options yet
First make plain transcription work.
Avoid these at the beginning:
- `batch_size=8`
- `chunk_length_s=30`
- `return_timestamps=True`
- `return_timestamps="word"`
- `generate_kwargs={...}`
Why?
Because num_frames is tied to pipeline bookkeeping for things like batching, timestamps, and chunking. The ASR source shows the pipeline handles chunking, stride, num_frames, timestamps, and postprocessing internally. (GitHub)
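To see why length bookkeeping matters here, this is a toy sketch of chunk scheduling, assuming a 30 s window with a 5 s overlap (illustrative numbers, not the pipeline's internals):

```python
def chunk_starts(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Start times for overlapping chunks of a long recording -- a toy
    illustration of the bookkeeping that needs the audio's total length."""
    starts, t = [], 0.0
    step = chunk_s - stride_s  # advance less than a full chunk to overlap
    while t < total_s:
        starts.append(t)
        t += step
    return starts

print(chunk_starts(60.0))  # → [0.0, 25.0, 50.0]
```

Without a reliable total-length value (the role `num_frames` plays internally), this kind of scheduling cannot be computed, which is why the missing key surfaces before any text is generated.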
Start with:
```python
result = transcriber(audio_url)
print(result["text"])
```
Then add features one by one.
If upgrading does not fix it
Try one of these controlled paths.
Option A: reinstall cleanly
```bash
pip install --upgrade --force-reinstall transformers datasets[audio] accelerate
```
Then restart the runtime.
Option B: install the newest code from GitHub
```bash
pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
pip install --upgrade datasets[audio] accelerate
```
Then restart the runtime.
This is useful when the bug has already been fixed in the repository but not yet in the normal pip release.
Option C: pin a known working version
If a specific version works, save it.
For example, after finding a working setup:
```bash
pip freeze | grep -E "transformers|torch|datasets|accelerate|torchcodec"
```
Then put those exact working versions in your notebook or requirements.txt.
Example format:
```
transformers==...
torch==...
datasets==...
accelerate==...
torchcodec==...
```
Use the actual versions that worked for you.
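Once you have pins, a small guard at the top of the notebook can fail fast if the runtime drifts. A sketch (the package names and pins are placeholders; fill in your known-good versions):

```python
from importlib.metadata import version as installed_version

# Placeholder pins -- replace None with the versions that worked for you.
PINNED = {
    "transformers": None,  # e.g. a known-good version string
}

def check_pins(pins):
    """Return {name: (wanted, installed)} for every pinned package whose
    installed version differs from the pin. Unpinned entries are skipped."""
    mismatches = {}
    for name, wanted in pins.items():
        if wanted is None:
            continue  # no pin recorded yet
        installed = installed_version(name)
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

print(check_pins(PINNED))  # → {} when everything matches (or nothing is pinned)
```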
Guides worth opening
1. Whisper large-v3 model card
Use this for the official recommended code pattern for openai/whisper-large-v3. It shows the explicit AutoModelForSpeechSeq2Seq + AutoProcessor + pipeline approach. (Hugging Face)
2. Transformers pipeline docs
Use this to understand what pipeline(...) does and why a simple call can fail inside hidden preprocessing code. (Hugging Face)
3. ASR pipeline source code
Use this only when you want to compare your traceback to the actual internal code. It shows that ASR input can be a local file path, public URL, bytes, raw NumPy array, or dictionary with sampling rate. (GitHub)
4. Datasets audio loading docs
Use this when audio loading or decoding fails. It explains that Datasets audio decoding relies on TorchCodec and FFmpeg. (Hugging Face)
5. Related GitHub issues
Use these to confirm that num_frames errors are a real known problem family, especially around Whisper, batching, and pipeline internals. (GitHub)
What not to do
Do not manually pass num_frames
This is not the solution:
```python
num_frames = ...
```
The missing value is inside the pipeline’s internal processed-audio object. It is not a parameter you are expected to provide.
Do not edit the installed package first
Avoid editing files inside:
`site-packages/transformers/...`
For example, changing:
```python
processed.pop("num_frames")
```
to something else may hide the error, but it may break timestamp or chunking behavior later.
A package upgrade, clean reinstall, or explicit model/processor loading is safer.
Do not start with batching
Do not start with:
```python
result = transcriber(list_of_audio_files, batch_size=8)
```
First make one audio file work. Related reports show num_frames can be involved in batching failures. (GitHub)
Recommended final code
Use this after upgrading and restarting:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result["text"])
```
Bottom line
Your error is best understood like this:
The Whisper ASR pipeline is trying to process audio, but the installed pipeline stack expects an internal `num_frames` value that is missing. Your code is not obviously wrong; the issue is most likely a `transformers` / audio dependency / runtime-version mismatch or bug.
Quick checklist
- Upgrade: `pip install --upgrade pip`, then `pip install --upgrade transformers datasets[audio] accelerate`.
- Restart the runtime.
- Use explicit `AutoModelForSpeechSeq2Seq` + `AutoProcessor`.
- Test one audio file first.
- Avoid batching/timestamps/chunking until the basic call works.
- Check FFmpeg / audio dependencies if audio decoding fails.
- Pin the working package versions once it runs.
