Never-ending inference until it reaches max context length

I am running gpt-oss-120b with vllm using the following command:

python -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-120b --download-dir /home/huf/.cache --trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --override-generation-config ‘{“temperature”: 0.2}’ --tool-call-parser openai --port 8000

I am testing concurrent users. With 1 user it runs fast and produces output, but once it reaches 2 concurrent users, the inference never ends; it only produces output once it hits the context length (131,000, or whatever I limit it to). And the output is hallucination, just repeated words.

What exactly is the issue? Is it the big input context (each user sends 50,000 tokens)? Or is it the cache?

The GPU is an H100 with 86 GB of VRAM; the model takes about 65 GB and one full context (131K tokens) is about 4 GB, so it reaches about 70 GB.


hmm…?


What’s happening in your case

Your symptoms (fast with 1 user; with 2 users it “never ends” until it hits ~131k and then returns repetitive garbage) are best explained by three interacting factors:

  1. You are effectively allowing an extremely large completion (often tens of thousands of tokens), so the model keeps generating until it hits the length cap.
  2. You set --max-num-seqs 1, so you don’t actually have 2 requests running concurrently—your second user waits behind the first.
  3. With GPT-OSS-120B, there are known vLLM issues that can make streaming/output appear “stuck” (and in some versions, produce non-terminating repetitive output).

Separately, KV-cache pressure can make (2) and (3) much worse when you raise concurrency or keep very long contexts.


Background: prefill vs decode, and why “50k tokens” changes everything

vLLM (and transformer inference in general) has two main phases:

  • Prefill: process the entire prompt (your ~50k tokens) once and build the KV cache.
  • Decode: generate tokens one-by-one, reading/writing KV cache each step.

If a request has a very long prefill, it can block other requests from getting a quick “time to first token” unless you use features like chunked prefill (vLLM can split a large prefill into chunks and interleave them with decoding). vLLM documents chunked prefill and why it exists. (vLLM)
There’s also an issue describing how long-context prefills can block subsequent requests and inflate latency. (GitHub)


1) The “never ends until max context” part: your completion budget is probably huge

On vLLM’s OpenAI-compatible server, if the client does not set max_tokens / max_completion_tokens, a commonly reported behavior is:

  • default_max_tokens = context_window − prompt_tokens (GitHub)

With your numbers:

  • Context window ≈ 131k
  • Prompt ≈ 50k
  • Default completion budget could be ≈ 80k tokens

That is “legal,” but it is also a reliable way to get:

  • very long runtimes, and
  • degeneration into repeated phrases near the end (common failure mode when you force extremely long continuations).

Why you only notice it at “2 concurrent users”: with --max-num-seqs 1, the first request can run for a very long time, and the second request just sits there, giving the impression of “everything is stuck”.


2) The “2 concurrent users makes it hang” part: --max-num-seqs 1 is a hard bottleneck

--max-num-seqs is “Maximum number of sequences per iteration.” (vLLM)

With --max-num-seqs 1:

  • Only one request is actively scheduled at a time.
  • A second “concurrent” user is queued until the first finishes.

So if request #1 is allowed to generate ~80k tokens (see above), request #2 will look like it “never ends”.


3) GPT-OSS-120B + tool calling + certain vLLM versions can produce “no output / blocked streaming / endless repetition”

You are running with:

  • --enable-auto-tool-choice
  • --tool-call-parser openai

There are multiple GPT-OSS-specific issues in vLLM that match variants of what you describe:

  • “Hangs indefinitely without a response; logs show generating but no output” (notably reported as a regression between versions). (GitHub)
  • “Endless repetitive output that never terminates” after upgrading to v0.11.x; switching back resolves. (GitHub)
  • Streaming blocked until the entire output is generated with gpt-oss-120B (makes it look like nothing is happening). (GitHub)

Tool calling + endpoint mismatch

vLLM’s own GPT-OSS recipe explicitly states:

  • The /v1/chat/completions endpoint: “No tool will be invoked …” (vLLM)

So if your clients are using Chat Completions while you enable auto-tool-choice/tool parsing, you can end up in a “supported-but-fragile” area depending on vLLM version and client expectations.


4) Is it “big input context” or “cache”? Mostly: output budget + scheduling; sometimes KV cache too

Big input context: yes, it’s a major contributor

50k-token prompts make prefill heavy and increase total KV cache required. Even if everything is correct, it increases latency and worsens queueing behind long requests. (GitHub)

KV cache: it can cause preemption (recompute/swap), which looks like stalls

vLLM will preempt sequences when there isn’t enough KV cache space. The docs show the exact warning and recommend increasing gpu_memory_utilization or parallelism if preemptions are frequent. (vLLM)

vLLM internally manages KV cache in blocks (“PagedAttention”); the KV-cache manager allocates and tracks blocks, and when it runs low, scheduling becomes painful. (vLLM Blog)

Your “131k tokens ≈ 4GB” estimate is likely too low for this model

KV memory per request scales roughly with:

  • number of layers × KV heads × head_dim × bytes_per_element × 2 (K+V) × sequence_length
    A vLLM issue discussing KV cache memory sizing uses this exact style of formula. (GitHub)

The actual numbers depend on the architecture + KV dtype + tensor parallelism, so the most reliable method is:

  • read vLLM startup logs (KV cache capacity in tokens), and
  • watch metrics (below); a rough worked example of the formula follows this list.
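
Every value in the example below is a placeholder, not necessarily gpt-oss-120b’s real configuration, so read the actual numbers from the model’s config.json and treat the startup log as authoritative:

  # Rough per-sequence KV-cache size from the formula above (placeholder values).
  LAYERS=36        # placeholder: num_hidden_layers from config.json
  KV_HEADS=8       # placeholder: num_key_value_heads
  HEAD_DIM=64      # placeholder: head_dim
  BYTES=2          # fp16/bf16 KV cache
  SEQ_LEN=131072   # one full context
  TOTAL=$(( LAYERS * KV_HEADS * HEAD_DIM * BYTES * 2 * SEQ_LEN ))   # x2 for K and V
  echo "KV cache per full-context sequence: $(( TOTAL / 1024 / 1024 / 1024 )) GiB"

With these placeholder numbers the result is already several GiB per full-context sequence, which is why hand estimates are easy to get wrong and the startup log’s reported KV-cache capacity is the number to trust.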

5) Two command/config pitfalls in what you pasted

A) Your “temperature override” may not be applied

You used “smart quotes” in:

  • --override-generation-config ‘{“temperature”: 0.2}’

That often fails parsing. Also, vLLM’s OpenAI server will, by default, apply the model repo’s generation_config.json unless disabled. (vLLM)
So you can end up testing different sampling defaults than you think.

B) GPT-OSS-120B supports huge output, but that doesn’t mean you should allow it by default

OpenAI’s model page lists 131,072 context window and 131,072 max output tokens. (OpenAI Developers)
If your server defaults to “context − prompt”, it can legitimately try to generate tens of thousands of tokens unless you cap it.


What I would do (in order), to make your test behave normally

1) Cap the completion length in the client (non-negotiable for load tests)

Set max_tokens / max_completion_tokens to something like 256–2048 for concurrency benchmarking. This directly prevents “run until context limit.” (GitHub)
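
A minimal client-side sketch (the 512-token cap, prompt, and sampling values are illustrative; adjust them to your test harness):

  # An explicit max_tokens makes "runs until the context limit" impossible by construction.
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "openai/gpt-oss-120b",
          "messages": [{"role": "user", "content": "Summarize the attached report."}],
          "max_tokens": 512,
          "temperature": 0.2
        }'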

2) Set --max-num-seqs to the concurrency you actually want

If you want 2 concurrent users, set:

  • --max-num-seqs 2 (or higher)

Otherwise, you’re measuring queueing behind a single active request. (vLLM)
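
A sketch of the launch command with the concurrency raised (flags taken from your original command; the tool-calling flags are deliberately omitted here, see step 4):

  # Schedule up to 2 sequences per iteration instead of queueing the second user.
  python -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --download-dir /home/huf/.cache \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.85 \
    --port 8000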

3) If you’re sending 50k-token prompts, enable chunked prefill

Chunked prefill exists specifically to reduce the “one huge prompt blocks everything” behavior. (vLLM)

4) Temporarily disable tool calling flags to isolate the problem

Run a control test without:

  • --enable-auto-tool-choice
  • --tool-call-parser openai

If the “hang / no streaming / weird repetition” disappears, you’re likely hitting a GPT-OSS tool/streaming edge case reported in vLLM issues. (GitHub)

5) Verify the endpoint you are using matches what you expect for tools

If you need tool invocation, follow the GPT-OSS recipe guidance; it notes /v1/chat/completions “No tool will be invoked”. (vLLM)

6) Check for KV cache preemption and confirm via /metrics

  • Look for “not enough KV cache space … preempted … recompute” warnings. (vLLM)
  • Query metrics; key ones include vllm:kv_cache_usage_perc and vllm:num_requests_running. (vLLM)

If kv_cache_usage_perc is near 1.0 with long prompts, you will see severe latency and/or preemption. (vLLM)
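
A quick way to take that snapshot during a test run (assuming the port from your command; exact metric names can differ slightly between vLLM versions, hence the loose pattern):

  # One-shot snapshot of scheduler and KV-cache state.
  curl -s http://localhost:8000/metrics \
    | grep -E "num_requests_(running|waiting)|kv_cache_usage|gpu_cache_usage|num_preemptions"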

7) Consider vLLM version if you’re on a problematic release line

There are GPT-OSS-specific regressions reported (hangs / endless repetition) tied to certain versions. (GitHub)
Also, the GPT-OSS recipe shows a recommended deployment path using a newer vLLM docker tag (example: v0.12.0). (vLLM)


Direct answer: “big input context or cache?”

  • Primary cause of “never ending until max context”: completion not capped + large context window defaulting to a huge output budget. (GitHub)

  • Primary cause of “2 concurrent users triggers it”: --max-num-seqs 1 turns concurrency into queueing, so one long generation stalls everyone else. (vLLM)

  • Additional amplifiers:

    • GPT-OSS-120B streaming/tool-calling bugs in some vLLM versions. (GitHub)
    • KV-cache pressure leading to preemption/recompute under long contexts. (vLLM)

If you implement only two changes—(a) cap max_tokens in the client and (b) raise --max-num-seqs to match your target concurrency—you will usually eliminate the “runs until max context” behavior and make the remaining issue (tool/streaming bug vs KV preemption) immediately diagnosable using logs + /metrics. (vLLM)

thank you for your support!

I will make some improvements. As for --max-num-seqs 1:

I did set it to 1. When I say “concurrent”, I mean concurrent requests to the LLM, not concurrent inference: I am sending 1 to 15 requests to the LLM, but only 1 is in inference at a time. I am actually testing the performance when I set --max-num-seqs to 1, 2, and 4.


It seems possible that it could be a bug in the GPT-OSS model itself, so checking the metrics might be the fastest way to isolate the cause…


What your setup is actually testing

With --max-num-seqs 1, vLLM will run only one sequence/request on the GPU at a time (“maximum number of sequences per iteration”). (vLLM)

So when you send 1–15 “concurrent” HTTP requests, you’re mostly measuring:

  • queueing (requests waiting), plus
  • one request’s prefill + decode time, repeated.

This is consistent with your goal (“concurrent requests to the LLM, not concurrent inference”), but it also explains why a single bad/long-running request makes everything behind it look “hung”.

You can confirm queueing vs running with /metrics:

  • vllm:num_requests_running
  • vllm:num_requests_waiting
  • plus latency histograms (queue time, prefill time, decode time). (Red Hat Docs)

Why it “never ends until it hits max context length”

A) You’re likely allowing a massive completion (tens of thousands of tokens)

In vLLM’s OpenAI-compatible server, the default max_tokens behavior (when the client doesn’t explicitly set it) has been reported as:

default_max_tokens = context_window − prompt_tokens (GitHub)

With a 131k context window and a 50k prompt, that can mean ~80k output tokens are permitted. If you let a model talk for ~80k tokens, it’s common to see degeneration into repetition loops near the end, then finish_reason="length".

What this looks like in practice

  • With 1 request: it eventually returns, so it “works”.
  • With 2+ queued requests: request #2 waits “forever” because request #1 is still generating a huge completion.

How to prove it
Look at /metrics:

  • vllm:request_params_max_tokens
  • vllm:request_max_num_generation_tokens (Red Hat Docs)
    If those spike to very large values when you don’t set max_tokens, that’s your smoking gun.

Fix
Always set max_tokens / max_completion_tokens in the client load test (e.g., 256/512/1024) so “never ends” becomes impossible by construction.


Why it gets worse when you increase --max-num-seqs to 2 or 4

Once you allow multiple requests to be “RUNNING” on GPU, two large effects kick in:

B) Long prompts (50k) make prefill compute-heavy and can block responsiveness

vLLM users report that when a very long prompt approaches the context limit, prefill can cause other sessions to pause or get very slow; this is expected behavior. (vLLM Forums)

vLLM mitigates this using chunked prefill (splitting prompt processing into smaller chunks interleaved with decode), but tuning still matters. (vLLM Forums)

A common recommendation is to reduce max_num_batched_tokens (e.g., 2048–8192) so a single huge prefill doesn’t dominate the batch and starve other requests. (vLLM Forums)

C) KV-cache pressure → preemption/recompute → “stalls”

When multiple long-context sequences run concurrently, the KV cache requirement increases roughly proportional to:

  • number of concurrent sequences × (prompt length + generated length)

If vLLM doesn’t have enough KV cache space, it can preempt requests and later recompute them, which can drastically increase end-to-end latency. vLLM documents the exact warning and recommended actions. (vLLM)

The practical consequence:

  • --max-num-seqs 2 or 4 + 50k prompts can push you into frequent preemptions, which looks like “it’s stuck” (and throughput collapses).

How to confirm

  • Watch logs for: “preempted … because there is not enough KV cache space”.

  • Watch /metrics:

    • vllm:gpu_cache_usage_perc (near 1.0 means you’re full)
    • vllm:num_preemptions_total (Red Hat Docs)

vLLM’s recommended mitigations (in order of common use):

  • Increase gpu_memory_utilization
  • Decrease max_num_seqs or max_num_batched_tokens
  • Increase tensor/pipeline parallelism (more GPUs). (vLLM)

A separate high-probability issue in your exact flags: GPT-OSS + tool calling bugs/regressions

You’re running with:

  • --enable-auto-tool-choice
  • --tool-call-parser openai

There is an open vLLM bug report specifically for gpt-oss-120b where requests hang indefinitely with no output; the reporter explicitly used the same tool-calling flags and noted a regression (v0.10.2 worked; v0.11.0 didn’t). (GitHub)

So even if you fix max_tokens, you may still be hitting a tool-calling / output-path issue depending on your vLLM version.

Endpoint mismatch (important context)

vLLM’s GPT-OSS recipe says:

  • /v1/responses is recommended for tool use and uses openai-harmony for rendering/parsing. (vLLM)
  • For /v1/chat/completions, the recipe text says “No tool will be invoked…”. (vLLM)

If you’re load-testing /v1/chat/completions with tool-calling enabled, you’re operating in an area where behavior depends heavily on vLLM version and implementation details (and there are known GPT-OSS issues).

Practical recommendation for performance testing

  • If your benchmark does not require tool calls: disable auto tool choice / tool parsing to eliminate this entire class of failures.

One more configuration pitfall: your sampling overrides may not be applied

You used typographic quotes in:

  • --override-generation-config ‘{“temperature”: 0.2}’

Those often fail to parse in shells.

Also, vLLM’s OpenAI-compatible server applies generation_config.json from the HF repo by default, which can override sampling defaults unless you disable it with --generation-config vllm. (vLLM)

So you may not be running with the sampling behavior you think you are (which can affect repetition/degeneration under long generation).


Putting it together for your observed behavior

When --max-num-seqs 1

Most likely:

  1. One request is allowed to generate extremely long output (default max_tokens behavior). (GitHub)
  2. All other requests queue behind it (by design). (vLLM)
  3. The long-running request degenerates into repetition because it’s forced to produce tens of thousands of tokens.

When you raise --max-num-seqs to 2 or 4

You add:

  • prefill interference from 50k prompts (TTFT spikes for others). (vLLM Forums)
  • KV-cache contention and possibly preemption/recompute if cache is insufficient. (vLLM)
  • Potential GPT-OSS tool-calling regressions depending on version. (GitHub)

Concrete improvement plan (ordered; each step is independently useful)

1) Cap generation length in every request

For load tests, always include max_tokens (or max_completion_tokens) so outputs cannot run to 80k tokens by default. (GitHub)

2) Turn off tool calling for baseline throughput tests

Remove:

  • --enable-auto-tool-choice
  • --tool-call-parser openai

This avoids known GPT-OSS hangs/regressions tied to tool-call output paths. (GitHub)

3) Tune long-prompt responsiveness: set --max-num-batched-tokens

If you are mixing long prompts and multiple concurrent sequences, start with something like:

  • --max-num-batched-tokens 4096 (or 2048–8192)

This is specifically recommended to prevent long prefills from dominating and harming responsiveness. (vLLM Forums)
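
A sketch with illustrative values (4096 is a starting point to tune, not a universal recommendation; recent vLLM versions may already enable chunked prefill by default, in which case the explicit flag is redundant):

  # Cap how many prompt tokens one scheduler step may process, so a single
  # 50k-token prefill cannot starve the other running sequences.
  python -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --max-num-seqs 4 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 4096 \
    --port 8000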

4) Use /metrics to classify the bottleneck

Key indicators (a small polling sketch follows this list):

  • queueing: vllm:num_requests_waiting rising (Red Hat Docs)
  • KV saturation: vllm:gpu_cache_usage_perc near 1.0 (Red Hat Docs)
  • preemption: vllm:num_preemptions_total increases (Red Hat Docs)
  • runaway output: vllm:request_params_max_tokens huge (Red Hat Docs)
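
To watch these while the load test runs, a minimal polling sketch (port and interval are assumptions):

  # Poll the key counters every 5 seconds; "# HELP" / "# TYPE" lines are filtered out.
  while true; do
    date
    curl -s http://localhost:8000/metrics \
      | grep -E "num_requests_waiting|gpu_cache_usage_perc|num_preemptions_total|request_params_max_tokens" \
      | grep -v "^#"
    sleep 5
  done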

5) Make sampling deterministic and correctly applied

  • Fix quotes in --override-generation-config (use plain ASCII quotes).
  • If you want to ignore HF generation_config.json, add --generation-config vllm. (vLLM) A combined launch fragment is sketched below.
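
A sketch of that combined fragment (plain ASCII quotes; keep or drop --generation-config vllm depending on whether you want the HF repo defaults merged in at all):

  # Ignore the repo's generation_config.json and apply only the explicit override.
  python -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --generation-config vllm \
    --override-generation-config '{"temperature": 0.2}' \
    --port 8000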

6) If you truly need tools, align with the GPT-OSS recipe

  • Prefer /v1/responses for tool use (Harmony-based parsing) per the vLLM GPT-OSS recipe. (vLLM)
  • Expect that full streaming/stateful behavior is described as work-in-progress in the recipe. (vLLM)

Most likely answer to your “big input vs cache?” question

  • Primary cause of “never ending until max context”: runaway/too-large output budget (default max_tokens behavior) unless you explicitly cap it. (GitHub)
  • Primary cause of “it happens when I add concurrent requests”: queueing (when max-num-seqs=1) and/or prefill interference + KV-cache pressure (when max-num-seqs > 1). (vLLM Forums)
  • Additional risk specific to your flags: GPT-OSS tool-calling hang regressions in some vLLM versions. (GitHub)

If you implement only two changes—(1) cap max_tokens in the client, (2) disable tool calling for the baseline run—you should see the “never ending until 131k” behavior disappear immediately, and your remaining performance limits will show up clearly in /metrics (queue vs prefill vs KV/preemption). (Red Hat Docs)

When it comes to memory, I handled it well and made sure to give it enough for 1, 2, and 4 num-seqs.

As for the vLLM version, this alone might make a huge improvement. Weirdly, I noticed that even with Ollama, gpt-oss shows the same behavior unless I go back to an old version of Ollama.

Thank you for your help.

I will try my best to optimize it and come back with good news hopefully!
