It's possible this is a bug in the GPT-OSS model itself, but checking the metrics is probably the fastest way to isolate the cause…
What your setup is actually testing
With --max-num-seqs 1, vLLM will run only one sequence/request on the GPU at a time (“maximum number of sequences per iteration”). (vLLM)
So when you send 1–15 “concurrent” HTTP requests, you’re mostly measuring:
- queueing (requests waiting), plus
- one request’s prefill + decode time, repeated.
This is consistent with your goal (“concurrent requests to the LLM, not concurrent inference”), but it also explains why a single bad/long-running request makes everything behind it look “hung”.
You can confirm queueing vs running with /metrics:
vllm:num_requests_running
vllm:num_requests_waiting
- plus latency histograms (queue time, prefill time, decode time). (Red Hat Docs)
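For example, a quick check against the Prometheus endpoint (a sketch; localhost and port 8000 are assumptions about your deployment):

```bash
# Are requests actually running, or just queued behind one long generation?
curl -s http://localhost:8000/metrics \
  | grep -E '^vllm:num_requests_(running|waiting)'
```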
Why it “never ends until it hits max context length”
A) You’re likely allowing a massive completion (tens of thousands of tokens)
In vLLM’s OpenAI-compatible server, the default max_tokens behavior (when the client doesn’t explicitly set it) has been reported as:
default_max_tokens = context_window − prompt_tokens (GitHub)
With a 131k context window and a 50k prompt, that can mean ~80k output tokens are permitted. If you let a model talk for ~80k tokens, it’s common to see degeneration into repetition loops near the end, then finish_reason="length".
What this looks like in practice
- With 1 request: it eventually returns, so it “works”.
- With 2+ queued requests: request #2 waits “forever” because request #1 is still generating a huge completion.
How to prove it
Look at /metrics:
vllm:request_params_max_tokens
vllm:request_max_num_generation_tokens (Red Hat Docs)
If those spike to very large values when you don’t set max_tokens, that’s your smoking gun.
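A quick way to eyeball this (a sketch; assumes the default port and a single served model, since awk just keeps the last matching line):

```bash
# Average requested max_tokens across admitted requests (histogram _sum / _count).
# With no client-side cap and a 131k context, expect this to be enormous.
curl -s http://localhost:8000/metrics \
  | awk '/^vllm:request_params_max_tokens_sum/   {s=$2}
         /^vllm:request_params_max_tokens_count/ {c=$2}
         END {if (c > 0) printf "avg requested max_tokens: %.0f\n", s/c}'
```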
Fix
Always set max_tokens / max_completion_tokens in the client load test (e.g., 256/512/1024) so “never ends” becomes impossible by construction.
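For example (a sketch; the model name, port, and prompt are placeholders for your actual load-test payload):

```bash
# Load-test request with an explicit output cap: "never ends" becomes
# impossible by construction because generation stops at 512 tokens.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Summarize this document..."}],
        "max_tokens": 512,
        "temperature": 0.2
      }'
```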
Why it gets worse when you increase --max-num-seqs to 2 or 4
Once you allow multiple requests to be “RUNNING” on GPU, two large effects kick in:
B) Long prompts (50k) make prefill compute-heavy and can block responsiveness
vLLM users report that when a very long prompt approaches the context limit, prefill can cause other sessions to pause or get very slow; this is expected behavior. (vLLM Forums)
vLLM mitigates this using chunked prefill (splitting prompt processing into smaller chunks interleaved with decode), but tuning still matters. (vLLM Forums)
A common recommendation is to reduce max_num_batched_tokens (e.g., 2048–8192) so a single huge prefill doesn’t dominate the batch and starve other requests. (vLLM Forums)
C) KV-cache pressure → preemption/recompute → “stalls”
When multiple long-context sequences run concurrently, the KV-cache requirement grows roughly in proportion to:
- number of concurrent sequences × (prompt length + generated length)
If vLLM doesn’t have enough KV cache space, it can preempt requests and later recompute them, which can drastically increase end-to-end latency. vLLM documents the exact warning and recommended actions. (vLLM)
The practical consequence:
--max-num-seqs 2 or 4 combined with 50k-token prompts can push you into frequent preemptions, which looks like “it’s stuck” (and throughput collapses).
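A back-of-the-envelope check (the token counts below are illustrative assumptions, not measurements): compare the KV-token demand at full concurrency against the KV-cache capacity vLLM reports at startup.

```bash
# Rough KV-token demand under concurrency; if this exceeds the KV-cache
# capacity shown in the vLLM startup logs, expect preemption/recompute.
MAX_NUM_SEQS=4        # --max-num-seqs
PROMPT_TOKENS=50000   # per-request prompt length
MAX_NEW_TOKENS=1024   # per-request output cap
echo "$(( MAX_NUM_SEQS * (PROMPT_TOKENS + MAX_NEW_TOKENS) )) KV tokens needed at full concurrency"
```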
How to confirm and mitigate
Watch the server logs for vLLM’s documented preemption warning and check vllm:num_preemptions_total in /metrics. vLLM’s recommended mitigations (in order of common use):
- Increase gpu_memory_utilization
- Decrease max_num_seqs or max_num_batched_tokens
- Increase tensor/pipeline parallelism (more GPUs). (vLLM)
A separate high-probability issue in your exact flags: GPT-OSS + tool calling bugs/regressions
You’re running with:
--enable-auto-tool-choice
--tool-call-parser openai
There is an open vLLM bug report specifically for gpt-oss-120b where requests hang indefinitely with no output; the reporter explicitly used the same tool-calling flags and noted a regression (v0.10.2 worked; v0.11.0 didn’t). (GitHub)
So even if you fix max_tokens, you may still be hitting a tool-calling / output-path issue depending on your vLLM version.
Endpoint mismatch (important context)
vLLM’s GPT-OSS recipe says:
/v1/responses is recommended for tool use and uses openai-harmony for rendering/parsing. (vLLM)
/v1/chat/completions: the recipe text states “No tool will be invoked…” for that endpoint. (vLLM)
If you’re load-testing /v1/chat/completions with tool-calling enabled, you’re operating in an area where behavior depends heavily on vLLM version and implementation details (and there are known GPT-OSS issues).
Practical recommendation for performance testing
- If your benchmark does not require tool calls: disable auto tool choice / tool parsing to eliminate this entire class of failures.
One more configuration pitfall: your sampling overrides may not be applied
You used typographic quotes in:
--override-generation-config ‘{“temperature”: 0.2}’
Shells don’t treat typographic quotes as quoting characters, so an argument like this often fails to parse as intended.
Also, vLLM’s OpenAI-compatible server applies generation_config.json from the HF repo by default, which can override sampling defaults unless you disable it with --generation-config vllm. (vLLM)
So you may not be running with the sampling behavior you think you are (which can affect repetition/degeneration under long generation).
Putting it together for your observed behavior
When --max-num-seqs 1
Most likely:
- One request is allowed to generate extremely long output (default max_tokens behavior). (GitHub)
- All other requests queue behind it (by design). (vLLM)
- The long-running request degenerates into repetition because it’s forced to produce tens of thousands of tokens.
When you raise --max-num-seqs to 2 or 4
You add:
- prefill interference from 50k prompts (TTFT spikes for others). (vLLM Forums)
- KV-cache contention and possibly preemption/recompute if cache is insufficient. (vLLM)
- Potential GPT-OSS tool-calling regressions depending on version. (GitHub)
Concrete improvement plan (ordered; each step is independently useful)
1) Cap generation length in every request
For load tests, always include max_tokens (or max_completion_tokens) so outputs cannot run to 80k tokens by default. (GitHub)
2) Turn off tool calling for baseline throughput tests
Remove:
--enable-auto-tool-choice
--tool-call-parser openai
This avoids known GPT-OSS hangs/regressions tied to tool-call output paths. (GitHub)
3) Tune long-prompt responsiveness: set --max-num-batched-tokens
If you are mixing long prompts and multiple concurrent sequences, start with something like:
--max-num-batched-tokens 4096 (or 2048–8192)
This is specifically recommended to prevent long prefills from dominating and harming responsiveness. (vLLM Forums)
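Putting steps 2 and 3 together, a baseline server invocation might look like this (a sketch; the model path and GPU settings are assumptions to adjust for your deployment, and the tool-calling flags are deliberately omitted for the baseline run):

```bash
# Baseline throughput run: no tool-calling flags, bounded prefill chunk size.
vllm serve openai/gpt-oss-120b \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90
```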
4) Use /metrics to classify the bottleneck
Key indicators:
- queueing: vllm:num_requests_waiting rising (Red Hat Docs)
- KV saturation: vllm:gpu_cache_usage_perc near 1.0 (Red Hat Docs)
- preemption: vllm:num_preemptions_total increasing (Red Hat Docs)
- runaway output: vllm:request_params_max_tokens very large (Red Hat Docs)
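One way to watch all four while the load test runs (a sketch; assumes the default port and that watch is available):

```bash
# Poll the key counters every 5 seconds during the load test.
watch -n 5 "curl -s http://localhost:8000/metrics \
  | grep -E 'vllm:(num_requests_waiting|gpu_cache_usage_perc|num_preemptions_total|request_params_max_tokens_count)'"
```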
5) Make sampling deterministic and correctly applied
- Fix the quotes in --override-generation-config (use plain ASCII quotes, as in the example below).
- If you want to ignore the HF repo’s generation_config.json, add --generation-config vllm. (vLLM)
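For example (flags shown against a placeholder model path; merge them into your existing vllm serve command):

```bash
# Plain ASCII quotes around the JSON override, and ignore the HF repo's
# generation_config.json in favor of vLLM defaults.
vllm serve openai/gpt-oss-120b \
  --override-generation-config '{"temperature": 0.2}' \
  --generation-config vllm
```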
6) If you truly need tools, align with the GPT-OSS recipe
- Prefer /v1/responses for tool use (Harmony-based parsing) per the vLLM GPT-OSS recipe; see the request sketch after this list. (vLLM)
- Expect that full streaming/stateful behavior is described as work-in-progress in the recipe. (vLLM)
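A minimal request sketch, assuming the OpenAI-style Responses schema (the model name, port, and the max_output_tokens parameter are assumptions to verify against your vLLM version’s docs):

```bash
# Responses API request; add a "tools" array for actual tool use.
curl -s http://localhost:8000/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-120b",
        "input": "What is the weather in Paris today?",
        "max_output_tokens": 512
      }'
```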
Most likely answer to your “big input vs cache?” question
- Primary cause of “never ending until max context”: runaway/too-large output budget (default max_tokens behavior) unless you explicitly cap it. (GitHub)
- Primary cause of “it happens when I add concurrent requests”: queueing (when max-num-seqs=1) and/or prefill interference + KV-cache pressure (when max-num-seqs > 1). (vLLM Forums)
- Additional risk specific to your flags: GPT-OSS tool-calling hang regressions in some vLLM versions. (GitHub)
If you implement only two changes—(1) cap max_tokens in the client, (2) disable tool calling for the baseline run—you should see the “never ending until 131k” behavior disappear immediately, and your remaining performance limits will show up clearly in /metrics (queue vs prefill vs KV/preemption). (Red Hat Docs)