vLLM Migration Guide: Qwen3-Next-80B on NVIDIA Blackwell (GB10)
From Ollama Unloading Issues to vLLM Success
Overview
This guide documents our journey deploying Qwen3-Next-80B on the NVIDIA DGX Spark (GB10): from critical model unloading issues with Ollama, through failed TensorRT-LLM attempts, to a stable deployment with vLLM.
The Problem: On a single DGX Spark (GB10) system, Ollama would intermittently unload the 80B model, causing service disruptions and requiring constant monitoring. TensorRT-LLM attempts also failed due to compatibility issues with Qwen3-Next's MoE architecture.
The Solution: vLLM with OpenAI-compatible API, providing permanent VRAM reservation, PagedAttention parallelism, and native Blackwell (sm_121) acceleration.
The Journey: From Ollama to vLLM
Phase 1: The Ollama Unloading Problem
Initial Setup: We deployed Qwen3-Next-80B using Ollama on DGX Spark (GB10) with 120GB VRAM.
The Issue:
- Model would unload unexpectedly during idle periods
- Service would fail when requests arrived after model unload
- Required constant "keep-alive" requests to prevent unloading
- No reliable way to ensure model permanence in VRAM
Symptoms:
```
[Ollama] Model "nika" unloaded from memory
[Service] Request failed: Model not loaded
[Service] Attempting to reload model...
[Service] Reload takes 30-60 seconds
```
Root Cause: Ollama's memory management is designed for multi-model scenarios and doesn't guarantee permanent model retention, especially for large 80B models.
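Before migrating, the only mitigation was a watchdog that pinged Ollama on an interval so the model never sat idle. A minimal sketch of that workaround (the interval and helper names are illustrative; `keep_alive: -1` asks Ollama to keep the model loaded indefinitely, which in our experience was still not a reliable guarantee for the 80B model):

```python
import time

import requests

OLLAMA_HOST = "http://localhost:11434"  # default Ollama port
PING_INTERVAL_S = 60                    # illustrative interval


def keep_alive_payload(model: str) -> dict:
    """Build a minimal /api/generate request that only touches the model."""
    return {"model": model, "prompt": "", "keep_alive": -1}


def watchdog(model: str = "nika") -> None:
    """Ping Ollama forever so the model is never idle long enough to unload."""
    while True:
        try:
            requests.post(f"{OLLAMA_HOST}/api/generate",
                          json=keep_alive_payload(model), timeout=30)
        except requests.RequestException as exc:
            print(f"keep-alive ping failed: {exc}")
        time.sleep(PING_INTERVAL_S)
```

Running this in a side thread reduced, but never eliminated, the unloads — which is what motivated the move away from Ollama entirely.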
Phase 2: TensorRT-LLM Attempt
Why We Tried TensorRT-LLM:
- Promised better performance on Blackwell architecture
- Lower latency potential
- Better memory efficiency
The Failure:
- Qwen3-Next's MoE (Mixture of Experts) architecture not fully supported
- Shared expert layers caused loading failures
- Weight key mapping issues (`model.` prefix mismatch)
- Incomplete MoE routing implementation
Error Messages:
```
[TensorRT-LLM] Error: Shared expert layers not recognized
[TensorRT-LLM] Error: Weight key mismatch: model.layers.0 vs layers.0
[TensorRT-LLM] Error: MoE routing not implemented for Qwen3-Next
```
Decision: Abandoned TensorRT-LLM due to architectural incompatibility with Qwen3-Next's MoE structure.
Phase 3: vLLM Success
Why vLLM Worked:
- Native support for MoE architectures
- Permanent VRAM reservation (model stays loaded)
- OpenAI-compatible API (easy migration)
- PagedAttention for high throughput
- Active development and Qwen3-Next support
The Solution:
- vLLM server runs as a separate service
- Model loaded once at server start, stays in VRAM permanently
- Application connects via OpenAI-compatible API
- No more unloading issues
Qwen3-Next-80B MoE Architecture
Understanding the architecture is crucial for deployment:
Architecture Overview
Qwen3-Next-80B uses a sophisticated Mixture of Experts (MoE) architecture:
- Total Parameters: ~80B
- Active Parameters (A3B): Only ~3B parameters are active per token
- Shared Experts: Hybrid approach with both "routed experts" and "shared experts"
- Benefits: Fast inference for its size while maintaining global knowledge
Why This Matters
- Memory Efficiency: Only 3B active parameters per token, but full 80B model must stay in VRAM
- Shared Experts: Maintain global knowledge across all tokens
- Weight Mapping: Special handling required for shared expert layers
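To make the routed-plus-shared design concrete, here is a toy sketch of MoE token routing in pure Python (the expert count and top-k value are illustrative, not Qwen3-Next's actual configuration):

```python
import math


def top_k_route(router_logits, k=2):
    """Softmax the router logits, keep the top-k experts, and
    renormalize their weights so they sum to 1."""
    shifted = [x - max(router_logits) for x in router_logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, routed_experts, shared_expert, router_logits, k=2):
    """Combine the always-on shared expert with the k routed experts
    selected for this token; only those k experts do any work."""
    out = shared_expert(token)
    for idx, weight in top_k_route(router_logits, k):
        out += weight * routed_experts[idx](token)
    return out
```

Only the selected experts run per token — which is why only ~3B of the 80B parameters are active — yet every expert's weights must stay in VRAM so the router can select them on the next token.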
Critical Technical Challenges & Solutions
1. The Weight Key Mismatch (the `model.` Prefix)
Problem: Hugging Face checkpoints save weights with a `model.` prefix (e.g., `model.layers.10...`), while vLLM's internal Qwen2/3 implementation expects keys to start directly with `layers.10...`.
Solution: vLLM's weight loader handles this automatically, but you may need to ensure checkpoint format compatibility.
Example:
```python
# Weight key transformation needed
# Before: "model.layers.10.attention.q_proj.weight"
# After:  "layers.10.attention.q_proj.weight"
```
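If you do need to normalize a checkpoint yourself, the transformation is a pure key rename. A minimal sketch operating on an in-memory state dict (`strip_model_prefix` is our illustrative helper, not a vLLM API):

```python
def strip_model_prefix(state_dict: dict) -> dict:
    """Return a copy of the state dict with the leading 'model.' removed
    from each key; keys without the prefix pass through unchanged."""
    prefix = "model."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```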
2. Shared Expert Mapping
Problem: Standard loaders may fail to recognize `shared_expert` layers.
Solution: vLLM's AutoWeightsLoader correctly handles Qwen3-Next's MoE structure, including shared experts.
Example:
```python
# Shared expert layers in Qwen3-Next
shared_expert_layers = [
    "mlp.shared_expert.gate_proj.weight",
    "mlp.shared_expert.up_proj.weight",
    "mlp.shared_expert.down_proj.weight",
]
```
3. Model Name Resolution
Problem: Application uses friendly names like "nika", but vLLM needs actual model paths.
Solution: Implement model name resolution in VLLMService.
Blackwell (GB10) Specific Optimizations
The DGX Spark (GB10) is the first hardware to support NVFP4. To maximize performance for an 80B model:
1. Quantization Strategy: GPTQ-Int4A16
Using GPTQ Int4 weight-only quantization allows the 80B model to fit comfortably in 120GB VRAM:
- Model weights: ~40GB (quantized)
- KV Cache: ~40GB (for high throughput)
- System overhead: ~40GB
- Total: ~120GB (perfect fit for GB10)
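The budget above is simple arithmetic: 4-bit weights take half a byte per parameter, so 80B parameters land around 40GB. A quick back-of-envelope check (numbers mirror the list above):

```python
# 4-bit (Int4) weight-only quantization: 0.5 bytes per parameter
total_params = 80e9
weights_gb = total_params * 0.5 / 1e9   # quantized weight footprint

kv_cache_gb = 40.0   # reserved for a high-throughput KV cache
overhead_gb = 40.0   # system / runtime overhead
total_gb = weights_gb + kv_cache_gb + overhead_gb

print(f"weights={weights_gb:.0f}GB, total={total_gb:.0f}GB")  # weights=40GB, total=120GB
```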
2. Critical Environment Flags
For Blackwell (sm_121), these environment variables are mandatory:
```bash
# Force the Triton compiler to find the correct Blackwell ptxas
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_USE_FLASHINFER_SAMPLER=1  # Enable Blackwell-optimized kernels
export VLLM_USE_FLASHINFER_MOE=0      # Temporary workaround for MoE kernels if needed
export TRITON_INTERPRET=0             # Disable interpreter mode (critical for performance)
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
Implementation Guide
1. vLLM Server Setup
File: bin/qwen3_next_80b_gptq.sh
```bash
#!/bin/bash
# Optimized for Qwen3-Next-80B-GPTQ-Int4 on DGX Spark (Blackwell)

# 1. CUDA 13.0 pathing (must include nvcc for JIT)
export CUDA_HOME=/usr/local/cuda-13.0
export CUDA_PATH=/usr/local/cuda-13.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=$CUDA_HOME/include
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include

# 2. Prevent Triton 'resize_()' errors by disabling interpreter mode
export TRITON_INTERPRET=0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# 3. Blackwell-specific optimizations
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE=0  # Disable if MoE kernels cause issues
export TRITON_DISABLE_LINE_INFO=1
export TORCH_COMPILE_DEBUG=0
export TORCHDYNAMO_DISABLE=1

# 4. Launch with Blackwell-specific tuning
# --enforce-eager bypasses CUDA graph capture overhead on the new sm_121 arch
python3 -m vllm.entrypoints.openai.api_server \
    --model dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --host 0.0.0.0 \
    --port 1107
```
Key Parameters:
- `--gpu-memory-utilization 0.92`: leaves headroom for the KV cache (critical for an 80B model)
- `--enforce-eager`: better stability on Blackwell (bypasses CUDA graph capture)
- `--max-model-len 8192`: adjust based on your use case
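Once the server is up, a quick readiness probe against the OpenAI-compatible `/v1/models` endpoint confirms the model is loaded and serving. A minimal sketch (port 1107 matches the launch script; the `server_ready` helper name is ours):

```python
import requests


def server_ready(base_url: str = "http://localhost:1107") -> bool:
    """Return True if the vLLM server answers /v1/models with HTTP 200."""
    try:
        return requests.get(f"{base_url}/v1/models", timeout=5).status_code == 200
    except requests.RequestException:
        return False
```

This is the same check the service class below performs in `check_connection()`.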
2. Application Service Class
File: api/services/vllm_service.py
```python
import logging
import threading
import time
from typing import Any, Dict, List, Optional

import requests

logger = logging.getLogger(__name__)


class VLLMService:
    """vLLM API service (OpenAI-compatible)."""

    def __init__(self, host: str, model_name: str, embedding_model: str):
        self.host = host.rstrip('/')
        self.embedding_model = embedding_model
        self.api_base = f"{self.host}/v1"
        self._warmed_models = set()
        self._warm_up_lock = threading.Lock()
        # Convert the friendly model name to the actual vLLM model path
        self.model_name = self._resolve_model_name(model_name)

    def _resolve_model_name(self, model_name: str) -> str:
        """Convert a friendly model name to the actual vLLM model path.

        - "nika" -> actual model path served by the vLLM server
        - If already a full path, return as-is
        """
        if "/" in model_name:  # Already a full path
            return model_name
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            if response.status_code == 200:
                models_data = response.json()
                for model_info in models_data.get("data", []):
                    model_id = model_info.get("id")
                    if not model_id:
                        continue
                    # Special logic for NIKA (Qwen3-Next-80B)
                    if "nika" in model_name.lower() and "qwen3" in model_id.lower():
                        logger.info(f"Resolved NIKA to Qwen3-Next-80B: {model_id}")
                        return model_id
                    if model_name.lower() in model_id.lower():
                        return model_id
        except Exception as exc:
            logger.warning(f"Failed to resolve model name '{model_name}': {exc}")
        return model_name

    def check_connection(self) -> bool:
        """Check the vLLM server connection."""
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            return response.status_code == 200
        except Exception:
            return False

    def generate_response(
        self,
        messages: List[Dict[str, str]],
        max_tokens: int = 512,
        temperature: float = 0.7,
        top_p: float = 0.9,
        top_k: Optional[int] = None,
        num_ctx: Optional[int] = None,
        stop: Optional[List[str]] = None,
        model_name: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Generate a response via the vLLM OpenAI-compatible API."""
        request_id = f"req_{int(time.time() * 1000)}"
        # Resolve the model name if one was passed per-request
        model_to_use = self._resolve_model_name(model_name) if model_name else self.model_name
        logger.info(f"[{request_id}] Generating response via vLLM API... (model={model_to_use})")
        # Normalize messages to the OpenAI chat format
        openai_messages = [
            {"role": msg.get("role", "user"), "content": msg.get("content", "")}
            for msg in messages
        ]
        api_url = f"{self.api_base}/chat/completions"
        # Stop sequence settings
        default_stop = ["\n\nUser:", "Observation:", "<|endoftext|>", "<|eot_id|>", "\n\n\n"]
        stop_sequences = stop if stop is not None else default_stop
        payload = {
            "model": model_to_use,
            "messages": openai_messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stop": stop_sequences,
            "stream": False,
        }
        try:
            inference_start = time.time()
            response = requests.post(api_url, json=payload, timeout=300)
            inference_time = time.time() - inference_start
            if response.status_code != 200:
                error_msg = f"vLLM API error: {response.status_code} - {response.text}"
                logger.error(f"[{request_id}] {error_msg}")
                raise RuntimeError(error_msg)
            result = response.json()
            choices = result.get("choices", [])
            if not choices:
                raise RuntimeError("vLLM API returned empty choices")
            choice = choices[0]
            response_text = choice.get("message", {}).get("content", "")
            # Token usage information
            usage = result.get("usage", {})
            prompt_tokens = usage.get("prompt_tokens", 0)
            completion_tokens = usage.get("completion_tokens", 0)
            # Detect whether the response was cut off by the token limit
            finish_reason = choice.get("finish_reason", "")
            was_truncated = finish_reason == "length" or completion_tokens >= (max_tokens * 0.95)
            logger.info(f"[{request_id}] [TIMING] vLLM API call: {inference_time * 1000:.2f}ms")
            logger.info(f"[{request_id}] [STATS] Prompt: {prompt_tokens} tokens | Completion: {completion_tokens} tokens")
            return {
                "response_text": response_text,
                "was_truncated": was_truncated,
                "eval_count": completion_tokens,
                "max_tokens": max_tokens,
            }
        except requests.exceptions.RequestException as exc:
            logger.error(f"[{request_id}] Request error: {exc}")
            raise
```
3. API Endpoint Migration
Before (Ollama):
```python
# Ollama endpoint
response = requests.post(
    f"{ollama_host}/api/chat",
    json={
        "model": "nika",
        "messages": [...],
        "options": {
            "num_predict": 512,
            "temperature": 0.7,
            "keep_alive": -1  # Try to keep model loaded
        }
    }
)
```
After (vLLM):
```python
# vLLM OpenAI-compatible endpoint
response = requests.post(
    f"{vllm_host}/v1/chat/completions",
    json={
        "model": "/root/.cache/huggingface/hub/models--dazipe--Qwen3-Next-80B...",
        "messages": [...],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9
        # No keep_alive needed - model stays loaded permanently
    }
)
```
Troubleshooting
Q: Why does nvidia-smi show "Failed to initialize NVML"?
A: This is often due to a "zombie" process holding the GB10 SoC. Run:
```bash
sudo fuser -v /dev/nvidia*
```
Kill the PIDs if found. If it persists, a cold reboot of the DGX Spark is required to re-initialize the Blackwell firmware.
Q: vLLM is extremely slow (1 token/sec).
A: You are likely in Triton Interpreter Mode. This happens if nvcc is not found during startup. Ensure cuda-toolkit-13-0 is installed and the PATH is correctly set:
```bash
which nvcc  # Should output: /usr/local/cuda-13.0/bin/nvcc
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
```
Q: Model loading fails with "KeyError: 'model.layers.0'"
A: This is the weight key mismatch issue: the checkpoint keys carry a `model.` prefix, but vLLM expects keys without it. Solutions:
- Use a checkpoint converter to strip the prefix
- Patch vLLM's weight loader to handle both formats
- Use a different checkpoint format (if available)
Q: Shared expert layers not loading
A: Ensure your vLLM version supports Qwen3-Next MoE architecture. You may need to:
- Update vLLM to the latest version
- Apply custom patches for shared expert handling
- Use the `--trust-remote-code` flag (already in the launch script)
Q: Model still unloads (like Ollama)
A: This shouldn't happen with vLLM. If it does:
- Check vLLM server logs for memory pressure
- Verify `--gpu-memory-utilization` is not set too high
- Check for other processes using GPU memory
- Ensure vLLM server process is not being killed
Comparison: Ollama vs TensorRT-LLM vs vLLM
| Feature | Ollama | TensorRT-LLM | vLLM |
|---|---|---|---|
| 80B Model Loading | ❌ Intermittent unloading | ❌ Failed (MoE issues) | ✅ Permanent VRAM reservation |
| Throughput | Sequential | N/A (failed) | ✅ PagedAttention parallelism |
| Blackwell Support | Generic | ✅ Optimized | ✅ sm_121 native acceleration |
| Architecture | Dense-focused | ❌ MoE incomplete | ✅ MoE optimized (shared experts) |
| Quantization | Limited options | Good | ✅ GPTQ-Int4A16 optimized |
| Memory Efficiency | ~100GB+ VRAM | N/A | ✅ ~80GB VRAM (with quantization) |
| API Compatibility | Custom | Custom | ✅ OpenAI-compatible |
| Stability | ❌ Unloading issues | ❌ Failed | ✅ Stable |
Key Learnings
Why Ollama Failed
- Memory Management: Ollama's design prioritizes flexibility over permanence
- Multi-Model Focus: Optimized for switching between models, not keeping one loaded
- No Guarantee: No reliable way to ensure 80B model stays in VRAM
Why TensorRT-LLM Failed
- MoE Support: Incomplete implementation for Qwen3-Next's MoE architecture
- Shared Experts: Not properly handled in TensorRT-LLM's weight loader
- Architecture Mismatch: Designed more for dense models
Why vLLM Succeeded
- Permanent Loading: Model loaded once, stays in VRAM permanently
- MoE Native: Full support for Qwen3-Next's MoE architecture
- OpenAI API: Easy migration path from existing code
- Active Development: Regular updates and Qwen3-Next support
- Blackwell Optimized: Native support for sm_121 architecture
Migration Checklist
Pre-Migration
- Identify all Ollama API calls in your codebase
- Document current model loading/unloading behavior
- Verify vLLM server can be installed and run
- Test vLLM with a smaller model first
Migration Steps
1. Install vLLM: `pip install vllm`
2. Start the vLLM server: `bash bin/qwen3_next_80b_gptq.sh`
3. Update the service class:
- Replace `OllamaService` with `VLLMService`
- Update API endpoints from `/api/chat` to `/v1/chat/completions`
- Update the JSON payload format
4. Update configuration:
- Change `OLLAMA_HOST` to `VLLM_HOST`
- Update model name resolution logic
5. Test thoroughly:
- Verify model stays loaded
- Test response generation
- Monitor memory usage
- Check for any unloading issues
Post-Migration
- Monitor for 24-48 hours to ensure stability
- Verify no model unloading occurs
- Check performance metrics
- Update documentation
References
vLLM Documentation
Qwen3-Next Model
Blackwell Architecture
Success Metrics
After migrating to vLLM:
- ✅ Zero model unloading incidents (previously daily occurrences)
- ✅ Stable service (no more "model not loaded" errors)
- ✅ Better throughput (PagedAttention parallelism)
- ✅ Lower latency (no model reload overhead)
- ✅ Simpler architecture (no keep-alive logic needed)
Last Updated: 2025-12-31
Author: Jong-Seong Kim (김종성)
Project: LANIKA / NIKA AI
Hardware: NVIDIA DGX Spark (GB10) - Single System Deployment