# vLLM Migration Guide: Qwen3-Next-80B on NVIDIA Blackwell (GB10)

### *From Ollama Unloading Issues to vLLM Success*

## 📋 Overview

This guide documents our journey deploying **Qwen3-Next-80B** on the **NVIDIA DGX Spark (GB10)**: from critical model unloading issues with Ollama, through failed TensorRT-LLM attempts, to a stable deployment with vLLM.

**The Problem**: On a single DGX Spark (GB10) system, Ollama would intermittently unload the 80B model, causing service disruptions and requiring constant monitoring. TensorRT-LLM attempts also failed due to compatibility issues with Qwen3-Next's MoE architecture.

**The Solution**: vLLM with an OpenAI-compatible API, providing permanent VRAM reservation, PagedAttention parallelism, and native Blackwell (sm_121) acceleration.

---

## 🚨 The Journey: From Ollama to vLLM

### Phase 1: The Ollama Unloading Problem

**Initial Setup**: We deployed Qwen3-Next-80B using Ollama on a DGX Spark (GB10) with 120GB VRAM.

**The Issue**:
- The model would unload unexpectedly during idle periods
- Requests arriving after an unload would fail
- Constant "keep-alive" requests were required to prevent unloading
- There was no reliable way to keep the model permanently in VRAM

**Symptoms**:

```
[Ollama] Model "nika" unloaded from memory
[Service] Request failed: Model not loaded
[Service] Attempting to reload model...
[Service] Reload takes 30-60 seconds
```

**Root Cause**: Ollama's memory management is designed for multi-model scenarios and doesn't guarantee permanent model retention, especially for a large 80B model.
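Before the migration, our only mitigation was a periodic keep-alive ping. A minimal sketch of that workaround is below; the `/api/generate` endpoint and `keep_alive: -1` semantics are Ollama's, while the interval, the stdlib HTTP client, and the `keep_alive` helper name are our own assumptions:

```python
import json
import threading
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default Ollama port; adjust as needed

def keep_alive(model: str = "nika", interval_s: float = 120.0) -> threading.Timer:
    """Periodically re-ping Ollama so the model is not evicted while idle."""
    # An empty prompt with keep_alive=-1 asks Ollama to (re)load the model
    # and keep it resident indefinitely.
    body = json.dumps({"model": model, "prompt": "", "keep_alive": -1}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=30)
    except OSError:
        pass  # server unreachable: simply retry on the next tick
    timer = threading.Timer(interval_s, keep_alive, args=(model, interval_s))
    timer.daemon = True
    timer.start()
    return timer
```

With vLLM this loop disappears entirely: the server holds the model in VRAM for its whole lifetime, so there is nothing to keep alive.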
---

### Phase 2: TensorRT-LLM Attempt

**Why We Tried TensorRT-LLM**:
- Promised better performance on the Blackwell architecture
- Lower latency potential
- Better memory efficiency

**The Failure**:
- Qwen3-Next's MoE (Mixture of Experts) architecture was not fully supported
- Shared expert layers caused loading failures
- Weight key mapping issues (`model.` prefix mismatch)
- Incomplete MoE routing implementation

**Error Messages**:

```
[TensorRT-LLM] Error: Shared expert layers not recognized
[TensorRT-LLM] Error: Weight key mismatch: model.layers.0 vs layers.0
[TensorRT-LLM] Error: MoE routing not implemented for Qwen3-Next
```

**Decision**: We abandoned TensorRT-LLM due to its architectural incompatibility with Qwen3-Next's MoE structure.

---

### Phase 3: vLLM Success

**Why vLLM Worked**:
- Native support for MoE architectures
- Permanent VRAM reservation (the model stays loaded)
- OpenAI-compatible API (easy migration)
- PagedAttention for high throughput
- Active development and Qwen3-Next support

**The Solution**:
- The vLLM server runs as a separate service
- The model is loaded once at server start and stays in VRAM permanently
- The application connects via the OpenAI-compatible API
- No more unloading issues

---

## 🏗️ Qwen3-Next-80B MoE Architecture

Understanding the architecture is crucial for deployment:

### Architecture Overview

Qwen3-Next-80B uses a sophisticated **Mixture of Experts (MoE)** architecture:

- **Total Parameters**: ~80B
- **Active Parameters (A3B)**: Only ~3B parameters are active per token
- **Shared Experts**: A hybrid approach with both "routed experts" and "shared experts"
- **Benefits**: Fast inference for its size while maintaining global knowledge

### Why This Matters

1. **Memory Efficiency**: Only ~3B parameters are active per token, but the full 80B model must stay in VRAM
2. **Shared Experts**: Maintain global knowledge across all tokens
3. **Weight Mapping**: Special handling is required for shared expert layers

---

## 🛠️ Critical Technical Challenges & Solutions

### 1. The Weight Key Mismatch (The `model.` Prefix)

**Problem**: Hugging Face checkpoints save weights with a `model.` prefix (e.g., `model.layers.10...`), while vLLM's internal `Qwen2/3` implementation expects keys to start directly with `layers.10...`.

**Solution**: vLLM's weight loader handles this automatically, but you may need to ensure checkpoint format compatibility.

**Example**:

```python
# Weight key transformation needed
# Before: "model.layers.10.attention.q_proj.weight"
# After:  "layers.10.attention.q_proj.weight"
```

### 2. Shared Expert Mapping

**Problem**: Standard loaders may fail to recognize `shared_expert` layers.

**Solution**: vLLM's `AutoWeightsLoader` correctly handles Qwen3-Next's MoE structure, including shared experts.

**Example**:

```python
# Shared expert layers in Qwen3-Next
shared_expert_layers = [
    "mlp.shared_expert.gate_proj.weight",
    "mlp.shared_expert.up_proj.weight",
    "mlp.shared_expert.down_proj.weight"
]
```

### 3. Model Name Resolution

**Problem**: The application uses friendly names like "nika", but vLLM needs actual model paths.

**Solution**: Implement model name resolution in `VLLMService`.

---

## ⚡ Blackwell (GB10) Specific Optimizations

The **DGX Spark (GB10)** is the first hardware to support **NVFP4**. To maximize performance for an 80B model:

### 1. Quantization Strategy: GPTQ-Int4A16

Using **GPTQ Int4 weight-only quantization** allows the 80B model to fit comfortably in 120GB VRAM:

- Model weights: ~40GB (quantized)
- KV cache: ~40GB (for high throughput)
- System overhead: ~40GB
- **Total**: ~120GB (a perfect fit for the GB10)

### 2. Critical Environment Flags

For Blackwell (sm_121), these environment variables are **mandatory**:

```bash
# Force the Triton compiler to find the correct Blackwell ptxas
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_USE_FLASHINFER_SAMPLER=1   # Enable Blackwell-optimized kernels
export VLLM_USE_FLASHINFER_MOE=0       # Temporary workaround for MoE kernels if needed
export TRITON_INTERPRET=0              # Disable interpreter mode (critical for performance)
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```

---

## 💻 Implementation Guide

### 1. vLLM Server Setup

**File**: `bin/qwen3_next_80b_gptq.sh`

```bash
#!/bin/bash
# Optimized for Qwen3-Next-80B-GPTQ-Int4 on DGX Spark (Blackwell)

# 1. CUDA 13.0 pathing (must include nvcc for JIT)
export CUDA_HOME=/usr/local/cuda-13.0
export CUDA_PATH=/usr/local/cuda-13.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=$CUDA_HOME/include
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include

# 2. Prevent Triton 'resize_()' errors by disabling interpreter mode
export TRITON_INTERPRET=0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# 3. Blackwell-specific optimizations
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE=0   # Disable if MoE kernels cause issues
export TRITON_DISABLE_LINE_INFO=1
export TORCH_COMPILE_DEBUG=0
export TORCHDYNAMO_DISABLE=1

# 4. Launch with Blackwell-specific tuning
# --enforce-eager bypasses CUDA graph capture overhead on the new sm_121 arch
python3 -m vllm.entrypoints.openai.api_server \
    --model dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --host 0.0.0.0 \
    --port 1107
```

**Key Parameters**:
- `--gpu-memory-utilization 0.92`: Leaves room for the KV cache (critical for an 80B model)
- `--enforce-eager`: Better stability on Blackwell (bypasses CUDA graph capture)
- `--max-model-len 8192`: Adjust based on your use case

### 2. Application Service Class

**File**: `api/services/vllm_service.py`

```python
import logging
import threading
import time
from typing import Any, Dict, List, Optional

import requests

logger = logging.getLogger(__name__)


class VLLMService:
    """vLLM API Service (OpenAI Compatible)"""

    def __init__(self, host: str, model_name: str, embedding_model: str):
        self.host = host.rstrip('/')
        self.embedding_model = embedding_model
        self.api_base = f"{self.host}/v1"
        self._warmed_models = set()
        self._warm_up_lock = threading.Lock()
        # Convert model name to actual vLLM model path
        self.model_name = self._resolve_model_name(model_name)

    def _resolve_model_name(self, model_name: str) -> str:
        """
        Convert a model name to the actual vLLM model path:
        - "nika" -> actual model path available on the vLLM server
        - If already a full path, return as-is
        """
        if "/" in model_name:
            # Already a full path
            return model_name

        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            if response.status_code == 200:
                models_data = response.json()
                for model_info in models_data.get("data", []):
                    model_id = model_info.get("id")
                    if not model_id:
                        continue
                    # Special logic for NIKA (Qwen3-Next-80B)
                    if "nika" in model_name.lower() and "qwen3" in model_id.lower():
                        logger.info(f"✓ Resolved NIKA to Qwen3-Next-80B: {model_id}")
                        return model_id
                    if model_name.lower() in model_id.lower():
                        return model_id
        except Exception as exc:
            logger.warning(f"Failed to resolve model name '{model_name}': {exc}")
        return model_name

    def check_connection(self) -> bool:
        """Check the vLLM server connection"""
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            return response.status_code == 200
        except Exception:
            return False

    def generate_response(
        self,
        messages: List[Dict[str, str]],
        max_tokens: int = 512,
        temperature: float = 0.7,
        top_p: float = 0.9,
        top_k: Optional[int] = None,
        num_ctx: Optional[int] = None,
        stop: Optional[List[str]] = None,
        model_name: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate a response using the vLLM OpenAI-compatible API"""
        request_id = f"req_{int(time.time() * 1000)}"

        # Resolve the model name if provided
        if model_name:
            model_to_use = self._resolve_model_name(model_name)
        else:
            model_to_use = self.model_name

        logger.info(f"[{request_id}] Generating response via vLLM API... (model={model_to_use})")

        # Convert messages to the OpenAI-compatible format
        openai_messages = []
        for msg in messages:
            role = msg.get("role", "user")
            content = msg.get("content", "")
            openai_messages.append({"role": role, "content": content})

        # vLLM OpenAI-compatible API request
        api_url = f"{self.api_base}/chat/completions"

        # Stop sequence settings
        default_stop = ["\n\nUser:", "Observation:", "<|endoftext|>", "<|eot_id|>", "\n\n\n"]
        stop_sequences = stop if stop is not None else default_stop

        payload = {
            "model": model_to_use,
            "messages": openai_messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stop": stop_sequences,
            "stream": False
        }

        try:
            inference_start = time.time()
            response = requests.post(api_url, json=payload, timeout=300)
            inference_time = time.time() - inference_start

            if response.status_code == 200:
                result = response.json()
                choices = result.get("choices", [])
                if not choices:
                    raise RuntimeError("vLLM API returned empty choices")

                choice = choices[0]
                response_text = choice.get("message", {}).get("content", "")

                # Token usage information
                usage = result.get("usage", {})
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)
                total_tokens = usage.get("total_tokens", 0)

                # Check whether the response was truncated
                finish_reason = choice.get("finish_reason", "")
                was_truncated = finish_reason == "length" or completion_tokens >= (max_tokens * 0.95)

                logger.info(f"[{request_id}] [TIMING] vLLM API call: {inference_time*1000:.2f}ms")
                logger.info(f"[{request_id}] [STATS] Prompt: {prompt_tokens} tokens | Completion: {completion_tokens} tokens")

                return {
                    "response_text": response_text,
                    "was_truncated": was_truncated,
                    "eval_count": completion_tokens,
                    "max_tokens": max_tokens
                }
            else:
                error_msg = f"vLLM API error: {response.status_code} - {response.text}"
                logger.error(f"[{request_id}] {error_msg}")
                raise RuntimeError(error_msg)
        except requests.exceptions.RequestException as exc:
            error_msg = f"Request error: {exc}"
            logger.error(f"[{request_id}] {error_msg}")
            raise
```

### 3. API Endpoint Migration

**Before (Ollama)**:

```python
# Ollama endpoint
response = requests.post(
    f"{ollama_host}/api/chat",
    json={
        "model": "nika",
        "messages": [...],
        "options": {
            "num_predict": 512,
            "temperature": 0.7,
            "keep_alive": -1  # Try to keep the model loaded
        }
    }
)
```

**After (vLLM)**:

```python
# vLLM OpenAI-compatible endpoint
response = requests.post(
    f"{vllm_host}/v1/chat/completions",
    json={
        "model": "/root/.cache/huggingface/hub/models--dazipe--Qwen3-Next-80B...",
        "messages": [...],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9
        # No keep_alive needed - the model stays loaded permanently
    }
)
```

---

## 🔍 Troubleshooting

### Q: Why does `nvidia-smi` show "Failed to initialize NVML"?

**A**: This is often due to a "zombie" process holding the GB10 SoC. Run:

```bash
sudo fuser -v /dev/nvidia*
```

Kill any PIDs found. If the problem persists, a cold reboot of the DGX Spark is required to re-initialize the Blackwell firmware.

### Q: vLLM is extremely slow (1 token/sec).

**A**: You are likely in **Triton Interpreter Mode**. This happens if `nvcc` is not found during startup.
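One quick way to confirm whether the toolchain is actually visible to the Python process is a check like the following sketch; the `cuda_toolchain_visible` helper is our own, and the environment variables are the ones used throughout this guide:

```python
import os
import shutil

def cuda_toolchain_visible() -> bool:
    """Report the toolchain state Triton sees; False suggests interpreter-mode risk."""
    nvcc = shutil.which("nvcc")
    ptxas = os.environ.get("TRITON_PTXAS_PATH") or shutil.which("ptxas")
    interpret = os.environ.get("TRITON_INTERPRET", "0")
    print(f"nvcc:  {nvcc}")
    print(f"ptxas: {ptxas}")
    print(f"TRITON_INTERPRET={interpret}")
    return nvcc is not None and ptxas is not None and interpret != "1"
```

If this returns `False`, fix the `PATH` and `TRITON_PTXAS_PATH` exports before restarting the server.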
Ensure `cuda-toolkit-13-0` is installed and the `PATH` is correctly set:

```bash
which nvcc
# Should output: /usr/local/cuda-13.0/bin/nvcc
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
```

### Q: Model loading fails with "KeyError: 'model.layers.0'"

**A**: This is the weight key mismatch issue: the checkpoint has the `model.` prefix but vLLM expects keys without it. Solutions:

1. Use a checkpoint converter to strip the prefix
2. Patch vLLM's weight loader to handle both formats
3. Use a different checkpoint format (if available)

### Q: Shared expert layers are not loading

**A**: Ensure your vLLM version supports the Qwen3-Next MoE architecture. You may need to:

1. Update vLLM to the latest version
2. Apply custom patches for shared expert handling
3. Use the `--trust-remote-code` flag (already in the script)

### Q: The model still unloads (like Ollama)

**A**: This shouldn't happen with vLLM. If it does:

1. Check the vLLM server logs for memory pressure
2. Verify `--gpu-memory-utilization` is not set too high
3. Check for other processes using GPU memory
4. Ensure the vLLM server process is not being killed

---

## 📈 Comparison: Ollama vs TensorRT-LLM vs vLLM

| Feature | Ollama | TensorRT-LLM | vLLM |
| --- | --- | --- | --- |
| **80B Model Loading** | ❌ Intermittent unloading | ❌ Failed (MoE issues) | ✅ **Permanent VRAM reservation** |
| **Throughput** | Sequential | N/A (failed) | ✅ **PagedAttention parallelism** |
| **Blackwell Support** | Generic | ✅ Optimized | ✅ **sm_121 native acceleration** |
| **Architecture** | Dense-focused | ❌ MoE incomplete | ✅ **MoE optimized (shared experts)** |
| **Quantization** | Limited options | Good | ✅ **GPTQ-Int4A16 optimized** |
| **Memory Efficiency** | ~100GB+ VRAM | N/A | ✅ **~80GB VRAM (with quantization)** |
| **API Compatibility** | Custom | Custom | ✅ **OpenAI compatible** |
| **Stability** | ❌ Unloading issues | ❌ Failed | ✅ **Stable** |

---

## 📝 Key Learnings

### Why Ollama Failed

1. **Memory Management**: Ollama's design prioritizes flexibility over permanence
2. **Multi-Model Focus**: Optimized for switching between models, not keeping one loaded
3. **No Guarantee**: No reliable way to ensure an 80B model stays in VRAM

### Why TensorRT-LLM Failed

1. **MoE Support**: Incomplete implementation for Qwen3-Next's MoE architecture
2. **Shared Experts**: Not properly handled in TensorRT-LLM's weight loader
3. **Architecture Mismatch**: Designed primarily for dense models

### Why vLLM Succeeded

1. **Permanent Loading**: The model is loaded once and stays in VRAM permanently
2. **MoE Native**: Full support for Qwen3-Next's MoE architecture
3. **OpenAI API**: An easy migration path from existing code
4. **Active Development**: Regular updates and Qwen3-Next support
5. **Blackwell Optimized**: Native support for the sm_121 architecture

---

## 🎯 Migration Checklist

### Pre-Migration

- [ ] Identify all Ollama API calls in your codebase
- [ ] Document the current model loading/unloading behavior
- [ ] Verify the vLLM server can be installed and run
- [ ] Test vLLM with a smaller model first

### Migration Steps

1. **Install vLLM**:

   ```bash
   pip install vllm
   ```

2. **Start the vLLM Server**:

   ```bash
   bash bin/qwen3_next_80b_gptq.sh
   ```

3. **Update the Service Class**:
   - Replace `OllamaService` with `VLLMService`
   - Update API endpoints from `/api/chat` to `/v1/chat/completions`
   - Update the JSON payload format

4. **Update Configuration**:
   - Change `OLLAMA_HOST` to `VLLM_HOST`
   - Update the model name resolution logic

5. **Test Thoroughly**:
   - Verify the model stays loaded
   - Test response generation
   - Monitor memory usage
   - Check for any unloading issues

### Post-Migration

- [ ] Monitor for 24-48 hours to ensure stability
- [ ] Verify no model unloading occurs
- [ ] Check performance metrics
- [ ] Update documentation

---

## 📚 References

### vLLM Documentation

- [vLLM Official Documentation](https://docs.vllm.ai/)
- [OpenAI Compatible API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

### Qwen3-Next Model

- [Qwen3-Next-80B Model Card](https://huggingface.co/dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16)

### Blackwell Architecture

- [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/blackwell/)

---

## 🎉 Success Metrics

After migrating to vLLM:

- ✅ **Zero model unloading incidents** (previously a daily occurrence)
- ✅ **Stable service** (no more "model not loaded" errors)
- ✅ **Better throughput** (PagedAttention parallelism)
- ✅ **Lower latency** (no model reload overhead)
- ✅ **Simpler architecture** (no keep-alive logic needed)

---

**Last Updated**: 2025-12-31
**Author**: [Jong-Seong Kim (김종성)](https://huggingface.co/dazipe)
**Project**: LANIKA / NIKA AI
**Hardware**: NVIDIA DGX Spark (GB10) - Single System Deployment