
vLLM Migration Guide: Qwen3-Next-80B on NVIDIA Blackwell (GB10)

From Ollama Unloading Issues to vLLM Success

πŸ“‹ Overview

This guide documents our journey deploying Qwen3-Next-80B on the NVIDIA DGX Spark (GB10) - from encountering critical model unloading issues with Ollama, through failed TensorRT attempts, to finally achieving stable deployment with vLLM.

The Problem: On a single DGX Spark (GB10) system, Ollama would intermittently unload the 80B model, causing service disruptions and requiring constant monitoring. TensorRT-LLM attempts also failed due to compatibility issues with Qwen3-Next's MoE architecture.

The Solution: vLLM with an OpenAI-compatible API, providing permanent VRAM residency, PagedAttention-based throughput, and native Blackwell (sm_121) acceleration.


🚨 The Journey: From Ollama to vLLM

Phase 1: The Ollama Unloading Problem

Initial Setup: We deployed Qwen3-Next-80B using Ollama on DGX Spark (GB10) with 120GB VRAM.

The Issue:

  • Model would unload unexpectedly during idle periods
  • Service would fail when requests arrived after model unload
  • Required constant "keep-alive" requests to prevent unloading
  • No reliable way to ensure model permanence in VRAM

Symptoms:

[Ollama] Model "nika" unloaded from memory
[Service] Request failed: Model not loaded
[Service] Attempting to reload model...
[Service] Reload takes 30-60 seconds

Root Cause: Ollama's memory management is designed for multi-model scenarios and doesn't guarantee permanent model retention, especially for large 80B models.
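The workaround at the time was a background keep-alive loop, sketched below. This is a hypothetical helper, not our production code; the host URL and ping interval are assumptions (Ollama's default port, a few minutes between pings):

```python
import time

import requests

OLLAMA_HOST = "http://localhost:11434"  # assumption: default Ollama port


def keepalive_payload(model: str) -> dict:
    """Minimal /api/generate payload asking Ollama to keep the model resident.

    keep_alive=-1 requests indefinite retention, but as described above,
    Ollama does not guarantee it for a model this large.
    """
    return {"model": model, "prompt": "", "keep_alive": -1}


def keepalive_loop(model: str, interval_s: int = 240) -> None:
    """Ping Ollama every few minutes so the model is less likely to be evicted."""
    while True:
        try:
            requests.post(
                f"{OLLAMA_HOST}/api/generate",
                json=keepalive_payload(model),
                timeout=30,
            )
        except requests.RequestException:
            pass  # server may be restarting; try again next cycle
        time.sleep(interval_s)
```

Running a loop like this is exactly the kind of babysitting vLLM removes.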


Phase 2: TensorRT-LLM Attempt

Why We Tried TensorRT-LLM:

  • Promised better performance on Blackwell architecture
  • Lower latency potential
  • Better memory efficiency

The Failure:

  • Qwen3-Next's MoE (Mixture of Experts) architecture not fully supported
  • Shared expert layers caused loading failures
  • Weight key mapping issues (model. prefix mismatch)
  • Incomplete MoE routing implementation

Error Messages:

[TensorRT-LLM] Error: Shared expert layers not recognized
[TensorRT-LLM] Error: Weight key mismatch: model.layers.0 vs layers.0
[TensorRT-LLM] Error: MoE routing not implemented for Qwen3-Next

Decision: Abandoned TensorRT-LLM due to architectural incompatibility with Qwen3-Next's MoE structure.


Phase 3: vLLM Success

Why vLLM Worked:

  • Native support for MoE architectures
  • Permanent VRAM reservation (model stays loaded)
  • OpenAI-compatible API (easy migration)
  • PagedAttention for high throughput
  • Active development and Qwen3-Next support

The Solution:

  • vLLM server runs as a separate service
  • Model loaded once at server start, stays in VRAM permanently
  • Application connects via OpenAI-compatible API
  • No more unloading issues

πŸ—οΈ Qwen3-Next-80B MoE Architecture

Understanding the architecture is crucial for deployment:

Architecture Overview

Qwen3-Next-80B uses a sophisticated Mixture of Experts (MoE) architecture:

  • Total Parameters: ~80B
  • Active Parameters (A3B): Only ~3B parameters are active per token
  • Shared Experts: Hybrid approach with both "routed experts" and "shared experts"
  • Benefits: Fast inference for its size while maintaining global knowledge

Why This Matters

  1. Memory Efficiency: Only ~3B parameters are active per token, but the full 80B model must stay in VRAM
  2. Shared Experts: Maintain global knowledge across all tokens
  3. Weight Mapping: Special handling required for shared expert layers
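The routed-plus-shared design can be sketched with toy weights. This is an illustration only: real Qwen3-Next experts are gated MLPs with learned routers and normalization, and the sizes below are made up for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2  # toy sizes; the real model is far larger

# Toy experts: each "expert" is a single linear map here.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
shared_expert = rng.standard_normal((d, d))
router = rng.standard_normal((d, n_experts))


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a token through its top-k experts plus the always-on shared expert."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                        # top-k expert indices
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return routed + x @ shared_expert                        # shared expert sees every token


y = moe_forward(rng.standard_normal(d))
```

Only `top_k` routed experts do work per token (the "A3B" active-parameter budget), while the shared expert contributes to every token, which is why its weights need special handling at load time.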

πŸ› οΈ Critical Technical Challenges & Solutions

1. The Weight Key Mismatch (The model. Prefix)

Problem: Hugging Face checkpoints save weights with a model. prefix (e.g., model.layers.10...), while vLLM's internal Qwen2/3 implementation expects keys to start directly with layers.10....

Solution: vLLM's weight loader handles this automatically, but you may need to ensure checkpoint format compatibility.

Example:

# Weight key transformation needed
# Before: "model.layers.10.attention.q_proj.weight"
# After:  "layers.10.attention.q_proj.weight"
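If you do need to convert a checkpoint yourself, the rename is a simple key rewrite. This is a hypothetical helper operating on a plain dict; a real conversion would load and re-save tensors via safetensors or torch:

```python
def strip_model_prefix(state_dict: dict) -> dict:
    """Rename 'model.layers.*' keys to 'layers.*' to match the expected format."""
    prefix = "model."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with placeholder values in place of tensors.
ckpt = {
    "model.layers.10.attention.q_proj.weight": "W",
    "lm_head.weight": "H",  # keys without the prefix pass through unchanged
}
fixed = strip_model_prefix(ckpt)
```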

2. Shared Expert Mapping

Problem: Standard loaders may fail to recognize shared_expert layers.

Solution: vLLM's AutoWeightsLoader correctly handles Qwen3-Next's MoE structure, including shared experts.

Example:

# Shared expert layers in Qwen3-Next
shared_expert_layers = [
    "mlp.shared_expert.gate_proj.weight",
    "mlp.shared_expert.up_proj.weight",
    "mlp.shared_expert.down_proj.weight"
]

3. Model Name Resolution

Problem: Application uses friendly names like "nika", but vLLM needs actual model paths.

Solution: Implement model name resolution in VLLMService.


⚑ Blackwell (GB10) Specific Optimizations

The DGX Spark (GB10) is the first hardware to support NVFP4. To maximize performance for an 80B model:

1. Quantization Strategy: GPTQ-Int4A16

Using GPTQ Int4 weight-only quantization allows the 80B model to fit comfortably in 120GB VRAM:

  • Model weights: ~40GB (quantized)
  • KV Cache: ~40GB (for high throughput)
  • System overhead: ~40GB
  • Total: ~120GB (perfect fit for GB10)

2. Critical Environment Flags

For Blackwell (sm_121), these environment variables are mandatory:

# Force the Triton compiler to find the correct Blackwell ptxas
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_USE_FLASHINFER_SAMPLER=1  # Enable Blackwell-optimized kernels
export VLLM_USE_FLASHINFER_MOE=0       # Temporary workaround for MoE kernels if needed
export TRITON_INTERPRET=0              # Disable interpreter mode (critical for performance)
export VLLM_WORKER_MULTIPROC_METHOD=spawn

πŸ’» Implementation Guide

1. vLLM Server Setup

File: bin/qwen3_next_80b_gptq.sh

#!/bin/bash
# Optimized for Qwen3-Next-80B-GPTQ-Int4 on DGX Spark (Blackwell)

# 1. CUDA 13.0 Pathing (Must include nvcc for JIT)
export CUDA_HOME=/usr/local/cuda-13.0
export CUDA_PATH=/usr/local/cuda-13.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=$CUDA_HOME/include
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include

# 2. Prevent Triton 'resize_()' errors by disabling Interpreter mode
export TRITON_INTERPRET=0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# 3. Blackwell-specific optimizations
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE=0  # Disable if MoE kernels cause issues
export TRITON_DISABLE_LINE_INFO=1
export TORCH_COMPILE_DEBUG=0
export TORCHDYNAMO_DISABLE=1

# 4. Launch with Blackwell-specific tuning
# --enforce-eager bypasses CUDA graph capture overhead on new sm_121 arch
python3 -m vllm.entrypoints.openai.api_server \
    --model dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --host 0.0.0.0 \
    --port 1107

Key Parameters:

  • --gpu-memory-utilization 0.92: Leaves room for KV cache (critical for 80B model)
  • --enforce-eager: Better stability on Blackwell (bypasses CUDA graph)
  • --max-model-len 8192: Adjust based on your use case

2. Application Service Class

File: api/services/vllm_service.py

import logging
import requests
import time
import threading
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)

class VLLMService:
    """vLLM API Service (OpenAI Compatible)"""
    
    def __init__(self, host: str, model_name: str, embedding_model: str):
        self.host = host.rstrip('/')
        self.embedding_model = embedding_model
        self.api_base = f"{self.host}/v1"
        self._warmed_models = set()
        self._warm_up_lock = threading.Lock()
        
        # Convert model name to actual vLLM model path
        self.model_name = self._resolve_model_name(model_name)
    
    def _resolve_model_name(self, model_name: str) -> str:
        """
        Convert model name to actual vLLM model path
        - "nika" -> Actual model path available on vLLM server
        - If already a full path, return as-is
        """
        if "/" in model_name:  # Already a full path
            return model_name
        
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            if response.status_code == 200:
                models_data = response.json()
                for model_info in models_data.get("data", []):
                    model_id = model_info.get("id") or ""
                    # Special logic for NIKA (Qwen3-Next-80B)
                    if "nika" in model_name.lower() and "qwen3" in model_id.lower():
                        logger.info(f"βœ“ Resolved NIKA to Qwen3-Next-80B: {model_id}")
                        return model_id
                    if model_id and model_name.lower() in model_id.lower():
                        return model_id
        except Exception as exc:
            logger.warning(f"Failed to resolve model name '{model_name}': {exc}")
        
        return model_name
    
    def check_connection(self) -> bool:
        """Check vLLM server connection"""
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            return response.status_code == 200
        except Exception:
            return False
    
    def generate_response(
        self,
        messages: List[Dict[str, str]],
        max_tokens: int = 512,
        temperature: float = 0.7,
        top_p: float = 0.9,
        top_k: Optional[int] = None,
        num_ctx: Optional[int] = None,
        stop: Optional[List[str]] = None,
        model_name: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate response using vLLM OpenAI-compatible API"""
        request_id = f"req_{int(time.time() * 1000)}"
        # Resolve model name if provided
        if model_name:
            model_to_use = self._resolve_model_name(model_name)
        else:
            model_to_use = self.model_name
        
        logger.info(f"[{request_id}] Generating response via vLLM API... (model={model_to_use})")
        
        # Convert messages to OpenAI-compatible format
        openai_messages = []
        for msg in messages:
            role = msg.get("role", "user")
            content = msg.get("content", "")
            openai_messages.append({"role": role, "content": content})
        
        # vLLM OpenAI-compatible API request
        api_url = f"{self.api_base}/chat/completions"
        
        # Stop sequence settings
        default_stop = ["\n\nUser:", "Observation:", "<|endoftext|>", "<|eot_id|>", "\n\n\n"]
        stop_sequences = stop if stop is not None else default_stop
        
        payload = {
            "model": model_to_use,
            "messages": openai_messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stop": stop_sequences,
            "stream": False
        }
        
        try:
            inference_start = time.time()
            response = requests.post(api_url, json=payload, timeout=300)
            inference_time = time.time() - inference_start
            
            if response.status_code == 200:
                result = response.json()
                choices = result.get("choices", [])
                if not choices:
                    raise RuntimeError("vLLM API returned empty choices")
                
                choice = choices[0]
                response_text = choice.get("message", {}).get("content", "")
                
                # Token usage information
                usage = result.get("usage", {})
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)
                total_tokens = usage.get("total_tokens", 0)
                
                # Check if response was truncated
                finish_reason = choice.get("finish_reason", "")
                was_truncated = finish_reason == "length" or completion_tokens >= (max_tokens * 0.95)
                
                logger.info(f"[{request_id}] [TIMING] vLLM API call: {inference_time*1000:.2f}ms")
                logger.info(f"[{request_id}] [STATS] Prompt: {prompt_tokens} tokens | Completion: {completion_tokens} tokens")
                
                return {
                    "response_text": response_text,
                    "was_truncated": was_truncated,
                    "eval_count": completion_tokens,
                    "max_tokens": max_tokens
                }
            else:
                error_msg = f"vLLM API error: {response.status_code} - {response.text}"
                logger.error(f"[{request_id}] {error_msg}")
                raise RuntimeError(error_msg)
        except requests.exceptions.RequestException as exc:
            error_msg = f"Request error: {exc}"
            logger.error(f"[{request_id}] {error_msg}")
            raise

3. API Endpoint Migration

Before (Ollama):

# Ollama endpoint
response = requests.post(
    f"{ollama_host}/api/chat",
    json={
        "model": "nika",
        "messages": [...],
        "options": {
            "num_predict": 512,
            "temperature": 0.7,
            "keep_alive": -1  # Try to keep model loaded
        }
    }
)

After (vLLM):

# vLLM OpenAI-compatible endpoint
response = requests.post(
    f"{vllm_host}/v1/chat/completions",
    json={
        "model": "/root/.cache/huggingface/hub/models--dazipe--Qwen3-Next-80B...",
        "messages": [...],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9
        # No keep_alive needed - model stays loaded permanently
    }
)
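Most call sites can be migrated mechanically; a small translation table covers the common Ollama options. This is a hypothetical helper — extend the mapping for any options your code actually uses:

```python
# Rough mapping from Ollama "options" fields to OpenAI-style request fields.
OLLAMA_TO_OPENAI = {
    "num_predict": "max_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
    "stop": "stop",
    # "keep_alive" has no equivalent: vLLM keeps the model loaded permanently.
}


def convert_options(options: dict) -> dict:
    """Translate an Ollama options dict into OpenAI-compatible request fields."""
    return {
        OLLAMA_TO_OPENAI[key]: value
        for key, value in options.items()
        if key in OLLAMA_TO_OPENAI
    }
```

Unmapped keys (like `keep_alive`) are silently dropped, which is the desired behavior here.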

πŸ” Troubleshooting

Q: Why does nvidia-smi show "Failed to initialize NVML"?

A: This is often due to a "zombie" process holding the GB10 SoC. Run:

sudo fuser -v /dev/nvidia*

Kill the PIDs if found. If it persists, a cold reboot of the DGX Spark is required to re-initialize the Blackwell firmware.

Q: vLLM is extremely slow (1 token/sec).

A: You are likely in Triton Interpreter Mode. This happens if nvcc is not found during startup. Ensure cuda-toolkit-13-0 is installed and the PATH is correctly set:

which nvcc  # Should output: /usr/local/cuda-13.0/bin/nvcc
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas

Q: Model loading fails with "KeyError: 'model.layers.0'"

A: This is the weight key mismatch issue. The checkpoint has model. prefix but vLLM expects keys without it. Solutions:

  1. Use a checkpoint converter to strip the prefix
  2. Patch vLLM's weight loader to handle both formats
  3. Use a different checkpoint format (if available)

Q: Shared expert layers not loading

A: Ensure your vLLM version supports Qwen3-Next MoE architecture. You may need to:

  1. Update vLLM to the latest version
  2. Apply custom patches for shared expert handling
  3. Use --trust-remote-code flag (already in script)

Q: Model still unloads (like Ollama)

A: This shouldn't happen with vLLM. If it does:

  1. Check vLLM server logs for memory pressure
  2. Verify --gpu-memory-utilization is not too high
  3. Check for other processes using GPU memory
  4. Ensure vLLM server process is not being killed

πŸ“ˆ Comparison: Ollama vs TensorRT vs vLLM

| Feature | Ollama | TensorRT-LLM | vLLM |
|---|---|---|---|
| 80B model loading | ❌ Intermittent unloading | ❌ Failed (MoE issues) | βœ… Permanent VRAM reservation |
| Throughput | Sequential | N/A (failed) | βœ… PagedAttention parallelism |
| Blackwell support | Generic | βœ… Optimized | βœ… sm_121 native acceleration |
| Architecture | Dense-focused | ❌ MoE incomplete | βœ… MoE optimized (shared experts) |
| Quantization | Limited options | Good | βœ… GPTQ-Int4A16 optimized |
| Memory efficiency | ~100GB+ VRAM | N/A | βœ… ~80GB VRAM (with quantization) |
| API compatibility | Custom | Custom | βœ… OpenAI-compatible |
| Stability | ❌ Unloading issues | ❌ Failed | βœ… Stable |

πŸ“ Key Learnings

Why Ollama Failed

  1. Memory Management: Ollama's design prioritizes flexibility over permanence
  2. Multi-Model Focus: Optimized for switching between models, not keeping one loaded
  3. No Guarantee: No reliable way to ensure 80B model stays in VRAM

Why TensorRT-LLM Failed

  1. MoE Support: Incomplete implementation for Qwen3-Next's MoE architecture
  2. Shared Experts: Not properly handled in TensorRT-LLM's weight loader
  3. Architecture Mismatch: TensorRT-LLM is designed primarily for dense models

Why vLLM Succeeded

  1. Permanent Loading: Model loaded once, stays in VRAM permanently
  2. MoE Native: Full support for Qwen3-Next's MoE architecture
  3. OpenAI API: Easy migration path from existing code
  4. Active Development: Regular updates and Qwen3-Next support
  5. Blackwell Optimized: Native support for sm_121 architecture

🎯 Migration Checklist

Pre-Migration

  • Identify all Ollama API calls in your codebase
  • Document current model loading/unloading behavior
  • Verify vLLM server can be installed and run
  • Test vLLM with a smaller model first

Migration Steps

  1. Install vLLM:

    pip install vllm
    
  2. Start vLLM Server:

    bash bin/qwen3_next_80b_gptq.sh
    
  3. Update Service Class:

    • Replace OllamaService with VLLMService
    • Update API endpoints from /api/chat to /v1/chat/completions
    • Update JSON payload format
  4. Update Configuration:

    • Change OLLAMA_HOST to VLLM_HOST
    • Update model name resolution logic
  5. Test Thoroughly:

    • Verify model stays loaded
    • Test response generation
    • Monitor memory usage
    • Check for any unloading issues

Post-Migration

  • Monitor for 24-48 hours to ensure stability
  • Verify no model unloading occurs
  • Check performance metrics
  • Update documentation



πŸŽ‰ Success Metrics

After migrating to vLLM:

  • βœ… Zero model unloading incidents (previously daily occurrences)
  • βœ… Stable service (no more "model not loaded" errors)
  • βœ… Better throughput (PagedAttention parallelism)
  • βœ… Lower latency (no model reload overhead)
  • βœ… Simpler architecture (no keep-alive logic needed)

Last Updated: 2025-12-31
Author: Jong-Seong Kim (κΉ€μ’…μ„±)
Project: LANIKA / NIKA AI
Hardware: NVIDIA DGX Spark (GB10) - Single System Deployment