
vLLM Migration Guide: Qwen3-Next-80B on NVIDIA Blackwell (GB10)

From Ollama Unloading Issues to vLLM Success

πŸ“‹ Overview

This guide documents our journey deploying Qwen3-Next-80B on the NVIDIA DGX Spark (GB10) - from encountering critical model unloading issues with Ollama, through failed TensorRT attempts, to finally achieving stable deployment with vLLM.

The Problem: On a single DGX Spark (GB10) system, Ollama would intermittently unload the 80B model, causing service disruptions and requiring constant monitoring. TensorRT-LLM attempts also failed due to compatibility issues with Qwen3-Next's MoE architecture.

The Solution: vLLM with an OpenAI-compatible API, providing permanent VRAM residency, PagedAttention-based throughput, and native Blackwell (sm_121) acceleration.


🚨 The Journey: From Ollama to vLLM

Phase 1: The Ollama Unloading Problem

Initial Setup: We deployed Qwen3-Next-80B using Ollama on DGX Spark (GB10) with 120GB VRAM.

The Issue:

  • Model would unload unexpectedly during idle periods
  • Service would fail when requests arrived after model unload
  • Required constant "keep-alive" requests to prevent unloading
  • No reliable way to ensure model permanence in VRAM

Symptoms:

[Ollama] Model "nika" unloaded from memory
[Service] Request failed: Model not loaded
[Service] Attempting to reload model...
[Service] Reload takes 30-60 seconds

Root Cause: Ollama's memory management is designed for multi-model scenarios and doesn't guarantee permanent model retention, especially for large 80B models.
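The workaround at the time was a background keep-alive loop, sketched below. This is a hypothetical helper, not our production code; the host URL and ping interval are assumptions (Ollama's default port, a few minutes between pings):

```python
import time

import requests

OLLAMA_HOST = "http://localhost:11434"  # assumption: default Ollama port


def keepalive_payload(model: str) -> dict:
    """Minimal /api/generate payload asking Ollama to keep the model resident.

    keep_alive=-1 requests indefinite retention, but as described above,
    Ollama does not guarantee it for a model this large.
    """
    return {"model": model, "prompt": "", "keep_alive": -1}


def keepalive_loop(model: str, interval_s: int = 240) -> None:
    """Ping Ollama every few minutes so the model is less likely to be evicted."""
    while True:
        try:
            requests.post(
                f"{OLLAMA_HOST}/api/generate",
                json=keepalive_payload(model),
                timeout=30,
            )
        except requests.RequestException:
            pass  # server may be restarting; try again next cycle
        time.sleep(interval_s)
```

Running a loop like this is exactly the kind of babysitting vLLM removes.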


Phase 2: TensorRT-LLM Attempt

Why We Tried TensorRT-LLM:

  • Promised better performance on Blackwell architecture
  • Lower latency potential
  • Better memory efficiency

The Failure:

  • Qwen3-Next's MoE (Mixture of Experts) architecture not fully supported
  • Shared expert layers caused loading failures
  • Weight key mapping issues (model. prefix mismatch)
  • Incomplete MoE routing implementation

Error Messages:

[TensorRT-LLM] Error: Shared expert layers not recognized
[TensorRT-LLM] Error: Weight key mismatch: model.layers.0 vs layers.0
[TensorRT-LLM] Error: MoE routing not implemented for Qwen3-Next

Decision: Abandoned TensorRT-LLM due to architectural incompatibility with Qwen3-Next's MoE structure.


Phase 3: vLLM Success

Why vLLM Worked:

  • Native support for MoE architectures
  • Permanent VRAM reservation (model stays loaded)
  • OpenAI-compatible API (easy migration)
  • PagedAttention for high throughput
  • Active development and Qwen3-Next support

The Solution:

  • vLLM server runs as a separate service
  • Model loaded once at server start, stays in VRAM permanently
  • Application connects via OpenAI-compatible API
  • No more unloading issues

πŸ—οΈ Qwen3-Next-80B MoE Architecture

Understanding the architecture is crucial for deployment:

Architecture Overview

Qwen3-Next-80B uses a sophisticated Mixture of Experts (MoE) architecture:

  • Total Parameters: ~80B
  • Active Parameters (A3B): Only ~3B parameters are active per token
  • Shared Experts: Hybrid approach with both "routed experts" and "shared experts"
  • Benefits: Fast inference for its size while maintaining global knowledge

Why This Matters

  1. Memory Efficiency: Only ~3B parameters are active per token, but the full 80B model must stay in VRAM
  2. Shared Experts: Maintain global knowledge across all tokens
  3. Weight Mapping: Special handling required for shared expert layers
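The routed-plus-shared design can be sketched with toy weights. This is an illustration only: real Qwen3-Next experts are gated MLPs with learned routers and normalization, and the sizes below are made up for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2  # toy sizes; the real model is far larger

# Toy experts: each "expert" is a single linear map here.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
shared_expert = rng.standard_normal((d, d))
router = rng.standard_normal((d, n_experts))


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a token through its top-k experts plus the always-on shared expert."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                        # top-k expert indices
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return routed + x @ shared_expert                        # shared expert sees every token


y = moe_forward(rng.standard_normal(d))
```

Only `top_k` routed experts do work per token (the "A3B" active-parameter budget), while the shared expert contributes to every token, which is why its weights need special handling at load time.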

πŸ› οΈ Critical Technical Challenges & Solutions

1. The Weight Key Mismatch (The model. Prefix)

Problem: Hugging Face checkpoints save weights with a model. prefix (e.g., model.layers.10...), while vLLM's internal Qwen2/3 implementation expects keys to start directly with layers.10....

Solution: vLLM's weight loader handles this automatically, but you may need to ensure checkpoint format compatibility.

Example:

# Weight key transformation needed
# Before: "model.layers.10.attention.q_proj.weight"
# After:  "layers.10.attention.q_proj.weight"
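If you do need to convert a checkpoint yourself, the rename is a simple key rewrite. This is a hypothetical helper operating on a plain dict; a real conversion would load and re-save tensors via safetensors or torch:

```python
def strip_model_prefix(state_dict: dict) -> dict:
    """Rename 'model.layers.*' keys to 'layers.*' to match the expected format."""
    prefix = "model."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with placeholder values in place of tensors.
ckpt = {
    "model.layers.10.attention.q_proj.weight": "W",
    "lm_head.weight": "H",  # keys without the prefix pass through unchanged
}
fixed = strip_model_prefix(ckpt)
```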

2. Shared Expert Mapping

Problem: Standard loaders may fail to recognize shared_expert layers.

Solution: vLLM's AutoWeightsLoader correctly handles Qwen3-Next's MoE structure, including shared experts.

Example:

# Shared expert layers in Qwen3-Next
shared_expert_layers = [
    "mlp.shared_expert.gate_proj.weight",
    "mlp.shared_expert.up_proj.weight",
    "mlp.shared_expert.down_proj.weight"
]

3. Model Name Resolution

Problem: Application uses friendly names like "nika", but vLLM needs actual model paths.

Solution: Implement model name resolution in VLLMService.


⚑ Blackwell (GB10) Specific Optimizations

The DGX Spark (GB10) is the first hardware to support NVFP4. To maximize performance for an 80B model:

1. Quantization Strategy: GPTQ-Int4A16

Using GPTQ Int4 weight-only quantization allows the 80B model to fit comfortably in 120GB VRAM:

  • Model weights: ~40GB (quantized)
  • KV Cache: ~40GB (for high throughput)
  • System overhead: ~40GB
  • Total: ~120GB (perfect fit for GB10)

2. Critical Environment Flags

For Blackwell (sm_121), these environment variables are mandatory:

# Force the Triton compiler to find the correct Blackwell ptxas
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_USE_FLASHINFER_SAMPLER=1  # Enable Blackwell-optimized kernels
export VLLM_USE_FLASHINFER_MOE=0       # Temporary workaround for MoE kernels if needed
export TRITON_INTERPRET=0              # Disable interpreter mode (critical for performance)
export VLLM_WORKER_MULTIPROC_METHOD=spawn

πŸ’» Implementation Guide

1. vLLM Server Setup

File: bin/qwen3_next_80b_gptq.sh

#!/bin/bash
# Optimized for Qwen3-Next-80B-GPTQ-Int4 on DGX Spark (Blackwell)

# 1. CUDA 13.0 Pathing (Must include nvcc for JIT)
export CUDA_HOME=/usr/local/cuda-13.0
export CUDA_PATH=/usr/local/cuda-13.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=$CUDA_HOME/include
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include

# 2. Prevent Triton 'resize_()' errors by disabling Interpreter mode
export TRITON_INTERPRET=0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# 3. Blackwell-specific optimizations
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE=0  # Disable if MoE kernels cause issues
export TRITON_DISABLE_LINE_INFO=1
export TORCH_COMPILE_DEBUG=0
export TORCHDYNAMO_DISABLE=1

# 4. Launch with Blackwell-specific tuning
# --enforce-eager bypasses CUDA graph capture overhead on new sm_121 arch
python3 -m vllm.entrypoints.openai.api_server \
    --model dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --host 0.0.0.0 \
    --port 1107

Key Parameters:

  • --gpu-memory-utilization 0.92: Leaves room for KV cache (critical for 80B model)
  • --enforce-eager: Better stability on Blackwell (bypasses CUDA graph)
  • --max-model-len 8192: Adjust based on your use case

2. Application Service Class

File: api/services/vllm_service.py

import logging
import requests
import time
import threading
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)

class VLLMService:
    """vLLM API Service (OpenAI Compatible)"""
    
    def __init__(self, host: str, model_name: str, embedding_model: str):
        self.host = host.rstrip('/')
        self.embedding_model = embedding_model
        self.api_base = f"{self.host}/v1"
        self._warmed_models = set()
        self._warm_up_lock = threading.Lock()
        
        # Convert model name to actual vLLM model path
        self.model_name = self._resolve_model_name(model_name)
    
    def _resolve_model_name(self, model_name: str) -> str:
        """
        Convert model name to actual vLLM model path
        - "nika" -> Actual model path available on vLLM server
        - If already a full path, return as-is
        """
        if "/" in model_name:  # Already a full path
            return model_name
        
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            if response.status_code == 200:
                models_data = response.json()
                for model_info in models_data.get("data", []):
                    model_id = model_info.get("id") or ""
                    # Special logic for NIKA (Qwen3-Next-80B)
                    if "nika" in model_name.lower() and "qwen3" in model_id.lower():
                        logger.info(f"βœ“ Resolved NIKA to Qwen3-Next-80B: {model_id}")
                        return model_id
                    if model_id and model_name.lower() in model_id.lower():
                        return model_id
        except Exception as exc:
            logger.warning(f"Failed to resolve model name '{model_name}': {exc}")
        
        return model_name
    
    def check_connection(self) -> bool:
        """Check vLLM server connection"""
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            return response.status_code == 200
        except Exception:
            return False
    
    def generate_response(
        self,
        messages: List[Dict[str, str]],
        max_tokens: int = 512,
        temperature: float = 0.7,
        top_p: float = 0.9,
        top_k: Optional[int] = None,
        num_ctx: Optional[int] = None,
        stop: Optional[List[str]] = None,
        model_name: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate response using vLLM OpenAI-compatible API"""
        request_id = f"req_{int(time.time() * 1000)}"
        # Resolve model name if provided
        if model_name:
            model_to_use = self._resolve_model_name(model_name)
        else:
            model_to_use = self.model_name
        
        logger.info(f"[{request_id}] Generating response via vLLM API... (model={model_to_use})")
        
        # Convert messages to OpenAI-compatible format
        openai_messages = []
        for msg in messages:
            role = msg.get("role", "user")
            content = msg.get("content", "")
            openai_messages.append({"role": role, "content": content})
        
        # vLLM OpenAI-compatible API request
        api_url = f"{self.api_base}/chat/completions"
        
        # Stop sequence settings
        default_stop = ["\n\nUser:", "Observation:", "<|endoftext|>", "<|eot_id|>", "\n\n\n"]
        stop_sequences = stop if stop is not None else default_stop
        
        payload = {
            "model": model_to_use,
            "messages": openai_messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stop": stop_sequences,
            "stream": False
        }
        
        try:
            inference_start = time.time()
            response = requests.post(api_url, json=payload, timeout=300)
            inference_time = time.time() - inference_start
            
            if response.status_code == 200:
                result = response.json()
                choices = result.get("choices", [])
                if not choices:
                    raise RuntimeError("vLLM API returned empty choices")
                
                choice = choices[0]
                response_text = choice.get("message", {}).get("content", "")
                
                # Token usage information
                usage = result.get("usage", {})
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)
                total_tokens = usage.get("total_tokens", 0)
                
                # Check if response was truncated
                finish_reason = choice.get("finish_reason", "")
                was_truncated = finish_reason == "length" or completion_tokens >= (max_tokens * 0.95)
                
                logger.info(f"[{request_id}] [TIMING] vLLM API call: {inference_time*1000:.2f}ms")
                logger.info(f"[{request_id}] [STATS] Prompt: {prompt_tokens} tokens | Completion: {completion_tokens} tokens")
                
                return {
                    "response_text": response_text,
                    "was_truncated": was_truncated,
                    "eval_count": completion_tokens,
                    "max_tokens": max_tokens
                }
            else:
                error_msg = f"vLLM API error: {response.status_code} - {response.text}"
                logger.error(f"[{request_id}] {error_msg}")
                raise RuntimeError(error_msg)
        except requests.exceptions.RequestException as exc:
            error_msg = f"Request error: {exc}"
            logger.error(f"[{request_id}] {error_msg}")
            raise

3. API Endpoint Migration

Before (Ollama):

# Ollama endpoint
response = requests.post(
    f"{ollama_host}/api/chat",
    json={
        "model": "nika",
        "messages": [...],
        "options": {
            "num_predict": 512,
            "temperature": 0.7,
            "keep_alive": -1  # Try to keep model loaded
        }
    }
)

After (vLLM):

# vLLM OpenAI-compatible endpoint
response = requests.post(
    f"{vllm_host}/v1/chat/completions",
    json={
        "model": "/root/.cache/huggingface/hub/models--dazipe--Qwen3-Next-80B...",
        "messages": [...],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9
        # No keep_alive needed - model stays loaded permanently
    }
)
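Most call sites can be migrated mechanically; a small translation table covers the common Ollama options. This is a hypothetical helper — extend the mapping for any options your code actually uses:

```python
# Rough mapping from Ollama "options" fields to OpenAI-style request fields.
OLLAMA_TO_OPENAI = {
    "num_predict": "max_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
    "stop": "stop",
    # "keep_alive" has no equivalent: vLLM keeps the model loaded permanently.
}


def convert_options(options: dict) -> dict:
    """Translate an Ollama options dict into OpenAI-compatible request fields."""
    return {
        OLLAMA_TO_OPENAI[key]: value
        for key, value in options.items()
        if key in OLLAMA_TO_OPENAI
    }
```

Unmapped keys (like `keep_alive`) are silently dropped, which is the desired behavior here.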

πŸ” Troubleshooting

Q: Why does nvidia-smi show "Failed to initialize NVML"?

A: This is often due to a "zombie" process holding the GB10 SoC. Run:

sudo fuser -v /dev/nvidia*

Kill the PIDs if found. If it persists, a cold reboot of the DGX Spark is required to re-initialize the Blackwell firmware.

Q: vLLM is extremely slow (1 token/sec).

A: You are likely in Triton Interpreter Mode. This happens if nvcc is not found during startup. Ensure cuda-toolkit-13-0 is installed and the PATH is correctly set:

which nvcc  # Should output: /usr/local/cuda-13.0/bin/nvcc
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas

Q: Model loading fails with "KeyError: 'model.layers.0'"

A: This is the weight key mismatch issue. The checkpoint has model. prefix but vLLM expects keys without it. Solutions:

  1. Use a checkpoint converter to strip the prefix
  2. Patch vLLM's weight loader to handle both formats
  3. Use a different checkpoint format (if available)

Q: Shared expert layers not loading

A: Ensure your vLLM version supports Qwen3-Next MoE architecture. You may need to:

  1. Update vLLM to the latest version
  2. Apply custom patches for shared expert handling
  3. Use --trust-remote-code flag (already in script)

Q: Model still unloads (like Ollama)

A: This shouldn't happen with vLLM. If it does:

  1. Check vLLM server logs for memory pressure
  2. Verify --gpu-memory-utilization is not too high
  3. Check for other processes using GPU memory
  4. Ensure vLLM server process is not being killed

πŸ“ˆ Comparison: Ollama vs TensorRT vs vLLM

| Feature | Ollama | TensorRT-LLM | vLLM |
|---|---|---|---|
| 80B model loading | ❌ Intermittent unloading | ❌ Failed (MoE issues) | βœ… Permanent VRAM reservation |
| Throughput | Sequential | N/A (failed) | βœ… PagedAttention parallelism |
| Blackwell support | Generic | βœ… Optimized | βœ… sm_121 native acceleration |
| Architecture | Dense-focused | ❌ MoE incomplete | βœ… MoE optimized (shared experts) |
| Quantization | Limited options | Good | βœ… GPTQ-Int4A16 optimized |
| Memory efficiency | ~100GB+ VRAM | N/A | βœ… ~80GB VRAM (with quantization) |
| API compatibility | Custom | Custom | βœ… OpenAI-compatible |
| Stability | ❌ Unloading issues | ❌ Failed | βœ… Stable |

πŸ“ Key Learnings

Why Ollama Failed

  1. Memory Management: Ollama's design prioritizes flexibility over permanence
  2. Multi-Model Focus: Optimized for switching between models, not keeping one loaded
  3. No Guarantee: No reliable way to ensure 80B model stays in VRAM

Why TensorRT-LLM Failed

  1. MoE Support: Incomplete implementation for Qwen3-Next's MoE architecture
  2. Shared Experts: Not properly handled in TensorRT-LLM's weight loader
  3. Architecture Mismatch: TensorRT-LLM is designed primarily for dense models

Why vLLM Succeeded

  1. Permanent Loading: Model loaded once, stays in VRAM permanently
  2. MoE Native: Full support for Qwen3-Next's MoE architecture
  3. OpenAI API: Easy migration path from existing code
  4. Active Development: Regular updates and Qwen3-Next support
  5. Blackwell Optimized: Native support for sm_121 architecture

🎯 Migration Checklist

Pre-Migration

  • Identify all Ollama API calls in your codebase
  • Document current model loading/unloading behavior
  • Verify vLLM server can be installed and run
  • Test vLLM with a smaller model first

Migration Steps

  1. Install vLLM:

    pip install vllm
    
  2. Start vLLM Server:

    bash bin/qwen3_next_80b_gptq.sh
    
  3. Update Service Class:

    • Replace OllamaService with VLLMService
    • Update API endpoints from /api/chat to /v1/chat/completions
    • Update JSON payload format
  4. Update Configuration:

    • Change OLLAMA_HOST to VLLM_HOST
    • Update model name resolution logic
  5. Test Thoroughly:

    • Verify model stays loaded
    • Test response generation
    • Monitor memory usage
    • Check for any unloading issues

Post-Migration

  • Monitor for 24-48 hours to ensure stability
  • Verify no model unloading occurs
  • Check performance metrics
  • Update documentation



πŸŽ‰ Success Metrics

After migrating to vLLM:

  • βœ… Zero model unloading incidents (previously daily occurrences)
  • βœ… Stable service (no more "model not loaded" errors)
  • βœ… Better throughput (PagedAttention parallelism)
  • βœ… Lower latency (no model reload overhead)
  • βœ… Simpler architecture (no keep-alive logic needed)

Last Updated: 2025-12-31
Author: Jong-Seong Kim (κΉ€μ’…μ„±)
Project: LANIKA / NIKA AI
Hardware: NVIDIA DGX Spark (GB10) - Single System Deployment