# vLLM Migration Guide: Qwen3-Next-80B on NVIDIA Blackwell (GB10)

### *From Ollama Unloading Issues to vLLM Success*

## 📋 Overview

This guide documents our journey deploying **Qwen3-Next-80B** on the **NVIDIA DGX Spark (GB10)**: from critical model unloading issues with Ollama, through failed TensorRT-LLM attempts, to a stable deployment with vLLM.

**The Problem**: On a single DGX Spark (GB10) system, Ollama would intermittently unload the 80B model, causing service disruptions and requiring constant monitoring. TensorRT-LLM attempts also failed due to compatibility issues with Qwen3-Next's MoE architecture.

**The Solution**: vLLM with an OpenAI-compatible API, providing permanent VRAM reservation, PagedAttention parallelism, and native Blackwell (sm_121) acceleration.

---

## 🚨 The Journey: From Ollama to vLLM

### Phase 1: The Ollama Unloading Problem

**Initial Setup**: We deployed Qwen3-Next-80B using Ollama on a DGX Spark (GB10) with 120GB VRAM.

**The Issue**:
- The model would unload unexpectedly during idle periods
- Requests arriving after an unload would fail
- Constant "keep-alive" requests were required to prevent unloading
- There was no reliable way to keep the model permanently in VRAM

**Symptoms**:

```
[Ollama] Model "nika" unloaded from memory
[Service] Request failed: Model not loaded
[Service] Attempting to reload model...
[Service] Reload takes 30-60 seconds
```

**Root Cause**: Ollama's memory management is designed for multi-model scenarios and doesn't guarantee permanent model retention, especially for a large 80B model.
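Before the migration, our only mitigation was a periodic keep-alive ping. A minimal sketch of that workaround is below; the `/api/generate` endpoint and `keep_alive: -1` semantics are Ollama's, while the interval, the stdlib HTTP client, and the `keep_alive` helper name are our own assumptions:

```python
import json
import threading
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default Ollama port; adjust as needed

def keep_alive(model: str = "nika", interval_s: float = 120.0) -> threading.Timer:
    """Periodically re-ping Ollama so the model is not evicted while idle."""
    # An empty prompt with keep_alive=-1 asks Ollama to (re)load the model
    # and keep it resident indefinitely.
    body = json.dumps({"model": model, "prompt": "", "keep_alive": -1}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=30)
    except OSError:
        pass  # server unreachable: simply retry on the next tick
    timer = threading.Timer(interval_s, keep_alive, args=(model, interval_s))
    timer.daemon = True
    timer.start()
    return timer
```

With vLLM this loop disappears entirely: the server holds the model in VRAM for its whole lifetime, so there is nothing to keep alive.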
---

### Phase 2: TensorRT-LLM Attempt

**Why We Tried TensorRT-LLM**:
- Promised better performance on the Blackwell architecture
- Lower latency potential
- Better memory efficiency

**The Failure**:
- Qwen3-Next's MoE (Mixture of Experts) architecture was not fully supported
- Shared expert layers caused loading failures
- Weight key mapping issues (`model.` prefix mismatch)
- Incomplete MoE routing implementation

**Error Messages**:

```
[TensorRT-LLM] Error: Shared expert layers not recognized
[TensorRT-LLM] Error: Weight key mismatch: model.layers.0 vs layers.0
[TensorRT-LLM] Error: MoE routing not implemented for Qwen3-Next
```

**Decision**: We abandoned TensorRT-LLM due to its architectural incompatibility with Qwen3-Next's MoE structure.

---

### Phase 3: vLLM Success

**Why vLLM Worked**:
- Native support for MoE architectures
- Permanent VRAM reservation (the model stays loaded)
- OpenAI-compatible API (easy migration)
- PagedAttention for high throughput
- Active development and Qwen3-Next support

**The Solution**:
- The vLLM server runs as a separate service
- The model is loaded once at server start and stays in VRAM permanently
- The application connects via the OpenAI-compatible API
- No more unloading issues

---

## 🏗️ Qwen3-Next-80B MoE Architecture

Understanding the architecture is crucial for deployment:

### Architecture Overview

Qwen3-Next-80B uses a sophisticated **Mixture of Experts (MoE)** architecture:

- **Total Parameters**: ~80B
- **Active Parameters (A3B)**: Only ~3B parameters are active per token
- **Shared Experts**: A hybrid approach with both "routed experts" and "shared experts"
- **Benefits**: Fast inference for its size while maintaining global knowledge

### Why This Matters

1. **Memory Efficiency**: Only ~3B parameters are active per token, but the full 80B model must stay in VRAM
2. **Shared Experts**: Maintain global knowledge across all tokens
3. **Weight Mapping**: Special handling is required for shared expert layers

---

## 🛠️ Critical Technical Challenges & Solutions

### 1. The Weight Key Mismatch (The `model.` Prefix)

**Problem**: Hugging Face checkpoints save weights with a `model.` prefix (e.g., `model.layers.10...`), while vLLM's internal `Qwen2/3` implementation expects keys to start directly with `layers.10...`.

**Solution**: vLLM's weight loader handles this automatically, but you may need to ensure checkpoint format compatibility.

**Example**:

```python
# Weight key transformation needed
# Before: "model.layers.10.attention.q_proj.weight"
# After:  "layers.10.attention.q_proj.weight"
```

### 2. Shared Expert Mapping

**Problem**: Standard loaders may fail to recognize `shared_expert` layers.

**Solution**: vLLM's `AutoWeightsLoader` correctly handles Qwen3-Next's MoE structure, including shared experts.

**Example**:

```python
# Shared expert layers in Qwen3-Next
shared_expert_layers = [
    "mlp.shared_expert.gate_proj.weight",
    "mlp.shared_expert.up_proj.weight",
    "mlp.shared_expert.down_proj.weight"
]
```

### 3. Model Name Resolution

**Problem**: The application uses friendly names like "nika", but vLLM needs actual model paths.

**Solution**: Implement model name resolution in `VLLMService`.

---

## ⚡ Blackwell (GB10) Specific Optimizations

The **DGX Spark (GB10)** is the first hardware to support **NVFP4**. To maximize performance for an 80B model:

### 1. Quantization Strategy: GPTQ-Int4A16

Using **GPTQ Int4 weight-only quantization** allows the 80B model to fit comfortably in 120GB VRAM:

- Model weights: ~40GB (quantized)
- KV cache: ~40GB (for high throughput)
- System overhead: ~40GB
- **Total**: ~120GB (a perfect fit for the GB10)

### 2. Critical Environment Flags

For Blackwell (sm_121), these environment variables are **mandatory**:

```bash
# Force the Triton compiler to find the correct Blackwell ptxas
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_USE_FLASHINFER_SAMPLER=1   # Enable Blackwell-optimized kernels
export VLLM_USE_FLASHINFER_MOE=0       # Temporary workaround for MoE kernels if needed
export TRITON_INTERPRET=0              # Disable interpreter mode (critical for performance)
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```

---

## 💻 Implementation Guide

### 1. vLLM Server Setup

**File**: `bin/qwen3_next_80b_gptq.sh`

```bash
#!/bin/bash
# Optimized for Qwen3-Next-80B-GPTQ-Int4 on DGX Spark (Blackwell)

# 1. CUDA 13.0 pathing (must include nvcc for JIT)
export CUDA_HOME=/usr/local/cuda-13.0
export CUDA_PATH=/usr/local/cuda-13.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=$CUDA_HOME/include
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include

# 2. Prevent Triton 'resize_()' errors by disabling interpreter mode
export TRITON_INTERPRET=0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# 3. Blackwell-specific optimizations
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_USE_FLASHINFER_MOE=0   # Disable if MoE kernels cause issues
export TRITON_DISABLE_LINE_INFO=1
export TORCH_COMPILE_DEBUG=0
export TORCHDYNAMO_DISABLE=1

# 4. Launch with Blackwell-specific tuning
# --enforce-eager bypasses CUDA graph capture overhead on the new sm_121 arch
python3 -m vllm.entrypoints.openai.api_server \
    --model dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --host 0.0.0.0 \
    --port 1107
```

**Key Parameters**:
- `--gpu-memory-utilization 0.92`: Leaves room for the KV cache (critical for an 80B model)
- `--enforce-eager`: Better stability on Blackwell (bypasses CUDA graph capture)
- `--max-model-len 8192`: Adjust based on your use case

### 2. Application Service Class

**File**: `api/services/vllm_service.py`

```python
import logging
import threading
import time
from typing import Any, Dict, List, Optional

import requests

logger = logging.getLogger(__name__)


class VLLMService:
    """vLLM API Service (OpenAI Compatible)"""

    def __init__(self, host: str, model_name: str, embedding_model: str):
        self.host = host.rstrip('/')
        self.embedding_model = embedding_model
        self.api_base = f"{self.host}/v1"
        self._warmed_models = set()
        self._warm_up_lock = threading.Lock()
        # Convert model name to actual vLLM model path
        self.model_name = self._resolve_model_name(model_name)

    def _resolve_model_name(self, model_name: str) -> str:
        """
        Convert a model name to the actual vLLM model path:
        - "nika" -> actual model path available on the vLLM server
        - If already a full path, return as-is
        """
        if "/" in model_name:
            # Already a full path
            return model_name

        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            if response.status_code == 200:
                models_data = response.json()
                for model_info in models_data.get("data", []):
                    model_id = model_info.get("id")
                    if not model_id:
                        continue
                    # Special logic for NIKA (Qwen3-Next-80B)
                    if "nika" in model_name.lower() and "qwen3" in model_id.lower():
                        logger.info(f"✓ Resolved NIKA to Qwen3-Next-80B: {model_id}")
                        return model_id
                    if model_name.lower() in model_id.lower():
                        return model_id
        except Exception as exc:
            logger.warning(f"Failed to resolve model name '{model_name}': {exc}")
        return model_name

    def check_connection(self) -> bool:
        """Check the vLLM server connection"""
        try:
            response = requests.get(f"{self.api_base}/models", timeout=5)
            return response.status_code == 200
        except Exception:
            return False

    def generate_response(
        self,
        messages: List[Dict[str, str]],
        max_tokens: int = 512,
        temperature: float = 0.7,
        top_p: float = 0.9,
        top_k: Optional[int] = None,
        num_ctx: Optional[int] = None,
        stop: Optional[List[str]] = None,
        model_name: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate a response using the vLLM OpenAI-compatible API"""
        request_id = f"req_{int(time.time() * 1000)}"

        # Resolve the model name if provided
        if model_name:
            model_to_use = self._resolve_model_name(model_name)
        else:
            model_to_use = self.model_name

        logger.info(f"[{request_id}] Generating response via vLLM API... (model={model_to_use})")

        # Convert messages to the OpenAI-compatible format
        openai_messages = []
        for msg in messages:
            role = msg.get("role", "user")
            content = msg.get("content", "")
            openai_messages.append({"role": role, "content": content})

        # vLLM OpenAI-compatible API request
        api_url = f"{self.api_base}/chat/completions"

        # Stop sequence settings
        default_stop = ["\n\nUser:", "Observation:", "<|endoftext|>", "<|eot_id|>", "\n\n\n"]
        stop_sequences = stop if stop is not None else default_stop

        payload = {
            "model": model_to_use,
            "messages": openai_messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stop": stop_sequences,
            "stream": False
        }

        try:
            inference_start = time.time()
            response = requests.post(api_url, json=payload, timeout=300)
            inference_time = time.time() - inference_start

            if response.status_code == 200:
                result = response.json()
                choices = result.get("choices", [])
                if not choices:
                    raise RuntimeError("vLLM API returned empty choices")

                choice = choices[0]
                response_text = choice.get("message", {}).get("content", "")

                # Token usage information
                usage = result.get("usage", {})
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)
                total_tokens = usage.get("total_tokens", 0)

                # Check whether the response was truncated
                finish_reason = choice.get("finish_reason", "")
                was_truncated = finish_reason == "length" or completion_tokens >= (max_tokens * 0.95)

                logger.info(f"[{request_id}] [TIMING] vLLM API call: {inference_time*1000:.2f}ms")
                logger.info(f"[{request_id}] [STATS] Prompt: {prompt_tokens} tokens | Completion: {completion_tokens} tokens")

                return {
                    "response_text": response_text,
                    "was_truncated": was_truncated,
                    "eval_count": completion_tokens,
                    "max_tokens": max_tokens
                }
            else:
                error_msg = f"vLLM API error: {response.status_code} - {response.text}"
                logger.error(f"[{request_id}] {error_msg}")
                raise RuntimeError(error_msg)
        except requests.exceptions.RequestException as exc:
            error_msg = f"Request error: {exc}"
            logger.error(f"[{request_id}] {error_msg}")
            raise
```

### 3. API Endpoint Migration

**Before (Ollama)**:

```python
# Ollama endpoint
response = requests.post(
    f"{ollama_host}/api/chat",
    json={
        "model": "nika",
        "messages": [...],
        "options": {
            "num_predict": 512,
            "temperature": 0.7,
            "keep_alive": -1  # Try to keep the model loaded
        }
    }
)
```

**After (vLLM)**:

```python
# vLLM OpenAI-compatible endpoint
response = requests.post(
    f"{vllm_host}/v1/chat/completions",
    json={
        "model": "/root/.cache/huggingface/hub/models--dazipe--Qwen3-Next-80B...",
        "messages": [...],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9
        # No keep_alive needed - the model stays loaded permanently
    }
)
```

---

## 🔍 Troubleshooting

### Q: Why does `nvidia-smi` show "Failed to initialize NVML"?

**A**: This is often due to a "zombie" process holding the GB10 SoC. Run:

```bash
sudo fuser -v /dev/nvidia*
```

Kill any PIDs found. If the problem persists, a cold reboot of the DGX Spark is required to re-initialize the Blackwell firmware.

### Q: vLLM is extremely slow (1 token/sec).

**A**: You are likely in **Triton Interpreter Mode**. This happens if `nvcc` is not found during startup.
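One quick way to confirm whether the toolchain is actually visible to the Python process is a check like the following sketch; the `cuda_toolchain_visible` helper is our own, and the environment variables are the ones used throughout this guide:

```python
import os
import shutil

def cuda_toolchain_visible() -> bool:
    """Report the toolchain state Triton sees; False suggests interpreter-mode risk."""
    nvcc = shutil.which("nvcc")
    ptxas = os.environ.get("TRITON_PTXAS_PATH") or shutil.which("ptxas")
    interpret = os.environ.get("TRITON_INTERPRET", "0")
    print(f"nvcc:  {nvcc}")
    print(f"ptxas: {ptxas}")
    print(f"TRITON_INTERPRET={interpret}")
    return nvcc is not None and ptxas is not None and interpret != "1"
```

If this returns `False`, fix the `PATH` and `TRITON_PTXAS_PATH` exports before restarting the server.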
Ensure `cuda-toolkit-13-0` is installed and the `PATH` is correctly set:

```bash
which nvcc
# Should output: /usr/local/cuda-13.0/bin/nvcc
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
```

### Q: Model loading fails with "KeyError: 'model.layers.0'"

**A**: This is the weight key mismatch issue: the checkpoint has the `model.` prefix but vLLM expects keys without it. Solutions:

1. Use a checkpoint converter to strip the prefix
2. Patch vLLM's weight loader to handle both formats
3. Use a different checkpoint format (if available)

### Q: Shared expert layers are not loading

**A**: Ensure your vLLM version supports the Qwen3-Next MoE architecture. You may need to:

1. Update vLLM to the latest version
2. Apply custom patches for shared expert handling
3. Use the `--trust-remote-code` flag (already in the script)

### Q: The model still unloads (like Ollama)

**A**: This shouldn't happen with vLLM. If it does:

1. Check the vLLM server logs for memory pressure
2. Verify `--gpu-memory-utilization` is not set too high
3. Check for other processes using GPU memory
4. Ensure the vLLM server process is not being killed

---

## 📈 Comparison: Ollama vs TensorRT-LLM vs vLLM

| Feature | Ollama | TensorRT-LLM | vLLM |
| --- | --- | --- | --- |
| **80B Model Loading** | ❌ Intermittent unloading | ❌ Failed (MoE issues) | ✅ **Permanent VRAM reservation** |
| **Throughput** | Sequential | N/A (failed) | ✅ **PagedAttention parallelism** |
| **Blackwell Support** | Generic | ✅ Optimized | ✅ **sm_121 native acceleration** |
| **Architecture** | Dense-focused | ❌ MoE incomplete | ✅ **MoE optimized (shared experts)** |
| **Quantization** | Limited options | Good | ✅ **GPTQ-Int4A16 optimized** |
| **Memory Efficiency** | ~100GB+ VRAM | N/A | ✅ **~80GB VRAM (with quantization)** |
| **API Compatibility** | Custom | Custom | ✅ **OpenAI compatible** |
| **Stability** | ❌ Unloading issues | ❌ Failed | ✅ **Stable** |

---

## 📝 Key Learnings

### Why Ollama Failed

1. **Memory Management**: Ollama's design prioritizes flexibility over permanence
2. **Multi-Model Focus**: Optimized for switching between models, not keeping one loaded
3. **No Guarantee**: No reliable way to ensure an 80B model stays in VRAM

### Why TensorRT-LLM Failed

1. **MoE Support**: Incomplete implementation for Qwen3-Next's MoE architecture
2. **Shared Experts**: Not properly handled in TensorRT-LLM's weight loader
3. **Architecture Mismatch**: Designed primarily for dense models

### Why vLLM Succeeded

1. **Permanent Loading**: The model is loaded once and stays in VRAM permanently
2. **MoE Native**: Full support for Qwen3-Next's MoE architecture
3. **OpenAI API**: An easy migration path from existing code
4. **Active Development**: Regular updates and Qwen3-Next support
5. **Blackwell Optimized**: Native support for the sm_121 architecture

---

## 🎯 Migration Checklist

### Pre-Migration

- [ ] Identify all Ollama API calls in your codebase
- [ ] Document the current model loading/unloading behavior
- [ ] Verify the vLLM server can be installed and run
- [ ] Test vLLM with a smaller model first

### Migration Steps

1. **Install vLLM**:

   ```bash
   pip install vllm
   ```

2. **Start the vLLM Server**:

   ```bash
   bash bin/qwen3_next_80b_gptq.sh
   ```

3. **Update the Service Class**:
   - Replace `OllamaService` with `VLLMService`
   - Update API endpoints from `/api/chat` to `/v1/chat/completions`
   - Update the JSON payload format

4. **Update Configuration**:
   - Change `OLLAMA_HOST` to `VLLM_HOST`
   - Update the model name resolution logic

5. **Test Thoroughly**:
   - Verify the model stays loaded
   - Test response generation
   - Monitor memory usage
   - Check for any unloading issues

### Post-Migration

- [ ] Monitor for 24-48 hours to ensure stability
- [ ] Verify no model unloading occurs
- [ ] Check performance metrics
- [ ] Update documentation

---

## 📚 References

### vLLM Documentation

- [vLLM Official Documentation](https://docs.vllm.ai/)
- [OpenAI Compatible API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

### Qwen3-Next Model

- [Qwen3-Next-80B Model Card](https://huggingface.co/dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16)

### Blackwell Architecture

- [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/blackwell/)

---

## 🎉 Success Metrics

After migrating to vLLM:

- ✅ **Zero model unloading incidents** (previously a daily occurrence)
- ✅ **Stable service** (no more "model not loaded" errors)
- ✅ **Better throughput** (PagedAttention parallelism)
- ✅ **Lower latency** (no model reload overhead)
- ✅ **Simpler architecture** (no keep-alive logic needed)

---

**Last Updated**: 2025-12-31
**Author**: [Jong-Seong Kim (김종성)](https://huggingface.co/dazipe)
**Project**: LANIKA / NIKA AI
**Hardware**: NVIDIA DGX Spark (GB10) - Single System Deployment