# Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4

Run a 178B-parameter MoE model with 128K context on a single NVIDIA DGX Spark, with fast prefill (37,000+ tok/s on long inputs) thanks to native Blackwell NVFP4 tensor cores.

An NVFP4-quantized version of bknyaz/Qwen3-235B-A22B-Instruct-2507-REAP, compressed from ~350 GB (BF16) to ~102 GB (NVFP4). It fits entirely in the DGX Spark's 128 GB of coherent unified memory on the GB10 Blackwell GPU, with room for a 128K-token context using an FP8 KV cache.

## Quick Start: DGX Spark (128K Context, OpenAI-Compatible Server)

```bash
# Pull the NGC vLLM container (has Blackwell NVFP4 support)
sudo docker pull nvcr.io/nvidia/vllm:26.01-py3

# Download the model (quote the extra so zsh/bash globbing doesn't break it)
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
  Banana-Bae/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4 \
  --local-dir ~/models/Qwen3-REAP-nvfp4

# Serve with 128K context
sudo docker run -d --gpus all --name qwen3-nvfp4 \
  -p 8000:8000 \
  -v ~/models/Qwen3-REAP-nvfp4:/model \
  --shm-size=16g \
  nvcr.io/nvidia/vllm:26.01-py3 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --gpu-memory-utilization 0.93 \
    --max-model-len 131072 \
    --max-num-seqs 1 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --trust-remote-code
```

The server takes roughly 10 minutes to load the model; you can monitor progress with `sudo docker logs -f qwen3-nvfp4` and confirm readiness with `curl http://localhost:8000/v1/models`. Once ready, it exposes a standard OpenAI-compatible API.

## Making Requests

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

### curl

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"Hello!","max_tokens":100,"temperature":0.7}'
```
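For token-by-token output, add `"stream": true` to the request body; the server then emits Server-Sent Events. A minimal sketch of pulling text out of that stream — the `extract_text` helper is illustrative, not part of any SDK, and assumes vLLM's OpenAI-compatible `/v1/completions` chunk format:

```python
import json

def extract_text(sse_line: str):
    """Return the generated text from one SSE line, or None for non-data lines."""
    if not sse_line.startswith("data: ") or sse_line == "data: [DONE]":
        return None
    chunk = json.loads(sse_line[len("data: "):])
    return chunk["choices"][0].get("text", "")

# Example: chunks shaped like the server's streamed completion events
lines = [
    'data: {"choices": [{"text": "Hello", "index": 0}]}',
    'data: {"choices": [{"text": " world", "index": 0}]}',
    "data: [DONE]",
]
print("".join(t for t in (extract_text(l) for l in lines) if t))  # Hello world
```

The same parsing applies to the benchmark script below, which counts these `data:` chunks to measure decode speed.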

## DGX Spark Performance

Tested on NVIDIA DGX Spark (ARM64 Grace CPU + GB10 Blackwell GPU, 128 GB unified LPDDR5x memory).

### Inference Speed

| Prompt Length | Prefill (tok/s) | Decode (tok/s) | Time to First Token |
|---|---|---|---|
| ~50 tokens | 46 | 12.7 | 0.24s |
| ~200 tokens | 239 | 12.4 | 0.16s |
| ~500 tokens | 595 | 12.3 | 0.17s |
| ~4,000 tokens | 37,723 | 12.3 | 0.16s |

Decode speed is consistent at ~12.4 tok/s regardless of context length, while prefill throughput scales with prompt size because prompt tokens are processed in parallel.
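Those two rates are enough to estimate end-to-end latency for a given request. A rough sketch, using the measured figures from the table above (fixed overhead like scheduling is ignored, which is why measured TTFT is slightly higher):

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (TTFT, total) in seconds: prefill time plus decode time."""
    ttft = prompt_tokens / prefill_tps
    return ttft, ttft + output_tokens / decode_tps

# ~4K-token prompt, 256 generated tokens, at the measured rates
ttft, total = estimate_latency(4000, 256, 37_723, 12.3)
print(f"TTFT ~{ttft:.2f}s, total ~{total:.1f}s")  # TTFT ~0.11s, total ~20.9s
```

Decode time dominates at this batch size, so long-context use cases pay almost nothing extra for prefill.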

### Memory Breakdown (128K Context)

| Component | Size |
|---|---|
| Model weights (NVFP4) | 95.2 GiB |
| KV cache (FP8, 146K tokens) | 13.1 GiB |
| Runtime overhead | ~7.7 GiB |
| Free | ~3 GiB |
| **Total unified memory** | **119 GiB** |
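The KV-cache figure can be sanity-checked from the architecture. Assuming the base Qwen3-235B-A22B config (94 layers, 4 KV heads via GQA, head dim 128 — taken from the upstream model card, not this repo), FP8 stores one byte per element:

```python
def kv_cache_gib(tokens, layers=94, kv_heads=4, head_dim=128, bytes_per_elem=1):
    """KV cache size in GiB: keys + values (the factor of 2) across all layers."""
    bytes_total = tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem
    return bytes_total / 2**30

print(f"{kv_cache_gib(146_000):.1f} GiB")  # 13.1 GiB, matching the table
```

With BF16 KV cache (`bytes_per_elem=2`) the same context would need ~26 GiB, which is why `--kv-cache-dtype fp8` is what makes 128K fit on the Spark.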

## How to Reproduce Benchmarks

Save this as benchmark.py and run on the DGX Spark while the server is running:

```python
import requests, time

base = "http://localhost:8000/v1"
model_id = "/model"

tests = [
    ("50 tokens", "Explain what a neural network is in one sentence."),
    ("200 tokens", "Write a detailed explanation of how gradient descent works."),
    ("500 tokens", "Write a comprehensive guide on transformer architectures, "
     "encoder vs decoder models, LLM training, and inference optimizations."),
    ("~4K tokens", ("The quick brown fox jumps over the lazy dog. " * 500)
     + "\n\nSummarize the above."),
]

for label, prompt in tests:
    t0 = time.time()
    r = requests.post(f"{base}/completions", json={
        "model": model_id, "prompt": prompt,
        "max_tokens": 256, "temperature": 0.0, "stream": True
    }, stream=True)

    first_token_time = None
    token_count = 0
    for line in r.iter_lines():
        if line:
            line = line.decode()
            # Each SSE data chunk carries roughly one generated token
            if line.startswith("data: ") and line != "data: [DONE]":
                if first_token_time is None:
                    first_token_time = time.time()
                token_count += 1

    end_time = time.time()
    ttft = first_token_time - t0 if first_token_time else -1
    gen_time = end_time - first_token_time if first_token_time else -1
    decode_tps = token_count / gen_time if gen_time > 0 else 0
    # Rough estimate: ~1.3 tokens per whitespace-separated word
    prompt_tokens = int(len(prompt.split()) * 1.3)
    prefill_tps = prompt_tokens / ttft if ttft > 0 else 0

    print(f"{label:>12} | prefill: {prefill_tps:>8.1f} tok/s | "
          f"decode: {decode_tps:>5.1f} tok/s | TTFT: {ttft:.2f}s | "
          f"generated: {token_count} tokens")
```

## Quantization Details

| Property | Value |
|---|---|
| Method | NVIDIA Model Optimizer (nvidia-modelopt v0.39.0) |
| Format | NVFP4 (E2M1, 4-bit float with two-level FP8 micro-block scaling) |
| Block size | 16 elements |
| Calibration | 256 samples (128 GSM8K math + 128 CNN/DailyMail general text) |
| Calibration time | 78.8 min on 8x NVIDIA H100 80GB |
| Source precision | BF16 (~350 GB) |
| Quantized size | ~102 GB (3.4x compression) |
| Excluded layers | lm_head, all MoE gate layers (kept in full precision) |
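The 3.4x figure is consistent with NVFP4's storage overhead. A back-of-the-envelope sketch (ignoring the small per-tensor scale and the excluded full-precision layers, which pull the realized ratio slightly below the ideal):

```python
def nvfp4_bits_per_param(block_size=16, scale_bits=8, elem_bits=4):
    """Effective storage per weight: a 4-bit value plus one FP8 scale per block."""
    return elem_bits + scale_bits / block_size

bits = nvfp4_bits_per_param()  # 4.5 bits/param
ideal_ratio = 16 / bits        # vs. BF16's 16 bits/param
print(f"{bits} bits/param, ideal compression {ideal_ratio:.2f}x")
```

The realized 350 GB → 102 GB is ~3.4x rather than the ideal ~3.56x because lm_head and the MoE gates are not quantized.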

## Accuracy (B200 Native NVFP4)

Evaluated with lm_eval + vLLM on NVIDIA B200 GPUs.

| Benchmark | BF16 | NVFP4 | Delta |
|---|---|---|---|
| GSM8K (CoT, 8-shot) | 90.07% | 89.61% | -0.46% |
| GPQA Diamond (0-shot) | 41.92% | 39.39% | -2.53% |
| IFEval (inst_level_loose) | 72.78% | 71.46% | -1.32% |

## Hardware Requirements

| Hardware | Context | Notes |
|---|---|---|
| 1x DGX Spark (GB10) | 128K | Native NVFP4 + FP8 KV cache, single request |
| 2x B200 192GB | 128K+ | Native NVFP4 with TP=2, concurrent requests |

## Base Model

bknyaz/Qwen3-235B-A22B-Instruct-2507-REAP is a 25% expert-pruned variant of Qwen3-235B-A22B-Instruct-2507 that reduces the MoE experts from 128 to 96 per layer (235B → 178B params) while retaining ≥99% of the original model's performance via REAP (Router-weighted Expert Activation Pruning).
