# Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4
Run a 178B-parameter MoE model with 128K context on a single NVIDIA DGX Spark. Blazing-fast prefill of 37,000+ tok/s on long inputs, thanks to native Blackwell NVFP4 tensor cores.
NVFP4 quantized version of bknyaz/Qwen3-235B-A22B-Instruct-2507-REAP. Compressed from 350 GB (BF16) to 102 GB (NVFP4) — fits entirely in the DGX Spark's 128 GB of coherent unified memory on the GB10 Blackwell GPU, with room for 128K token context using FP8 KV cache.
## Quick Start: DGX Spark (128K Context, OpenAI-Compatible Server)
```bash
# Pull the NGC vLLM container (has Blackwell NVFP4 support)
sudo docker pull nvcr.io/nvidia/vllm:26.01-py3

# Download the model
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
  Banana-Bae/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4 \
  --local-dir ~/models/Qwen3-REAP-nvfp4

# Serve with 128K context
sudo docker run -d --gpus all --name qwen3-nvfp4 \
  -p 8000:8000 \
  -v ~/models/Qwen3-REAP-nvfp4:/model \
  --shm-size=16g \
  nvcr.io/nvidia/vllm:26.01-py3 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --gpu-memory-utilization 0.93 \
    --max-model-len 131072 \
    --max-num-seqs 1 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --trust-remote-code
```
The server takes ~10 minutes to load the model. Once ready, it exposes a standard OpenAI-compatible API.
## Making Requests

### Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
### curl
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"Hello!","max_tokens":100,"temperature":0.7}'
```
## DGX Spark Performance
Tested on NVIDIA DGX Spark (ARM64 Grace CPU + GB10 Blackwell GPU, 128 GB unified LPDDR5x memory).
### Inference Speed
| Prompt Length | Prefill (tok/s) | Decode (tok/s) | Time to First Token |
|---|---|---|---|
| ~50 tokens | 46 | 12.7 | 0.24s |
| ~200 tokens | 239 | 12.4 | 0.16s |
| ~500 tokens | 595 | 12.3 | 0.17s |
| ~4,000 tokens | 37,723 | 12.3 | 0.16s |
Decode speed holds steady at ~12.3–12.7 tok/s regardless of context length. Prefill throughput scales with prompt size because prompt tokens are processed in parallel, so larger prompts keep the tensor cores saturated.
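The flat decode curve is what you would expect from a memory-bandwidth-bound MoE: every generated token must stream the active expert weights from memory. A back-of-envelope sketch (all constants below are assumptions for illustration, not measurements from this card):

```python
# Back-of-envelope decode estimate for a bandwidth-bound MoE decoder.
# All constants below are assumptions for illustration, not measured values.
active_params = 22e9        # ~22B parameters active per token (the "A22B")
bytes_per_param = 0.5       # NVFP4: 4 bits per weight
mem_bandwidth = 273e9       # DGX Spark LPDDR5x peak, ~273 GB/s (assumed)
efficiency = 0.5            # fraction of peak bandwidth realistically achieved

bytes_per_token = active_params * bytes_per_param
ceiling = mem_bandwidth / bytes_per_token
print(f"weights streamed per token: {bytes_per_token / 1e9:.1f} GB")
print(f"bandwidth-limited ceiling: {ceiling:.1f} tok/s")
print(f"at {efficiency:.0%} efficiency: {ceiling * efficiency:.1f} tok/s")
```

Under these assumptions the ideal ceiling works out to ~24.8 tok/s and the 50%-efficiency estimate to ~12.4 tok/s, close to the measured decode speed, which is why decode barely changes with context length.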
### Memory Breakdown (128K Context)
| Component | Size |
|---|---|
| Model weights (NVFP4) | 95.2 GiB |
| KV cache (FP8, 146K tokens) | 13.1 GiB |
| Runtime overhead | ~7.7 GiB |
| Free | ~3 GiB |
| Total unified memory | 119 GiB |
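As a quick cross-check, the rows add up to the 119 GiB vLLM sees, and the table also implies a per-token KV-cache cost (figures taken directly from the table above):

```python
# Sanity-check the memory table (all figures in GiB, taken from above).
weights = 95.2    # NVFP4 model weights
kv_cache = 13.1   # FP8 KV cache
overhead = 7.7    # runtime buffers, activations
free = 3.0        # headroom

total = weights + kv_cache + overhead + free
print(f"accounted for: {total:.1f} GiB")  # 119.0 GiB

# Implied FP8 KV-cache cost per token at the 146K-token capacity:
per_token_kib = kv_cache * 1024**3 / 146_000 / 1024
print(f"KV cache: ~{per_token_kib:.0f} KiB per token")
```

The implied ~94 KiB per token is why FP8 KV cache (half the BF16 cost) is what makes the full 128K window fit alongside the weights.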
## How to Reproduce Benchmarks

Save the following as `benchmark.py` and run it on the DGX Spark while the server is running:
```python
import requests, time

base = "http://localhost:8000/v1"
model_id = "/model"

tests = [
    ("50 tokens", "Explain what a neural network is in one sentence."),
    ("200 tokens", "Write a detailed explanation of how gradient descent works."),
    ("500 tokens", "Write a comprehensive guide on transformer architectures, "
                   "encoder vs decoder models, LLM training, and inference optimizations."),
    ("~4K tokens", ("The quick brown fox jumps over the lazy dog. " * 500)
                   + "\n\nSummarize the above."),
]

for label, prompt in tests:
    t0 = time.time()
    r = requests.post(f"{base}/completions", json={
        "model": model_id, "prompt": prompt,
        "max_tokens": 256, "temperature": 0.0, "stream": True,
    }, stream=True)
    first_token_time = None
    token_count = 0
    for line in r.iter_lines():
        if line:
            line = line.decode()
            if line.startswith("data: ") and line != "data: [DONE]":
                if first_token_time is None:
                    first_token_time = time.time()
                token_count += 1
    end_time = time.time()
    ttft = first_token_time - t0 if first_token_time else -1
    gen_time = end_time - first_token_time if first_token_time else -1
    decode_tps = token_count / gen_time if gen_time > 0 else 0
    # Rough prompt-token estimate: ~1.3 tokens per whitespace-separated word.
    prompt_tokens = int(len(prompt.split()) * 1.3)
    prefill_tps = prompt_tokens / ttft if ttft > 0 else 0
    print(f"{label:>12} | prefill: {prefill_tps:>8.1f} tok/s | "
          f"decode: {decode_tps:>5.1f} tok/s | TTFT: {ttft:.2f}s | "
          f"generated: {token_count} tokens")
```
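Note that `token_count` above counts SSE chunks rather than true tokens (close in practice, since vLLM typically streams about one token per chunk). The stream-parsing loop can be factored into a small helper that is testable offline; the payload layout below follows the OpenAI completions SSE schema:

```python
import json

def parse_sse_completion(lines):
    """Count completion chunks and collect text from an OpenAI-style SSE stream.

    `lines` is an iterable of decoded strings like 'data: {...}', as produced
    by r.iter_lines() in the benchmark script above.
    """
    chunks, text = 0, []
    for line in lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        payload = json.loads(line[len("data: "):])
        chunks += 1
        text.append(payload["choices"][0].get("text", ""))
    return chunks, "".join(text)

# Synthetic stream for illustration:
sample = [
    'data: {"choices": [{"text": "Hello"}]}',
    'data: {"choices": [{"text": ", world"}]}',
    "data: [DONE]",
]
print(parse_sse_completion(sample))  # → (2, 'Hello, world')
```

For exact prompt-token counts, a non-streaming request's `usage.prompt_tokens` field is more reliable than the word-count heuristic in the benchmark.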
## Quantization Details
| Property | Value |
|---|---|
| Method | NVIDIA Model Optimizer (nvidia-modelopt v0.39.0) |
| Format | NVFP4 (E2M1, 4-bit float with two-level FP8 micro-block scaling) |
| Block size | 16 elements |
| Calibration | 256 samples (128 GSM8K math + 128 CNN DailyMail general text) |
| Calibration time | 78.8 min on 8x NVIDIA H100 80GB |
| Source precision | BF16 (~350 GB) |
| Quantized size | ~102 GB (3.4x compression) |
| Excluded layers | lm_head, all MoE gate layers (kept in full precision) |
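For intuition, here is a minimal fake-quantization sketch of the block format (illustrative only: it keeps block scales in full precision, whereas real NVFP4 stores each block scale as FP8 E4M3 under a per-tensor FP32 scale, and it ignores the packed bit layout entirely):

```python
import numpy as np

# Magnitudes representable by an E2M1 (4-bit float) value.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])  # signed value grid

def nvfp4_fake_quant(x, block=16):
    """Fake-quantize a 1-D array with per-16-element block scaling."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0  # map block amax to 6
    scale[scale == 0] = 1.0                             # guard all-zero blocks
    scaled = x / scale                                  # now within [-6, 6]
    idx = np.abs(scaled[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(64)
wq = nvfp4_fake_quant(w)
print(f"max abs error: {np.abs(w - wq).max():.3f}")
```

The small 16-element block keeps each scale close to its values' local range, which is what lets a 4-bit format track outlier-heavy weight distributions.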
## Accuracy (B200 Native NVFP4)
Evaluated with lm_eval + vLLM on NVIDIA B200 GPUs.
| Benchmark | BF16 | NVFP4 | Delta |
|---|---|---|---|
| GSM8K (CoT, 8-shot) | 90.07% | 89.61% | -0.46% |
| GPQA Diamond (0-shot) | 41.92% | 39.39% | -2.53% |
| IFEval (inst_level_loose) | 72.78% | 71.46% | -1.32% |
## Hardware Requirements
| Hardware | Context | Notes |
|---|---|---|
| 1x DGX Spark (GB10) | 128K | Native NVFP4 + FP8 KV cache, single request |
| 2x B200 192GB | 128K+ | Native NVFP4 with TP=2, concurrent requests |
## Base Model
bknyaz/Qwen3-235B-A22B-Instruct-2507-REAP is a 25% expert-pruned variant of Qwen3-235B-A22B-Instruct-2507, reducing MoE experts from 128 to 96 per layer (235B → 178B params) while retaining ≥99% of the original performance via REAP (Router-weighted Expert Activation Pruning).
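The idea can be sketched as follows (a toy illustration of router-weighted expert scoring, not the paper's exact saliency criterion or any code from the REAP release):

```python
import numpy as np

def reap_scores(gates, out_norms):
    """Toy saliency: average (router weight x expert output norm) over the
    tokens each expert actually serves. A zero gate means not routed there."""
    contrib = (gates * out_norms).sum(axis=0)        # (experts,)
    served = np.maximum((gates > 0).sum(axis=0), 1)  # tokens per expert
    return contrib / served

rng = np.random.default_rng(0)
tokens, experts, k = 1000, 8, 2

# Toy top-k routing: keep the 2 largest gate values per token, zero the rest.
gates = rng.random((tokens, experts))
kth = np.sort(gates, axis=1)[:, -k][:, None]
gates = np.where(gates >= kth, gates, 0.0)

out_norms = rng.random((tokens, experts)) + 0.5      # stand-in for ||f_j(x)||

scores = reap_scores(gates, out_norms)
n_prune = experts // 4                               # drop the lowest 25%
print("prune experts:", np.argsort(scores)[:n_prune].tolist())
```

Experts that are rarely routed to, or that contribute little when they are, score lowest and are the ones removed.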