# Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4
Run a 178B-parameter MoE model with 128K context on a single NVIDIA DGX Spark. Blazing-fast prefill of 37,000+ tok/s on long inputs, thanks to native Blackwell NVFP4 tensor cores.
NVFP4 quantized version of bknyaz/Qwen3-235B-A22B-Instruct-2507-REAP. Compressed from 350 GB (BF16) to 102 GB (NVFP4) — fits entirely in the DGX Spark's 128 GB of coherent unified memory on the GB10 Blackwell GPU, with room for 128K token context using FP8 KV cache.
## Quick Start: DGX Spark (128K Context, OpenAI-Compatible Server)
```bash
# Pull the NGC vLLM container (has Blackwell NVFP4 support)
sudo docker pull nvcr.io/nvidia/vllm:26.01-py3

# Download the model
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
  Banana-Bae/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4 \
  --local-dir ~/models/Qwen3-REAP-nvfp4

# Serve with 128K context
sudo docker run -d --gpus all --name qwen3-nvfp4 \
  -p 8000:8000 \
  -v ~/models/Qwen3-REAP-nvfp4:/model \
  --shm-size=16g \
  nvcr.io/nvidia/vllm:26.01-py3 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --gpu-memory-utilization 0.93 \
    --max-model-len 131072 \
    --max-num-seqs 1 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --trust-remote-code
```
The server takes ~10 minutes to load the model. Once ready, it exposes a standard OpenAI-compatible API.
## Making Requests

### Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
### curl
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"Hello!","max_tokens":100,"temperature":0.7}'
```
## DGX Spark Performance
Tested on NVIDIA DGX Spark (ARM64 Grace CPU + GB10 Blackwell GPU, 128 GB unified LPDDR5x memory).
### Inference Speed
| Prompt Length | Prefill (tok/s) | Decode (tok/s) | Time to First Token |
|---|---|---|---|
| ~50 tokens | 46 | 12.7 | 0.24s |
| ~200 tokens | 239 | 12.4 | 0.16s |
| ~500 tokens | 595 | 12.3 | 0.17s |
| ~4,000 tokens | 37,723 | 12.3 | 0.16s |
Decode speed holds steady at ~12.3–12.7 tok/s regardless of context length. Prefill throughput scales with prompt size because prompt tokens are processed in parallel, so larger prompts keep the tensor cores saturated.
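The flat decode curve is what you would expect from a memory-bandwidth-bound MoE: every generated token must stream the active expert weights from memory. A back-of-envelope sketch (all constants below are assumptions for illustration, not measurements from this card):

```python
# Back-of-envelope decode estimate for a bandwidth-bound MoE decoder.
# All constants below are assumptions for illustration, not measured values.
active_params = 22e9        # ~22B parameters active per token (the "A22B")
bytes_per_param = 0.5       # NVFP4: 4 bits per weight
mem_bandwidth = 273e9       # DGX Spark LPDDR5x peak, ~273 GB/s (assumed)
efficiency = 0.5            # fraction of peak bandwidth realistically achieved

bytes_per_token = active_params * bytes_per_param
ceiling = mem_bandwidth / bytes_per_token
print(f"weights streamed per token: {bytes_per_token / 1e9:.1f} GB")
print(f"bandwidth-limited ceiling: {ceiling:.1f} tok/s")
print(f"at {efficiency:.0%} efficiency: {ceiling * efficiency:.1f} tok/s")
```

Under these assumptions the ideal ceiling works out to ~24.8 tok/s and the 50%-efficiency estimate to ~12.4 tok/s, close to the measured decode speed, which is why decode barely changes with context length.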
### Memory Breakdown (128K Context)
| Component | Size |
|---|---|
| Model weights (NVFP4) | 95.2 GiB |
| KV cache (FP8, 146K tokens) | 13.1 GiB |
| Runtime overhead | ~7.7 GiB |
| Free | ~3 GiB |
| Total unified memory | 119 GiB |
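As a quick cross-check, the rows add up to the 119 GiB vLLM sees, and the table also implies a per-token KV-cache cost (figures taken directly from the table above):

```python
# Sanity-check the memory table (all figures in GiB, taken from above).
weights = 95.2    # NVFP4 model weights
kv_cache = 13.1   # FP8 KV cache
overhead = 7.7    # runtime buffers, activations
free = 3.0        # headroom

total = weights + kv_cache + overhead + free
print(f"accounted for: {total:.1f} GiB")  # 119.0 GiB

# Implied FP8 KV-cache cost per token at the 146K-token capacity:
per_token_kib = kv_cache * 1024**3 / 146_000 / 1024
print(f"KV cache: ~{per_token_kib:.0f} KiB per token")
```

The implied ~94 KiB per token is why FP8 KV cache (half the BF16 cost) is what makes the full 128K window fit alongside the weights.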
## How to Reproduce Benchmarks

Save the following as `benchmark.py` and run it on the DGX Spark while the server is running:
```python
import requests, time

base = "http://localhost:8000/v1"
model_id = "/model"

tests = [
    ("50 tokens", "Explain what a neural network is in one sentence."),
    ("200 tokens", "Write a detailed explanation of how gradient descent works."),
    ("500 tokens", "Write a comprehensive guide on transformer architectures, "
                   "encoder vs decoder models, LLM training, and inference optimizations."),
    ("~4K tokens", ("The quick brown fox jumps over the lazy dog. " * 500)
                   + "\n\nSummarize the above."),
]

for label, prompt in tests:
    t0 = time.time()
    r = requests.post(f"{base}/completions", json={
        "model": model_id, "prompt": prompt,
        "max_tokens": 256, "temperature": 0.0, "stream": True,
    }, stream=True)
    first_token_time = None
    token_count = 0
    for line in r.iter_lines():
        if line:
            line = line.decode()
            if line.startswith("data: ") and line != "data: [DONE]":
                if first_token_time is None:
                    first_token_time = time.time()
                token_count += 1
    end_time = time.time()
    ttft = first_token_time - t0 if first_token_time else -1
    gen_time = end_time - first_token_time if first_token_time else -1
    decode_tps = token_count / gen_time if gen_time > 0 else 0
    # Rough prompt-token estimate: ~1.3 tokens per whitespace-separated word.
    prompt_tokens = int(len(prompt.split()) * 1.3)
    prefill_tps = prompt_tokens / ttft if ttft > 0 else 0
    print(f"{label:>12} | prefill: {prefill_tps:>8.1f} tok/s | "
          f"decode: {decode_tps:>5.1f} tok/s | TTFT: {ttft:.2f}s | "
          f"generated: {token_count} tokens")
```
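Note that `token_count` above counts SSE chunks rather than true tokens (close in practice, since vLLM typically streams about one token per chunk). The stream-parsing loop can be factored into a small helper that is testable offline; the payload layout below follows the OpenAI completions SSE schema:

```python
import json

def parse_sse_completion(lines):
    """Count completion chunks and collect text from an OpenAI-style SSE stream.

    `lines` is an iterable of decoded strings like 'data: {...}', as produced
    by r.iter_lines() in the benchmark script above.
    """
    chunks, text = 0, []
    for line in lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        payload = json.loads(line[len("data: "):])
        chunks += 1
        text.append(payload["choices"][0].get("text", ""))
    return chunks, "".join(text)

# Synthetic stream for illustration:
sample = [
    'data: {"choices": [{"text": "Hello"}]}',
    'data: {"choices": [{"text": ", world"}]}',
    "data: [DONE]",
]
print(parse_sse_completion(sample))  # → (2, 'Hello, world')
```

For exact prompt-token counts, a non-streaming request's `usage.prompt_tokens` field is more reliable than the word-count heuristic in the benchmark.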
## Quantization Details
| Property | Value |
|---|---|
| Method | NVIDIA Model Optimizer (nvidia-modelopt v0.39.0) |
| Format | NVFP4 (E2M1, 4-bit float with two-level FP8 micro-block scaling) |
| Block size | 16 elements |
| Calibration | 256 samples (128 GSM8K math + 128 CNN DailyMail general text) |
| Calibration time | 78.8 min on 8x NVIDIA H100 80GB |
| Source precision | BF16 (~350 GB) |
| Quantized size | ~102 GB (3.4x compression) |
| Excluded layers | lm_head, all MoE gate layers (kept in full precision) |
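For intuition, here is a minimal fake-quantization sketch of the block format (illustrative only: it keeps block scales in full precision, whereas real NVFP4 stores each block scale as FP8 E4M3 under a per-tensor FP32 scale, and it ignores the packed bit layout entirely):

```python
import numpy as np

# Magnitudes representable by an E2M1 (4-bit float) value.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])  # signed value grid

def nvfp4_fake_quant(x, block=16):
    """Fake-quantize a 1-D array with per-16-element block scaling."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0  # map block amax to 6
    scale[scale == 0] = 1.0                             # guard all-zero blocks
    scaled = x / scale                                  # now within [-6, 6]
    idx = np.abs(scaled[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(64)
wq = nvfp4_fake_quant(w)
print(f"max abs error: {np.abs(w - wq).max():.3f}")
```

The small 16-element block keeps each scale close to its values' local range, which is what lets a 4-bit format track outlier-heavy weight distributions.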
## Accuracy (B200 Native NVFP4)
Evaluated with lm_eval + vLLM on NVIDIA B200 GPUs.
| Benchmark | BF16 | NVFP4 | Delta |
|---|---|---|---|
| GSM8K (CoT, 8-shot) | 90.07% | 89.61% | -0.46% |
| GPQA Diamond (0-shot) | 41.92% | 39.39% | -2.53% |
| IFEval (inst_level_loose) | 72.78% | 71.46% | -1.32% |
## Hardware Requirements
| Hardware | Context | Notes |
|---|---|---|
| 1x DGX Spark (GB10) | 128K | Native NVFP4 + FP8 KV cache, single request |
| 2x B200 192GB | 128K+ | Native NVFP4 with TP=2, concurrent requests |
## Base Model
bknyaz/Qwen3-235B-A22B-Instruct-2507-REAP is a 25% expert-pruned variant of Qwen3-235B-A22B-Instruct-2507, reducing MoE experts from 128 to 96 per layer (235B → 178B params) while retaining ≥99% of the original performance via REAP (Router-weighted Expert Activation Pruning).
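The idea can be sketched as follows (a toy illustration of router-weighted expert scoring, not the paper's exact saliency criterion or any code from the REAP release):

```python
import numpy as np

def reap_scores(gates, out_norms):
    """Toy saliency: average (router weight x expert output norm) over the
    tokens each expert actually serves. A zero gate means not routed there."""
    contrib = (gates * out_norms).sum(axis=0)        # (experts,)
    served = np.maximum((gates > 0).sum(axis=0), 1)  # tokens per expert
    return contrib / served

rng = np.random.default_rng(0)
tokens, experts, k = 1000, 8, 2

# Toy top-k routing: keep the 2 largest gate values per token, zero the rest.
gates = rng.random((tokens, experts))
kth = np.sort(gates, axis=1)[:, -k][:, None]
gates = np.where(gates >= kth, gates, 0.0)

out_norms = rng.random((tokens, experts)) + 0.5      # stand-in for ||f_j(x)||

scores = reap_scores(gates, out_norms)
n_prune = experts // 4                               # drop the lowest 25%
print("prune experts:", np.argsort(scores)[:n_prune].tolist())
```

Experts that are rarely routed to, or that contribute little when they are, score lowest and are the ones removed.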