Qwen3.6-35B-A3B-NVFP4-modelopt (candidate)

⚠️ Status: candidate / PTQ verified / all 4 serving paths blocked


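Summary (full English disclosure follows below)

This is not a model you can serve directly. On 2026-05-12, Qwen3.6-35B-A3B was quantized to NVFP4 with NVIDIA Model Optimizer 0.43 on an RTX PRO 6000 Workstation (sm_120, 96GB). PTQ ran end to end, and the 24GB safetensors file plus hf_quant_config.json (NVFP4 + FP8 KV + group_size=16) were emitted exactly as expected. The same night, four backend serving paths were tested, and all four were blocked by ecosystem version skew: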

| Backend | Failure point |
| --- | --- |
| SGLang dev-cu13 (main production image) | linear.py:858 load_merged_column_weight shape assert (fused QKV/gate_up does not align with the modelopt serialization) |
| SGLang v0.5.9 stable | image does not ship the modelopt package, and transformers 4.57 does not recognize the qwen3_5_moe model_type |
| vLLM aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 fork | KeyError: 'experts.w2_input_scale' (the loader expects fused scales; modelopt emits per-expert scales) |
| TRT-LLM 1.2.0/1.2.1 (nvcr.io/nvidia/tritonserver:26.03/26.04) | natively recognizes the modelopt config but pins transformers 4.57; upgrading to 5.8 sets off an import-error cascade (AutoModelForVision2Seq, then get_parameter_device), and the patching has no visible end |

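The root cause is not a single framework but a four-way intersection

qwen3_5_moe model_type (recognized only by transformers ≥ 5.2) × modelopt_fp4 serialization (as emitted by NVIDIA Model Optimizer 0.43, late 2025) × Blackwell sm_120/121 (in volume production only since 2024-2025) × multimodal wrapper (ConditionalGeneration / ForCausalLM dual entry points: modelopt rewrites the architecture once, and the backend loaders have not followed). All four dimensions fall into the ecosystem's respective blind spots at the same time.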

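How it can be used today

  • It cannot be served directly by SGLang/vLLM/TRT-LLM (status as of 2026-05-12)
  • It can be used to verify PTQ correctness through modelopt's own inference API (from modelopt.torch.utils.huggingface import dequantize_state_dict)
  • It can serve as a reference for comparison with other NVFP4 quantized artifacts (standard per-expert scale naming format)
  • Do not use it in production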

What to use in production

The compressed-tensors nvfp4-pack-quantized v8-RTN path: nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN, which has carried production traffic on SGLang dev-cu13 nightly for 2+ weeks.

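When to retry modelopt_fp4

Estimated 4-8 weeks (2026-06 to 2026-07), once any one of the following upstream changes lands:

  • SGLang stable ships the modelopt package and also upgrades to transformers ≥ 5.2
  • or the TRT-LLM 1.3 release moves its pin to transformers ≥ 5.2 (dropping deprecated calls such as AutoModelForVision2Seq / get_parameter_device)
  • or vLLM registers Qwen3_5MoeForCausalLM separately and its loader supports modelopt fused/per-expert scales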

14 specific traps + fix paths

See the ⚠️ 14 ecosystem traps table below. Any team taking this route will hit the same ecosystem gap; this is public disclosure, not a buried landmine.


English (full disclosure follows)

⚠️ Status: candidate / PTQ verified / backend smoke blocked

This is a rehearsal-stage artifact for the 2026-05-15 Lynn V4-Distill-Qwen public ship. The quantization step itself is end-to-end verified; serving/backend smoke has been attempted on R6000 sm_120 and Spark sm_121, and is currently blocked by backend loader support gaps documented below.

NVFP4 (NVIDIA Model Optimizer) quantized version of Qwen/Qwen3.6-35B-A3B, using nvidia-modelopt 0.43.0 + --qformat nvfp4_experts_only + --kv_cache_qformat fp8_cast via hf_ptq.py. Output format is modelopt_fp4 (HF-native), aligned with NVIDIA's own reference NVFP4 models such as nvidia/Qwen3.5-397B-A17B-NVFP4.

This is a deliberate departure from our earlier compressed-tensors nvfp4-pack-quantized v8-RTN ckpt (now marked legacy). Full route postmortem with 10 traps detailed: see companion zhihu post (link to be added once published).

⚠️ Three-state status disclosure

| Stage | Status | Detail |
| --- | --- | --- |
| 1. PTQ | end-to-end verified | hf_ptq.py ran to completion on R6000 sm_120; modelopt 0.43 + transformers 5.8 stack; 24GB model.safetensors + hf_quant_config.json + complete tokenizer/config emitted; integrity verified (offsets self-consistent, tensor count match, no None architectures) |
| 2. R6000 sm_120 backend smoke | blocked | sm_120 + cu130 ecosystem has not caught up: SGLang stable/nightly sgl_kernel ships only sm_90 + sm_100 (no sm_120/); vLLM 0.20.2 hard-requires flash-attn whose cu130 + torch 2.11 + py3.12 wheel does not yet exist on PyPI; vLLM falls back to Qwen3_5MoeForConditionalGeneration (only architecture registered for qwen3_5_moe) which force-initializes vision tower → triggers flash_attn.ops.triton.rotary import → ModuleNotFoundError. TODO: re-test once sgl_kernel sm_120 ships upstream (typically 2-4 weeks) OR flash-attn cu130 wheel released. |
| 3. Spark sm_121 + SGLang dev-cu13 smoke | blocked | SGLang dev-cu13 recognizes the checkpoint as modelopt_fp4, enters ModelOptModelLoader, detects nvfp4, auto-selects Blackwell fp4-gemm-backend=flashinfer_cudnn, then fails during weight load: qwen3_5.py load_weights -> linear.py weight_loader_v2 -> load_merged_column_weight -> assert param_data.shape == loaded_weight.shape. Restoring the original Qwen3_5MoeForConditionalGeneration architecture is required for SGLang to reach this point; the vLLM text-only Qwen3_5MoeForCausalLM patch is not compatible with SGLang. |

Do not assume this checkpoint serves correctly. The PTQ artifact integrity is sound; the current blockers are backend loader support for Qwen3.6-A3B modelopt_fp4 multimodal-MoE serialization and fused/merged projection shape mapping.
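The stage 1 integrity claim (offsets self-consistent, tensor count match) can be spot-checked without loading the 24GB shard. The following is a minimal sketch, assuming the standard safetensors layout: an 8-byte little-endian header length, a JSON header, then one contiguous data buffer. `check_safetensors_layout` is a hypothetical helper written for illustration; it is not part of modelopt or the safetensors library, and it is not necessarily the check that produced the result above.

```python
import json
import struct


def check_safetensors_layout(path):
    """Return (tensor_count, data_bytes) if the header offsets are self-consistent."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        header = json.loads(f.read(header_len))
        f.seek(0, 2)  # jump to end of file to measure its size
        file_size = f.tell()

    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    spans = sorted(tuple(entry["data_offsets"]) for entry in header.values())

    cursor = 0
    for begin, end in spans:
        # Each tensor must start exactly where the previous one ended.
        if begin != cursor or end < begin:
            raise ValueError(f"gap or overlap at byte {cursor}: next span is ({begin}, {end})")
        cursor = end

    expected_size = 8 + header_len + cursor
    if file_size != expected_size:
        raise ValueError(f"file is {file_size} bytes, header implies {expected_size}")
    return len(spans), cursor
```

Calling `check_safetensors_layout("model.safetensors")` reads only the header, so it finishes quickly even on a large shard. It says nothing about whether the tensor values themselves are correct.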

Quantization recipe

| Parameter | Value |
| --- | --- |
| Tool | nvidia-modelopt 0.43.0 (NVIDIA Model Optimizer) |
| Tool examples | NVIDIA/TensorRT-Model-Optimizer at tag 0.43.0 (matching PyPI release) |
| Scheme | NVFP4 (4-bit weights + 4-bit activations, group_size=16, fp8 per-group scale) |
| qformat | nvfp4_experts_only (MoE-aware: auto-ignore lm_head / mlp.gate / router / *visual* / *vision* / *encoder*) |
| KV cache | fp8_cast (NVIDIA default) |
| Calibration | cnn_dailymail only, 512 samples × 512 seq |
| Base model dtype | BF16 (dequantized from Qwen/Qwen3.6-35B-A3B-FP8 via engine/convert_fp8_to_bf16.py) |
| Hardware (PTQ) | NVIDIA RTX PRO 6000 Blackwell sm_120 / 96GB GDDR7 |
| transformers | 5.8.0 (pinned; modelopt[hf] default downgrades to 4.57 which breaks Qwen3.6; see traps) |
| Output | HF safetensors (24GB single shard) + hf_quant_config.json |
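To make the "group_size=16, per-group scale" scheme concrete, here is a simplified sketch of block quantization onto the FP4 E2M1 value grid. It is illustrative only and is not modelopt's implementation: the per-group scale is kept as a plain Python float instead of being rounded to FP8, and any additional per-tensor scaling is ignored. The helper names `quantize_group` and `dequantize_group` are hypothetical.

```python
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 E2M1
GROUP_SIZE = 16


def quantize_group(values):
    """Quantize one group of 16 floats; returns (scale, list of signed E2M1 values)."""
    if len(values) != GROUP_SIZE:
        raise ValueError("expected exactly 16 values per group")
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto 6.0, the largest E2M1 value.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        # Round the scaled magnitude to the nearest representable value.
        magnitude = min(E2M1_VALUES, key=lambda q: abs(abs(v) / scale - q))
        codes.append(magnitude if v >= 0 else -magnitude)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate floats from one quantized group."""
    return [scale * c for c in codes]
```

With one 8-bit scale shared by 16 four-bit values, storage works out to 4 + 8/16 = 4.5 bits per quantized weight, before counting the layers that `nvfp4_experts_only` leaves unquantized.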

Calibration caveat (vs NVIDIA reference)

NVIDIA's own NVFP4 reference models use dual calibration on cnn_dailymail + Nemotron-Post-Training-Dataset-v2. We used only cnn_dailymail (Nemotron is a gated dataset that requires accepting a license). The modelopt documentation states that calibration accuracy is robust across different choices of calibration data, but our long-tail token-distribution coverage is narrower than that of the NVIDIA reference. For the production Lynn V4-Distill ship (2026-05-15), we may use task-aligned calibration data (a Lynn V4 distill subset) for better in-domain quality.

⚠️ 14 ecosystem traps caught during rehearsal (5/15 ship pre-flight intel)

Documented for downstream users following the same path. Each one is a real first-touch risk for the 5/15 ship; the fix for each is listed in the last column:

| # | Trap | Where | Fix |
| --- | --- | --- | --- |
| 1 | pip install nvidia-modelopt[hf] force-downgrades transformers 5.x → 4.57.6 (deps pin <5); Qwen3.6 qwen3_5_moe model_type requires transformers ≥ 5.2 | install step | After modelopt install, force re-pin: pip install -U --no-deps transformers==5.8.0 huggingface_hub==1.14.0 |
| 2 | git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git times out (~130s) from mainland China | TRT-MO fetch | Use codeload tarball: curl -fsL -o /tmp/trtmo.tar.gz https://codeload.github.com/NVIDIA/TensorRT-Model-Optimizer/tar.gz/refs/tags/0.43.0. Note: extracted top-dir is Model-Optimizer-0.43.0 (without "TensorRT-" prefix) |
| 3 | TRT-MO main branch hf_ptq.py imports EagleOfflineDataCollator which doesn't exist in PyPI nvidia-modelopt 0.43.0 | main vs release tag mismatch | Pin TRT-MO checkout to the same release tag as installed modelopt (here tag/0.43.0); examples/llm_ptq/hf_ptq.py size differs (50KB at tag 0.43.0 vs 58KB at HEAD, no speculative decoding imports) |
| 4 | huggingface_hub.errors.RuntimeError: Cannot send a request, as the client has been closed. during dataset download (HF Hub network instability from mainland) | calibration data | export HF_ENDPOINT=https://hf-mirror.com before running hf_ptq.py |
| 5 | datasets.exceptions.DatasetNotFoundError: nvidia/Nemotron-Post-Training-Dataset-v2 is a gated dataset (HF Hub default hf_ptq.py calibration includes both cnn_dailymail and Nemotron-Post-Training-v2) | calibration data | Either accept Nemotron gated license on HF web UI + use HF_TOKEN, or pass --dataset cnn_dailymail to skip Nemotron entirely |
| 6 | Inline modelopt PTQ via mtq.quantize() + export_hf_checkpoint() fails with TypeError: 'NoneType' object is not iterable at is_multimodal_model (model loaded with AutoModelForCausalLM from multimodal-base sets architectures = None for export path) | export step | Use the hf_ptq.py CLI wrapper (sets architectures correctly) instead of writing inline modelopt API calls. CLI handles multimodal model class registration properly |
| 7 | SGLang stable v0.5.9 / nightly v0.5.11 sgl_kernel PyPI wheel ships only sm_90 and sm_100 precompiled binaries; the sm_120/ directory does not exist | serving on sm_120 (Blackwell) | Wait for upstream sm_120 binary release; or build sgl_kernel from source matching local CUDA+torch combination |
| 8 | vLLM 0.20.2 hard-requires flash_attn.ops.triton.rotary for Qwen3_VL vision tower rotary embedding initialization | serving via vLLM | Install flash-attn; however, no PyPI wheel exists for the torch 2.11 + cu130 + py3.12 combination at this time (PyPI search returns "no matching distribution") |
| 9 | vLLM 0.20.2 ModelRegistry registers only Qwen3_5MoeForConditionalGeneration for model_type=qwen3_5_moe. Patching config.json architectures = ["Qwen3_5MoeForCausalLM"] (a class that DOES exist in qwen3_5.py:556) has no effect: vLLM falls back to ConditionalGeneration which force-inits the vision tower regardless | serving on Blackwell with no flash-attn | Wait for either (a) sgl_kernel sm_120 binary, (b) vLLM upstream to register Qwen3_5MoeForCausalLM separately, (c) flash-attn cu130 wheel. None available 2026-05-11 |
| 10 | Spark sm_121 + SGLang dev-cu13 reaches ModelOptModelLoader and recognizes nvfp4, but fails in Qwen3.5 MoE weight loading with assert param_data.shape == loaded_weight.shape at load_merged_column_weight | serving this exact modelopt_fp4 artifact on Spark | Backend-side support gap: Qwen3.6-A3B modelopt serialization and SGLang's fused/merged projection parameter shapes do not currently align. Needs upstream loader fix or a modelopt export variant that matches SGLang's expected fused-column layout. |
| 11 | lmsysorg/sglang:v0.5.9 stable image does not ship modelopt Python package (unlike dev-cu13 which has it); transformers 4.57.1 in this image also does not register qwen3_5_moe model_type → cannot load this ckpt at all | serving via SGLang stable v0.5.9 | Wait for SGLang stable to bundle modelopt + a transformers release that maps qwen3_5_moe. As of 2026-05-12 neither is true in any tagged SGLang release. |
| 12 | nvcr.io/nvidia/tritonserver:26.03/26.04-trtllm-python-py3 ships TRT-LLM 1.2.0/1.2.1 + transformers 4.57.3 (hard-pinned). TRT-LLM successfully reads hf_quant_config.json and recognizes nvfp4 + fp8 kv + group_size=16, but the executor worker dies with KeyError: 'qwen3_5_moe' from transformers.models.auto.configuration_auto.CONFIG_MAPPING_NAMES. Force-upgrading to transformers 5.8.0 (via aliyun pypi mirror) makes qwen3_5_moe known but cascades into ImportError: cannot import name 'AutoModelForVision2Seq' (renamed → AutoModelForImageTextToText in transformers 5.x), then after sed-patching that into ImportError: cannot import name 'get_parameter_device' from 'transformers.modeling_utils', with several more API renames likely behind it. | serving via TRT-LLM 1.2.x | Patch-cascade is bottomless; not a 5/15-ship path. Wait for TRT-LLM 1.3 release that pins to a transformers ≥ 5.2 (covers qwen3_5_moe) without the legacy AutoModelForVision2Seq / get_parameter_device API surface. |
| 13 | Docker restart=always (sglang-35b-fp8) and restart=unless-stopped (elyza-nvllm) policies cause production containers to auto-resurrect after docker stop, immediately taking GPU memory back from the fp4 test container → CUDA stream OOM in TRT-LLM with Free memory on device cuda:0 (82.6/119.63 GiB) on startup is less than desired GPU memory utilization (0.7, 83.74 GiB) | test runner that needs to free GPU on Spark | Pre-test: docker update --restart=no <container> for each production container before docker stop. Post-test (restore): docker start <container> && docker update --restart=<original-policy>. Otherwise daemon-level policy beats stop. |
| 14 | lmsysorg/sglang:dev-cu13 ships TRT-LLM-style Qwen3_5MoeForConditionalGeneration registered but not Qwen3_5MoeForCausalLM (which modelopt rewrites the architecture to). Patching config.json architectures back to the original ConditionalGeneration is what lets SGLang reach the deeper load_merged_column_weight shape assert (Trap #10). | configuration boundary between modelopt export & SGLang | After modelopt PTQ, do not keep modelopt's architectures = ["Qwen3_5MoeForCausalLM"] rewrite if targeting SGLang dev-cu13; revert to original Qwen3_5MoeForConditionalGeneration. |
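The restart-policy workaround in Trap #13 is easy to get wrong by hand, so it can help to generate the command sequence up front. Below is a minimal sketch, assuming the container names and policies quoted in the trap; `restart_policy_plan` and `run` are hypothetical helpers, and only the `docker update --restart=...`, `docker stop`, and `docker start` commands come from the text.

```python
import subprocess


def restart_policy_plan(containers):
    """Build pre-test and post-test docker commands.

    containers: dict mapping container name -> its original restart policy.
    Returns (pre_test_commands, post_test_commands) as lists of argv lists.
    """
    pre, post = [], []
    for name, policy in containers.items():
        # Disable the restart policy first, then stop the container.
        pre.append(["docker", "update", "--restart=no", name])
        pre.append(["docker", "stop", name])
        # After the test: start the container, then restore its policy.
        post.append(["docker", "start", name])
        post.append(["docker", "update", f"--restart={policy}", name])
    return pre, post


def run(commands):
    """Execute each command in order, stopping on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    production = {"sglang-35b-fp8": "always", "elyza-nvllm": "unless-stopped"}
    pre, post = restart_policy_plan(production)
    for argv in pre + post:
        print(" ".join(argv))  # dry run: print the plan instead of executing it
```

The `__main__` block only prints the plan. Calling `run(pre)` before the test and `run(post)` afterwards would execute it, which is best wrapped in `try`/`finally` so the production containers are restored even if the test crashes.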

Bottom line: 4 serving routes were tested end-to-end (SGLang dev-cu13, SGLang v0.5.9 stable, vLLM aeon-7 fork, TRT-LLM 1.2.0/1.2.1 in tritonserver 26.03/26.04). All 4 are blocked by version skew between the Qwen3.6 qwen3_5_moe model_type, transformers (4.57 vs 5.8 API), the modelopt scale-tensor layout, and SGLang/vLLM fused QKV+gate_up loader expectations. The PTQ artifact itself is sound; production serving on Blackwell requires the upstream pieces to land together: SGLang stable + modelopt + qwen3_5_moe-aware transformers, or TRT-LLM 1.3 pinned to transformers ≥ 5.2. ETA: estimated 4-8 weeks from 2026-05-12.
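Because several of these blockers reduce to which transformers version a given image carries, a quick pre-flight check can save a failed launch. This is a minimal sketch, assuming only the version threshold stated in this card (qwen3_5_moe requires transformers ≥ 5.2); the helper names are hypothetical, and passing the check does not imply that a backend can load the checkpoint.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = (5, 2)  # threshold taken from this card, not queried from the library


def supports_qwen3_5_moe(version):
    """True if a transformers version string is at or above the 5.2 threshold."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= MIN_TRANSFORMERS


def installed_transformers_ok():
    """Check the transformers package installed in the current environment."""
    try:
        return supports_qwen3_5_moe(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("transformers new enough for qwen3_5_moe:", installed_transformers_ok())
```

Running this inside a serving container, for example right after the `pip install nvidia-modelopt[hf]` step from Trap #1, shows whether the downgrade to 4.57 has happened.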

For the 2026-05-15 Lynn V4-Distill ship, the quantization that ships to production is the predecessor compressed-tensors v8-RTN ckpt (nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN), which has been verified end-to-end on Spark dev-cu13 nightly and is currently serving production traffic on the sglang-35b-fp8 slot. modelopt_fp4 will be re-attempted once the upstream gap closes.

Spark sm_121 reference perf (compressed-tensors v8-RTN, for comparison)

The compressed-tensors v8-RTN predecessor of this checkpoint (same Qwen3.6-35B-A3B base, same NVFP4 algorithm, different serialization format) was perf-tested on NVIDIA Spark GB10 sm_121 via SGLang dev-cu13 docker (data published in nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN model card):

| Scenario | NVFP4 v8-RTN + MTP | Note |
| --- | --- | --- |
| short single | 103.2 chars/s | sm_121 reference |
| N=4 agg | 309.7 chars/s | sm_121 reference |
| N=8 agg | 499.0 chars/s | sm_121 reference |
| N=16 agg | 756.0 chars/s | sm_121 reference |

These numbers are NOT for this checkpoint — they are for the v8-RTN predecessor in compressed-tensors format. For modelopt_fp4, Spark smoke currently fails before serving due to Trap #10.
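For reading the v8-RTN reference figures, it can help to convert aggregate throughput into per-stream throughput and speedup over the single-stream case. The sketch below does only that arithmetic on the four published numbers; treating "short single" as N=1 is an assumption, and `scaling_report` is a hypothetical helper.

```python
# Aggregate chars/s from the v8-RTN reference table; "short single" is treated as N=1.
AGGREGATE_CHARS_PER_S = {1: 103.2, 4: 309.7, 8: 499.0, 16: 756.0}


def scaling_report(aggregate):
    """Return rows of (concurrency, per_stream_chars_per_s, speedup_vs_single)."""
    single = aggregate[1]
    rows = []
    for n, agg in sorted(aggregate.items()):
        rows.append((n, agg / n, agg / single))
    return rows


if __name__ == "__main__":
    for n, per_stream, speedup in scaling_report(AGGREGATE_CHARS_PER_S):
        print(f"N={n:>2}  per-stream {per_stream:6.1f} chars/s  speedup {speedup:4.2f}x")
```

At N=16, for instance, each stream gets 756.0 / 16 = 47.25 chars/s, while the aggregate is roughly 7.3 times the single-stream figure.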

| Scenario | modelopt_fp4 (this ckpt) on Spark sm_121 | Status |
| --- | --- | --- |
| short single | N/A | blocked at SGLang weight load |
| N=4 agg | N/A | blocked at SGLang weight load |
| N=8 agg | N/A | blocked at SGLang weight load |
| N=16 agg | N/A | blocked at SGLang weight load |

Files

config.json                    # architectures: [Qwen3_5MoeForConditionalGeneration] for SGLang; CausalLM patch was vLLM-only and is not SGLang-compatible
chat_template.jinja
generation_config.json
hf_quant_config.json           # modelopt quantization metadata
model.safetensors              # 24GB single shard, NVFP4 packed
processor_config.json
tokenizer.json
tokenizer_config.json
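The architectures value in config.json is the one field that differs between the SGLang and vLLM experiments (Traps #9 and #14). Below is a minimal sketch of switching it in place, assuming a standard Hugging Face config.json with a top-level `architectures` list; `set_architecture` is a hypothetical helper, and changing this field does not by itself make the checkpoint loadable.

```python
import json

SGLANG_ARCH = "Qwen3_5MoeForConditionalGeneration"  # original architecture, needed by SGLang dev-cu13
TEXT_ONLY_ARCH = "Qwen3_5MoeForCausalLM"            # value modelopt rewrites the field to


def set_architecture(config_path, architecture=SGLANG_ARCH):
    """Rewrite config.json's architectures field; returns the previous value."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    previous = config.get("architectures")
    config["architectures"] = [architecture]
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
        f.write("\n")
    return previous
```

Keeping the returned previous value makes it straightforward to restore the file after an experiment, for example with `set_architecture(path, previous[0])`.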

Recommended deployment (once smoke verifies)

# SGLang stable v0.5.9 path (NV-blessed)
docker run --gpus all -v /path/to/ckpt:/model lmsysorg/sglang:v0.5.9 \
  python3 -m sglang.launch_server --model /model \
    --quantization modelopt_fp4 \
    --tensor-parallel-size 1 \
    --trust-remote-code

# vLLM ≥ 0.17 path (once flash-attn ecosystem catches up)
vllm serve /path/to/ckpt \
  --quantization modelopt_fp4 \
  --reasoning-parser qwen3

Citation / acknowledgments

Status update history

  • 2026-05-11: PTQ verified on R6000 sm_120; 9 ecosystem traps documented; sm_120/cu130 backend ecosystem gating R6000 serve verification.
  • 2026-05-12: Spark sm_121 + SGLang dev-cu13 smoke attempted. Backend reaches ModelOptModelLoader and detects nvfp4, then fails at SGLang load_merged_column_weight shape assertion. Trap #10 added; checkpoint remains candidate / not serving-verified.