- Qwen3.6-35B-A3B-NVFP4-modelopt(候选版 / candidate)
- 🇨🇳 中文说明(英文版在下方)
- English (full disclosure follows)
- ⚠️ Three-state status disclosure
- Quantization recipe
- ⚠️ 10 ecosystem traps caught during rehearsal (5/15 ship pre-flight intel)
- Spark sm_121 reference perf (compressed-tensors v8-RTN, for comparison)
- Files
- Recommended deployment (once smoke verifies)
- Citation / acknowledgments
- Status update history
Qwen3.6-35B-A3B-NVFP4-modelopt(候选版 / candidate)
⚠️ 状态:候选 / PTQ 完成 / 4 路径 serving 全 blocked —— Status: candidate / PTQ verified / 4 serving paths all blocked
🇨🇳 中文说明(英文版在下方)
这不是一个能直接 serve 的模型。 2026-05-12 在 RTX PRO 6000 Workstation(sm_120,96GB)上用 NVIDIA Model Optimizer 0.43 完成了 Qwen3.6-35B-A3B 的 NVFP4 量化,PTQ 端到端通,24GB safetensors + hf_quant_config.json(NVFP4 + FP8 KV + group_size=16)精确就位。但当晚验证了 4 条 backend serving 通路,全被生态版本差卡死:
| Backend | 失败点 |
|---|---|
SGLang dev-cu13(production 主镜像) |
linear.py:858 load_merged_column_weight shape assert(fused QKV/gate_up 跟 modelopt 序列化不对齐) |
SGLang v0.5.9 stable |
image 不带 modelopt 包 + transformers 4.57 不认 qwen3_5_moe model_type |
vLLM aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 fork |
KeyError: 'experts.w2_input_scale'(loader 期望融合 scale,modelopt 输出 per-expert) |
TRT-LLM 1.2.0/1.2.1(nvcr.io/nvidia/tritonserver:26.03/26.04) |
原生认 modelopt config 但 pin transformers 4.57,升 5.8 引发 AutoModelForVision2Seq、get_parameter_device 一连串 import cascade,patch 无底洞 |
根因不是单个框架,是四维交叉
qwen3_5_moe model_type(transformers ≥ 5.2 才认) × modelopt_fp4 序列化(2025-末 NVIDIA Model Optimizer 0.43 输出) × Blackwell sm_120/121(2024-2025 才量产) × 多模态 wrapper(ConditionalGeneration ↔ ForCausalLM 双 entry,modelopt 改写一次,backend loader 没跟上)—— 四个维度同时落进生态各自的盲区。
现在能怎么用
- 不能直接给 SGLang/vLLM/TRT-LLM serve(2026-05-12 状态)
- 可以用
modelopt自己的 inference API 验证 PTQ 正确性(from modelopt.torch.utils.huggingface import dequantize_state_dict) - 可以作为 reference 跟其他 NVFP4 量化产物对照(per-expert scale 命名标准格式)
- 不要用在生产
生产环境用什么
compressed-tensors nvfp4-pack-quantized v8-RTN 路径:nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN — 已在 SGLang dev-cu13 nightly 上 production 流量 2+ 周。
什么时候回试 modelopt_fp4
预估 4-8 周(2026-06 → 07),等下面任一上游对齐:
- SGLang stable 同时装
modelopt包 + 升 transformers ≥ 5.2 - 或 TRT-LLM 1.3 release 改 pin 到 transformers ≥ 5.2(放弃
AutoModelForVision2Seq/get_parameter_device等 deprecated 调用) - 或 vLLM 单独注册
Qwen3_5MoeForCausalLM+ loader 支持 modelopt fused/per-expert scale
14 个具体坑 + 修复路径
详见下方 ## ⚠️ 14 ecosystem traps 表。任何团队走这条路都会撞到同一片生态空档,这是公开 disclose 而不是埋雷。
English (full disclosure follows)
⚠️ Status: candidate / PTQ verified / backend smoke blocked
This is a rehearsal-stage artifact for the 2026-05-15 Lynn V4-Distill-Qwen public ship. The quantization step itself is end-to-end verified; serving/backend smoke has been attempted on R6000 sm_120 and Spark sm_121, and is currently blocked by backend loader support gaps documented below.
NVFP4 (NVIDIA Model Optimizer) quantized version of Qwen/Qwen3.6-35B-A3B, using nvidia-modelopt 0.43.0 + --qformat nvfp4_experts_only + --kv_cache_qformat fp8_cast via hf_ptq.py. Output format is modelopt_fp4 (HF-native), aligned with NVIDIA's own reference NVFP4 models such as nvidia/Qwen3.5-397B-A17B-NVFP4.
This is a deliberate departure from our earlier compressed-tensors nvfp4-pack-quantized v8-RTN ckpt (now marked legacy). Full route postmortem with 10 traps detailed: see companion zhihu post (link to be added once published).
⚠️ Three-state status disclosure
| Stage | Status | Detail |
|---|---|---|
| 1. PTQ end-to-end | ✅ verified | hf_ptq.py ran to completion on R6000 sm_120; modelopt 0.43 + transformers 5.8 stack; 24GB model.safetensors + hf_quant_config.json + complete tokenizer/config emitted; integrity verified (offsets self-consistent, tensor count match, no None architectures) |
| 2. R6000 sm_120 backend smoke | ❌ blocked | sm_120 + cu130 ecosystem has not caught up: SGLang stable/nightly sgl_kernel ships only sm_90 + sm_100 (no sm_120/); vLLM 0.20.2 hard-requires flash-attn whose cu130 + torch 2.11 + py3.12 wheel does not yet exist on PyPI; vLLM falls back to Qwen3_5MoeForConditionalGeneration (only architecture registered for qwen3_5_moe) which force-initializes vision tower → triggers flash_attn.ops.triton.rotary import → ModuleNotFoundError. TODO: re-test once sgl_kernel sm_120 ships upstream (typically 2-4 weeks) OR flash-attn cu130 wheel released. |
| 3. Spark sm_121 + SGLang dev-cu13 smoke | ❌ blocked | SGLang dev-cu13 recognizes the checkpoint as modelopt_fp4, enters ModelOptModelLoader, detects nvfp4, auto-selects Blackwell fp4-gemm-backend=flashinfer_cudnn, then fails during weight load: qwen3_5.py load_weights -> linear.py weight_loader_v2 -> load_merged_column_weight -> assert param_data.shape == loaded_weight.shape. Restoring the original Qwen3_5MoeForConditionalGeneration architecture is required for SGLang to reach this point; the vLLM text-only Qwen3_5MoeForCausalLM patch is not compatible with SGLang. |
Do not assume this checkpoint serves correctly. The PTQ artifact integrity is sound; the current blockers are backend loader support for Qwen3.6-A3B modelopt_fp4 multimodal-MoE serialization and fused/merged projection shape mapping.
Quantization recipe
| Parameter | Value |
|---|---|
| Tool | nvidia-modelopt 0.43.0 (NVIDIA Model Optimizer) |
| Tool examples | NVIDIA/TensorRT-Model-Optimizer at tag 0.43.0 (matching PyPI release) |
| Scheme | NVFP4 (4-bit weights + 4-bit activations, group_size=16, fp8 per-group scale) |
| qformat | nvfp4_experts_only (MoE-aware: auto-ignore lm_head / mlp.gate / router / *visual* / *vision* / *encoder*) |
| KV cache | fp8_cast (NVIDIA default) |
| Calibration | cnn_dailymail only, 512 samples × 512 seq |
| Base model dtype | BF16 (dequantized from Qwen/Qwen3.6-35B-A3B-FP8 via engine/convert_fp8_to_bf16.py) |
| Hardware (PTQ) | NVIDIA RTX PRO 6000 Blackwell sm_120 / 96GB GDDR7 |
| transformers | 5.8.0 (pinned; modelopt[hf] default downgrades to 4.57 which breaks Qwen3.6 — see traps) |
| Output | HF safetensors (24GB single shard) + hf_quant_config.json |
Calibration caveat (vs NVIDIA reference)
NVIDIA's own NVFP4 reference models use cnn_dailymail + Nemotron-Post-Training-Dataset-v2 dual calibration. We used only cnn_dailymail (Nemotron is a gated dataset requiring license accept). modelopt documentation states calibration accuracy is robust across different choices of calibration data, but the long-tail token-distribution coverage is narrower than NV reference. For a production Lynn V4-Distill ship (2026-05-15), we may use task-aligned calibration data (Lynn V4 distill subset) for better in-domain quality.
⚠️ 10 ecosystem traps caught during rehearsal (5/15 ship pre-flight intel)
Documented for downstream users following the same path. Each is a real 5/15 first-touch risk; fixes here:
| # | Trap | Where | Fix |
|---|---|---|---|
| 1 | pip install nvidia-modelopt[hf] force-downgrades transformers 5.x → 4.57.6 (deps pin <5); Qwen3.6 qwen3_5_moe model_type requires transformers ≥ 5.2 |
install step | After modelopt install, force re-pin: pip install -U --no-deps transformers==5.8.0 huggingface_hub==1.14.0 |
| 2 | git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git times out (~130s) from mainland China |
TRT-MO fetch | Use codeload tarball: curl -fsL -o /tmp/trtmo.tar.gz https://codeload.github.com/NVIDIA/TensorRT-Model-Optimizer/tar.gz/refs/tags/0.43.0. Note: extracted top-dir is Model-Optimizer-0.43.0 (without "TensorRT-" prefix) |
| 3 | TRT-MO main branch hf_ptq.py imports EagleOfflineDataCollator which doesn't exist in PyPI nvidia-modelopt 0.43.0 |
main vs release tag mismatch |
Pin TRT-MO checkout to the same release tag as installed modelopt (here tag/0.43.0) — examples/llm_ptq/hf_ptq.py size differs (50KB at tag 0.43.0 vs 58KB at HEAD, no speculative decoding imports) |
| 4 | huggingface_hub.errors.RuntimeError: Cannot send a request, as the client has been closed. during dataset download (HF Hub network instability from mainland) |
calibration data | export HF_ENDPOINT=https://hf-mirror.com before running hf_ptq.py |
| 5 | datasets.exceptions.DatasetNotFoundError: nvidia/Nemotron-Post-Training-Dataset-v2 is a gated dataset (HF Hub default hf_ptq.py calibration includes both cnn_dailymail and Nemotron-Post-Training-v2) |
calibration data | Either accept Nemotron gated license on HF web UI + use HF_TOKEN, or pass --dataset cnn_dailymail to skip Nemotron entirely |
| 6 | Inline modelopt PTQ via mtq.quantize() + export_hf_checkpoint() fails with TypeError: 'NoneType' object is not iterable at is_multimodal_model (model loaded with AutoModelForCausalLM from multimodal-base sets architectures = None for export path) |
export step | Use the hf_ptq.py CLI wrapper (sets architectures correctly) instead of writing inline modelopt API calls. CLI handles multimodal model class registration properly |
| 7 | SGLang stable v0.5.9 / nightly v0.5.11 sgl_kernel PyPI wheel ships only sm_90 and sm_100 precompiled binaries — sm_120/ directory does not exist |
serving on sm_120 (Blackwell) | Wait for upstream sm_120 binary release; or build sgl_kernel from source matching local CUDA+torch combination |
| 8 | vLLM 0.20.2 hard-requires flash_attn.ops.triton.rotary for Qwen3_VL vision tower rotary embedding initialization |
serving via vLLM | Install flash-attn — but no PyPI wheel exists for torch 2.11 + cu130 + py3.12 combination at this time (PyPI search returns "no matching distribution") |
| 9 | vLLM 0.20.2 ModelRegistry registers only Qwen3_5MoeForConditionalGeneration for model_type=qwen3_5_moe. Patching config.json architectures = ["Qwen3_5MoeForCausalLM"] (a class that DOES exist in qwen3_5.py:556) has no effect — vLLM falls back to ConditionalGeneration which force-inits the vision tower regardless |
serving on Blackwell with no flash-attn | Wait for either (a) sgl_kernel sm_120 binary, (b) vLLM upstream to register Qwen3_5MoeForCausalLM separately, (c) flash-attn cu130 wheel. None available 2026-05-11 |
| 10 | Spark sm_121 + SGLang dev-cu13 reaches ModelOptModelLoader and recognizes nvfp4, but fails in Qwen3.5 MoE weight loading with assert param_data.shape == loaded_weight.shape at load_merged_column_weight |
serving this exact modelopt_fp4 artifact on Spark |
Backend-side support gap: Qwen3.6-A3B modelopt serialization and SGLang's fused/merged projection parameter shapes do not currently align. Needs upstream loader fix or a modelopt export variant that matches SGLang's expected fused-column layout. |
| 11 | lmsysorg/sglang:v0.5.9 stable image does not ship modelopt Python package (unlike dev-cu13 which has it); transformers 4.57.1 in this image also does not register qwen3_5_moe model_type → cannot load this ckpt at all |
serving via SGLang stable v0.5.9 | Wait for SGLang stable to bundle modelopt + a transformers release that maps qwen3_5_moe. As of 2026-05-12 neither is true in any tagged SGLang release. |
| 12 | nvcr.io/nvidia/tritonserver:26.03/26.04-trtllm-python-py3 ships TRT-LLM 1.2.0/1.2.1 + transformers 4.57.3 (hard-pinned). TRT-LLM successfully reads hf_quant_config.json and recognizes nvfp4 + fp8 kv + group_size=16, but the executor worker dies with KeyError: 'qwen3_5_moe' from transformers.models.auto.configuration_auto.CONFIG_MAPPING_NAMES. Force-upgrading to transformers 5.8.0 (via aliyun pypi mirror) makes qwen3_5_moe known but cascades into ImportError: cannot import name 'AutoModelForVision2Seq' (renamed → AutoModelForImageTextToText in transformers 5.x), then after sed-patching that into ImportError: cannot import name 'get_parameter_device' from 'transformers.modeling_utils', with several more API renames likely behind it. |
serving via TRT-LLM 1.2.x | Patch-cascade is bottomless; not a 5/15-ship path. Wait for TRT-LLM 1.3 release that pins to a transformers ≥ 5.2 (covers qwen3_5_moe) without the legacy AutoModelForVision2Seq / get_parameter_device API surface. |
| 13 | Docker restart=always (sglang-35b-fp8) and restart=unless-stopped (elyza-nvllm) policies cause production containers to auto-resurrect after docker stop, immediately taking GPU memory back from the fp4 test container → CUDA stream OOM in TRT-LLM with Free memory on device cuda:0 (82.6/119.63 GiB) on startup is less than desired GPU memory utilization (0.7, 83.74 GiB) |
test runner that needs to free GPU on Spark | Pre-test: docker update --restart=no <container> for each production container before docker stop. Post-test (restore): docker start <container> && docker update --restart=<original-policy>. Otherwise daemon-level policy beats stop. |
| 14 | lmsysorg/sglang:dev-cu13 ships TRT-LLM-style Qwen3_5MoeForConditionalGeneration registered but not Qwen3_5MoeForCausalLM (which modelopt rewrites the architecture to). Patching config.json architectures back to the original ConditionalGeneration is what lets SGLang reach the deeper load_merged_column_weight shape assert (Trap #10). |
configuration boundary between modelopt export & SGLang | After modelopt PTQ, do not keep modelopt's architectures = ["Qwen3_5MoeForCausalLM"] rewrite if targeting SGLang dev-cu13; revert to original Qwen3_5MoeForConditionalGeneration. |
Bottom line: 4 serving routes validated end-to-end (SGLang dev-cu13, SGLang v0.5.9 stable, vLLM aeon-7 fork, TRT-LLM 1.2.0/1.2.1 in tritonserver 26.03/26.04). All 4 are blocked by version-skew between Qwen3.6 qwen3_5_moe model_type, transformers (4.57 vs 5.8 API), modelopt scale-tensor layout, and SGLang/vLLM fused QKV+gate_up loader expectations. The PTQ artifact itself is sound; production serving on Blackwell requires the upstream ecosystem to land simultaneously: SGLang stable + modelopt + qwen3_5_moe-aware transformers, or TRT-LLM 1.3 with transformers ≥ 5.2 pinned. ETA: estimated 4–8 weeks from 2026-05-12.
For the 2026-05-15 Lynn V4-Distill ship, the production-shipped quantization is the predecessor compressed-tensors v8-RTN ckpt (nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN), which has been verified end-to-end on Spark dev-cu13 nightly and is currently in production traffic on sglang-35b-fp8 slot. modelopt_fp4 will re-attempt once the upstream gap closes.
Spark sm_121 reference perf (compressed-tensors v8-RTN, for comparison)
The compressed-tensors v8-RTN predecessor of this checkpoint (same Qwen3.6-35B-A3B base, same NVFP4 algorithm, different serialization format) was perf-tested on NVIDIA Spark GB10 sm_121 via SGLang dev-cu13 docker (data published in nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN model card):
| Scenario | NVFP4 v8-RTN + MTP | Note |
|---|---|---|
| short single | 103.2 chars/s | sm_121 reference |
| N=4 agg | 309.7 chars/s | sm_121 reference |
| N=8 agg | 499.0 chars/s | sm_121 reference |
| N=16 agg | 756.0 chars/s | sm_121 reference |
These numbers are NOT for this checkpoint — they are for the v8-RTN predecessor in compressed-tensors format. For modelopt_fp4, Spark smoke currently fails before serving due to Trap #10.
| Scenario | modelopt_fp4 (this ckpt) on Spark sm_121 | Status |
|---|---|---|
| short single | N/A | blocked at SGLang weight load |
| N=4 agg | N/A | blocked at SGLang weight load |
| N=8 agg | N/A | blocked at SGLang weight load |
| N=16 agg | N/A | blocked at SGLang weight load |
Files
config.json # architectures: [Qwen3_5MoeForConditionalGeneration] for SGLang; CausalLM patch was vLLM-only and is not SGLang-compatible
chat_template.jinja
generation_config.json
hf_quant_config.json # modelopt quantization metadata
model.safetensors # 24GB single shard, NVFP4 packed
processor_config.json
tokenizer.json
tokenizer_config.json
Recommended deployment (once smoke verifies)
# SGLang stable v0.5.9 path (NV-blessed)
docker run --gpus all -v /path/to/ckpt:/model lmsysorg/sglang:v0.5.9 \
python3 -m sglang.launch_server --model /model \
--quantization modelopt_fp4 \
--tensor-parallel-size 1 \
--trust-remote-code
# vLLM ≥ 0.17 path (once flash-attn ecosystem catches up)
vllm serve /path/to/ckpt \
--quantization modelopt_fp4 \
--reasoning-parser qwen3
Citation / acknowledgments
- Base model:
Qwen/Qwen3.6-35B-A3B - FP8 reference for vision weights borrowed:
Qwen/Qwen3.6-35B-A3B-FP8 - Quantization tool:
nvidia-modelopt≥ 0.42 - Reference NVFP4 model (same route):
nvidia/Qwen3.5-397B-A17B-NVFP4 - Companion zhihu route-postmortem (link pending publish)
- Legacy compressed-tensors variant:
nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN
Status update history
- 2026-05-11: PTQ verified on R6000 sm_120; 9 ecosystem traps documented; sm_120/cu130 backend ecosystem gating R6000 serve verification.
- 2026-05-12: Spark sm_121 + SGLang dev-cu13 smoke attempted. Backend reaches
ModelOptModelLoaderand detectsnvfp4, then fails at SGLangload_merged_column_weightshape assertion. Trap #10 added; checkpoint remains candidate / not serving-verified.
- Downloads last month
- 156
Model tree for nerkyor/Qwen3.6-35B-A3B-NVFP4-modelopt
Base model
Qwen/Qwen3.6-35B-A3B