Qwen3.6-35B-A3B-NVFP4-modelopt (candidate)

⚠️ Status: candidate / PTQ verified / 4 serving paths all blocked

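Summary (translated from the Chinese original; the full English disclosure follows below)

This is not a model you can serve directly. On 2026-05-12, on an RTX PRO 6000 Workstation (sm_120, 96GB), NVIDIA Model Optimizer 0.43 was used to quantize Qwen3.6-35B-A3B to NVFP4. PTQ ran end to end, and the 24GB safetensors plus hf_quant_config.json (NVFP4 + FP8 KV + group_size=16) were emitted exactly as expected. That same night, four backend serving paths were tested, and all four were blocked by version skew across the ecosystem: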

Backend	Failure point
SGLang `dev-cu13` (main production image)	`linear.py:858 load_merged_column_weight` shape assert (fused QKV/gate_up does not align with modelopt serialization)
SGLang `v0.5.9` stable	image does not ship the `modelopt` package, and `transformers 4.57` does not recognize the `qwen3_5_moe` model_type
vLLM `aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3` fork	`KeyError: 'experts.w2_input_scale'` (loader expects fused scales, modelopt emits per-expert scales)
TRT-LLM 1.2.0/1.2.1 (`nvcr.io/nvidia/tritonserver:26.03/26.04`)	natively recognizes the modelopt config but pins `transformers 4.57`; upgrading to 5.8 triggers an import cascade (`AutoModelForVision2Seq`, `get_parameter_device`, and more), and the patching has no visible end

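The root cause is not a single framework but a four-way intersection

qwen3_5_moe model_type (recognized only by transformers ≥ 5.2) × modelopt_fp4 serialization (as emitted by NVIDIA Model Optimizer 0.43, late 2025) × Blackwell sm_120/121 (in volume production only since 2024-2025) × multimodal wrapper (ConditionalGeneration ↔ ForCausalLM dual entry points; modelopt rewrites the architecture once and the backend loaders have not caught up). All four dimensions fall into a blind spot of some part of the ecosystem at the same time.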

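How it can be used today

Cannot be served directly by SGLang/vLLM/TRT-LLM (status as of 2026-05-12)
Can be used to verify PTQ correctness through modelopt's own inference API (from modelopt.torch.utils.huggingface import dequantize_state_dict)
Can serve as a reference for comparison with other NVFP4 quantization artifacts (standard per-expert scale naming format)
Do not use in production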

What to use in production

The compressed-tensors nvfp4-pack-quantized v8-RTN route: nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN, which has carried production traffic on SGLang dev-cu13 nightly for 2+ weeks.

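When to retry modelopt_fp4

Estimated 4-8 weeks (2026-06 to 2026-07), once any one of the following upstream changes lands:

SGLang stable ships the modelopt package and upgrades to transformers ≥ 5.2
or the TRT-LLM 1.3 release moves its pin to transformers ≥ 5.2 (dropping deprecated calls such as AutoModelForVision2Seq / get_parameter_device)
or vLLM registers Qwen3_5MoeForCausalLM separately and its loader supports modelopt fused/per-expert scales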

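14 specific traps + fix paths

See the ecosystem traps table below (14 entries). Any team taking this route will hit the same ecosystem gap; this is public disclosure, not a hidden landmine.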

English (full disclosure follows)

⚠️ Status: candidate / PTQ verified / backend smoke blocked

This is a rehearsal-stage artifact for the 2026-05-15 Lynn V4-Distill-Qwen public ship. The quantization step itself is end-to-end verified; serving/backend smoke has been attempted on R6000 sm_120 and Spark sm_121, and is currently blocked by backend loader support gaps documented below.

NVFP4 (NVIDIA Model Optimizer) quantized version of Qwen/Qwen3.6-35B-A3B, using nvidia-modelopt 0.43.0 + --qformat nvfp4_experts_only + --kv_cache_qformat fp8_cast via hf_ptq.py. Output format is modelopt_fp4 (HF-native), aligned with NVIDIA's own reference NVFP4 models such as nvidia/Qwen3.5-397B-A17B-NVFP4.

This is a deliberate departure from our earlier compressed-tensors nvfp4-pack-quantized v8-RTN ckpt (now marked legacy). Full route postmortem with 10 traps detailed: see companion zhihu post (link to be added once published).

⚠️ Three-state status disclosure

Stage	Status	Detail
1. PTQ end-to-end	✅ verified	`hf_ptq.py` ran to completion on R6000 sm_120; modelopt 0.43 + transformers 5.8 stack; 24GB `model.safetensors` + `hf_quant_config.json` + complete tokenizer/config emitted; integrity verified (offsets self-consistent, tensor count match, no None architectures)
2. R6000 sm_120 backend smoke	❌ blocked	sm_120 + cu130 ecosystem has not caught up: SGLang stable/nightly `sgl_kernel` ships only `sm_90 + sm_100` (no `sm_120/`); vLLM 0.20.2 hard-requires `flash-attn` whose `cu130 + torch 2.11 + py3.12` wheel does not yet exist on PyPI; vLLM falls back to `Qwen3_5MoeForConditionalGeneration` (only architecture registered for `qwen3_5_moe`) which force-initializes vision tower → triggers `flash_attn.ops.triton.rotary` import → ModuleNotFoundError. TODO: re-test once sgl_kernel sm_120 ships upstream (typically 2-4 weeks) OR flash-attn cu130 wheel released.
3. Spark sm_121 + SGLang dev-cu13 smoke	❌ blocked	SGLang dev-cu13 recognizes the checkpoint as `modelopt_fp4`, enters `ModelOptModelLoader`, detects `nvfp4`, auto-selects Blackwell `fp4-gemm-backend=flashinfer_cudnn`, then fails during weight load: `qwen3_5.py load_weights -> linear.py weight_loader_v2 -> load_merged_column_weight -> assert param_data.shape == loaded_weight.shape`. Restoring the original `Qwen3_5MoeForConditionalGeneration` architecture is required for SGLang to reach this point; the vLLM text-only `Qwen3_5MoeForCausalLM` patch is not compatible with SGLang.

Do not assume this checkpoint serves correctly. The PTQ artifact integrity is sound; the current blockers are backend loader support for Qwen3.6-A3B modelopt_fp4 multimodal-MoE serialization and fused/merged projection shape mapping.
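Before spending time on a backend smoke test, it can help to confirm that the exported metadata says what this card claims (NVFP4 + FP8 KV + group_size=16). The following is a minimal sketch, not part of the original tooling. It assumes `hf_quant_config.json` keeps these fields under a top-level `quantization` key with the names `quant_algo`, `kv_cache_quant_algo`, and `group_size`; verify against your own export, since the key names are an assumption here. The function names are hypothetical.

```python
import json

# Expected values taken from this model card (NVFP4 + FP8 KV + group_size=16).
EXPECTED = {"quant_algo": "NVFP4", "kv_cache_quant_algo": "FP8", "group_size": 16}


def check_quant_config(cfg: dict) -> list:
    """Return a list of human-readable mismatches (an empty list means OK).

    Assumes the metadata lives under a top-level "quantization" key.
    """
    quant = cfg.get("quantization", {})
    problems = []
    for key, want in EXPECTED.items():
        got = quant.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems


def check_quant_config_file(path: str) -> list:
    """Same check, reading the JSON from disk."""
    with open(path, "r", encoding="utf-8") as fh:
        return check_quant_config(json.load(fh))


# Illustrative in-memory config; the real file may carry more fields.
sample = {
    "producer": {"name": "modelopt", "version": "0.43.0"},
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": "FP8",
        "group_size": 16,
    },
}
print(check_quant_config(sample))  # []
```

An empty list only says the metadata is self-consistent with the recipe; it says nothing about whether a given backend loader can map the tensors.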

Quantization recipe

Parameter	Value
Tool	`nvidia-modelopt` 0.43.0 (NVIDIA Model Optimizer)
Tool examples	`NVIDIA/TensorRT-Model-Optimizer` at tag `0.43.0` (matching PyPI release)
Scheme	NVFP4 (4-bit weights + 4-bit activations, group_size=16, fp8 per-group scale)
qformat	`nvfp4_experts_only` (MoE-aware: auto-ignore `lm_head` / `mlp.gate` / `router` / `visual` / `vision` / `encoder`)
KV cache	`fp8_cast` (NVIDIA default)
Calibration	`cnn_dailymail` only, 512 samples × 512 seq
Base model dtype	BF16 (dequantized from `Qwen/Qwen3.6-35B-A3B-FP8` via `engine/convert_fp8_to_bf16.py`)
Hardware (PTQ)	NVIDIA RTX PRO 6000 Blackwell sm_120 / 96GB GDDR7
transformers	5.8.0 (pinned; modelopt[hf] default downgrades to 4.57 which breaks Qwen3.6 — see traps)
Output	HF safetensors (24GB single shard) + `hf_quant_config.json`
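To make the "group_size=16, per-group scale" row concrete, here is a small pure-Python sketch of block-wise FP4 fake quantization. It is an illustration of the arithmetic only, not modelopt's implementation: the per-group scale is kept in full precision (real NVFP4 stores it in FP8 and adds a per-tensor scale on top), ties round toward the smaller magnitude, and the function names are hypothetical.

```python
# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def fake_quant_group(values):
    """Quantize then dequantize one group of floats that share a single scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values), 0.0
    # Map the largest magnitude in the group onto the largest grid value (6.0).
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in values:
        # Nearest grid magnitude; min() keeps the smaller value on exact ties.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out, scale


def fake_quant(weights):
    """Apply fake_quant_group to consecutive groups of GROUP_SIZE elements."""
    result = []
    for i in range(0, len(weights), GROUP_SIZE):
        dequantized, _ = fake_quant_group(weights[i:i + GROUP_SIZE])
        result.extend(dequantized)
    return result


group = [6.0, 3.0, -1.5, 0.4] + [0.0] * 12
print(fake_quant(group)[:4])  # [6.0, 3.0, -1.5, 0.5]
```

Because the widest gap in the grid is between 4 and 6, the worst-case rounding error inside a group is one sixth of that group's largest magnitude, which is why a small group size limits the damage a single outlier can do.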

Calibration caveat (vs NVIDIA reference)

NVIDIA's own NVFP4 reference models use dual calibration on cnn_dailymail + Nemotron-Post-Training-Dataset-v2. We used only cnn_dailymail (Nemotron is a gated dataset that requires accepting a license). The modelopt documentation states that calibration accuracy is robust across different choices of calibration data, but our coverage of the long-tail token distribution is narrower than NVIDIA's reference. For the production Lynn V4-Distill ship (2026-05-15), we may use task-aligned calibration data (a Lynn V4 distill subset) for better in-domain quality.

⚠️ 14 ecosystem traps caught during rehearsal (5/15 ship pre-flight intel)

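Documented for downstream users following the same path. Each is a real first-touch risk for the 5/15 ship; the Fix column gives the workaround, or the upstream change to wait for where no workaround exists: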

#	Trap	Where	Fix
1	`pip install nvidia-modelopt[hf]` force-downgrades `transformers 5.x → 4.57.6` (deps pin `<5`); Qwen3.6 `qwen3_5_moe` model_type requires transformers ≥ 5.2	install step	After modelopt install, force re-pin: `pip install -U --no-deps transformers==5.8.0 huggingface_hub==1.14.0`
2	`git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git` times out (~130s) from mainland China	TRT-MO fetch	Use codeload tarball: `curl -fsL -o /tmp/trtmo.tar.gz https://codeload.github.com/NVIDIA/TensorRT-Model-Optimizer/tar.gz/refs/tags/0.43.0`. Note: extracted top-dir is `Model-Optimizer-0.43.0` (without "TensorRT-" prefix)
3	TRT-MO `main` branch `hf_ptq.py` imports `EagleOfflineDataCollator` which doesn't exist in PyPI `nvidia-modelopt 0.43.0`	`main` vs release tag mismatch	Pin TRT-MO checkout to the same release tag as installed modelopt (here `tag/0.43.0`) — `examples/llm_ptq/hf_ptq.py` size differs (50KB at tag 0.43.0 vs 58KB at HEAD, no speculative decoding imports)
4	`huggingface_hub.errors.RuntimeError: Cannot send a request, as the client has been closed.` during dataset download (HF Hub network instability from mainland)	calibration data	`export HF_ENDPOINT=https://hf-mirror.com` before running `hf_ptq.py`
5	`datasets.exceptions.DatasetNotFoundError: nvidia/Nemotron-Post-Training-Dataset-v2 is a gated dataset` (HF Hub default `hf_ptq.py` calibration includes both cnn_dailymail and Nemotron-Post-Training-v2)	calibration data	Either accept Nemotron gated license on HF web UI + use `HF_TOKEN`, or pass `--dataset cnn_dailymail` to skip Nemotron entirely
6	Inline modelopt PTQ via `mtq.quantize()` + `export_hf_checkpoint()` fails with `TypeError: 'NoneType' object is not iterable` at `is_multimodal_model` (model loaded with `AutoModelForCausalLM` from multimodal-base sets `architectures = None` for export path)	export step	Use the `hf_ptq.py` CLI wrapper (sets architectures correctly) instead of writing inline modelopt API calls. CLI handles multimodal model class registration properly
7	SGLang stable v0.5.9 / nightly v0.5.11 `sgl_kernel` PyPI wheel ships only `sm_90` and `sm_100` precompiled binaries — `sm_120/` directory does not exist	serving on sm_120 (Blackwell)	Wait for upstream sm_120 binary release; or build sgl_kernel from source matching local CUDA+torch combination
8	vLLM 0.20.2 hard-requires `flash_attn.ops.triton.rotary` for Qwen3_VL vision tower rotary embedding initialization	serving via vLLM	Install `flash-attn` — but no PyPI wheel exists for `torch 2.11 + cu130 + py3.12` combination at this time (PyPI search returns "no matching distribution")
9	vLLM 0.20.2 `ModelRegistry` registers only `Qwen3_5MoeForConditionalGeneration` for `model_type=qwen3_5_moe`. Patching `config.json architectures = ["Qwen3_5MoeForCausalLM"]` (a class that DOES exist in `qwen3_5.py:556`) has no effect — vLLM falls back to ConditionalGeneration which force-inits the vision tower regardless	serving on Blackwell with no flash-attn	Wait for either (a) sgl_kernel sm_120 binary, (b) vLLM upstream to register `Qwen3_5MoeForCausalLM` separately, (c) flash-attn cu130 wheel. None available 2026-05-11
10	Spark sm_121 + SGLang dev-cu13 reaches `ModelOptModelLoader` and recognizes `nvfp4`, but fails in Qwen3.5 MoE weight loading with `assert param_data.shape == loaded_weight.shape` at `load_merged_column_weight`	serving this exact `modelopt_fp4` artifact on Spark	Backend-side support gap: Qwen3.6-A3B modelopt serialization and SGLang's fused/merged projection parameter shapes do not currently align. Needs upstream loader fix or a modelopt export variant that matches SGLang's expected fused-column layout.
11	`lmsysorg/sglang:v0.5.9` stable image does not ship `modelopt` Python package (unlike `dev-cu13` which has it); `transformers 4.57.1` in this image also does not register `qwen3_5_moe` model_type → cannot load this ckpt at all	serving via SGLang stable v0.5.9	Wait for SGLang stable to bundle `modelopt` + a transformers release that maps `qwen3_5_moe`. As of 2026-05-12 neither is true in any tagged SGLang release.
12	`nvcr.io/nvidia/tritonserver:26.03/26.04-trtllm-python-py3` ships TRT-LLM 1.2.0/1.2.1 + `transformers 4.57.3` (hard-pinned). TRT-LLM successfully reads `hf_quant_config.json` and recognizes `nvfp4 + fp8 kv + group_size=16`, but the executor worker dies with `KeyError: 'qwen3_5_moe'` from `transformers.models.auto.configuration_auto.CONFIG_MAPPING_NAMES`. Force-upgrading to `transformers 5.8.0` (via aliyun pypi mirror) makes `qwen3_5_moe` known but cascades into `ImportError: cannot import name 'AutoModelForVision2Seq'` (renamed → `AutoModelForImageTextToText` in transformers 5.x), then after sed-patching that into `ImportError: cannot import name 'get_parameter_device' from 'transformers.modeling_utils'`, with several more API renames likely behind it.	serving via TRT-LLM 1.2.x	Patch-cascade is bottomless; not a 5/15-ship path. Wait for TRT-LLM 1.3 release that pins to a transformers ≥ 5.2 (covers `qwen3_5_moe`) without the legacy `AutoModelForVision2Seq` / `get_parameter_device` API surface.
13	Docker `restart=always` (sglang-35b-fp8) and `restart=unless-stopped` (elyza-nvllm) policies cause production containers to auto-resurrect after `docker stop`, immediately taking GPU memory back from the fp4 test container → CUDA stream OOM in TRT-LLM with `Free memory on device cuda:0 (82.6/119.63 GiB) on startup is less than desired GPU memory utilization (0.7, 83.74 GiB)`	test runner that needs to free GPU on Spark	Pre-test: `docker update --restart=no <container>` for each production container before `docker stop`. Post-test (restore): `docker start <container> && docker update --restart=<original-policy> <container>`. Otherwise daemon-level policy beats stop.
14	`lmsysorg/sglang:dev-cu13` ships TRT-LLM-style `Qwen3_5MoeForConditionalGeneration` registered but not `Qwen3_5MoeForCausalLM` (which modelopt rewrites the architecture to). Patching `config.json architectures` back to the original `ConditionalGeneration` is what lets SGLang reach the deeper `load_merged_column_weight` shape assert (Trap #10).	configuration boundary between modelopt export & SGLang	After modelopt PTQ, do not keep modelopt's `architectures = ["Qwen3_5MoeForCausalLM"]` rewrite if targeting SGLang dev-cu13; revert to original `Qwen3_5MoeForConditionalGeneration`.
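The Trap #14 fix (reverting the `architectures` field before targeting SGLang dev-cu13) is a one-field edit to `config.json`. The sketch below shows one way to script it; `set_architectures` is a hypothetical helper, and the demo runs against a throwaway file rather than the real checkpoint. It assumes `architectures` is a top-level key, as in standard Hugging Face configs.

```python
import json
import tempfile
from pathlib import Path

# Architecture name SGLang dev-cu13 has registered, per Trap #14.
SGLANG_ARCH = "Qwen3_5MoeForConditionalGeneration"


def set_architectures(config_path, arch=SGLANG_ARCH):
    """Rewrite the "architectures" field in config.json; return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("architectures")
    config["architectures"] = [arch]
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return previous


# Demo on a temporary config.json that mimics the post-PTQ rewrite.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.json"
    demo.write_text(
        json.dumps({"architectures": ["Qwen3_5MoeForCausalLM"],
                    "model_type": "qwen3_5_moe"}),
        encoding="utf-8",
    )
    previous = set_architectures(demo)
    patched = json.loads(demo.read_text(encoding="utf-8"))

print(previous, "->", patched["architectures"])
```

Returning the previous value makes it easy to restore the modelopt rewrite later. Note that this only gets SGLang as far as the shape assert in Trap #10; it does not make the checkpoint serve.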

Bottom line: 4 serving routes were tested (SGLang dev-cu13, SGLang v0.5.9 stable, vLLM aeon-7 fork, TRT-LLM 1.2.0/1.2.1 in tritonserver 26.03/26.04). All 4 are blocked by version skew between the Qwen3.6 qwen3_5_moe model_type, transformers (4.57 vs 5.8 API), modelopt scale-tensor layout, and SGLang/vLLM fused QKV+gate_up loader expectations. The PTQ artifact itself is sound; production serving on Blackwell requires one of two upstream combinations to land together: SGLang stable with modelopt and a qwen3_5_moe-aware transformers, or TRT-LLM 1.3 pinned to transformers ≥ 5.2. ETA: estimated 4–8 weeks from 2026-05-12.
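Since most of the blockers reduce to which transformers version a given image carries, a small gate at the top of a smoke script can fail fast instead of surfacing as `KeyError: 'qwen3_5_moe'` deep in a loader. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the version parser is deliberately naive (it ignores pre-release suffixes, so `5.2.0.dev0` counts as satisfying 5.2).

```python
from importlib import metadata


def parse_version(text):
    """Turn '5.8.0' into (5, 8, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # e.g. '2rc1': keep the 2, drop the suffix and the rest
    return tuple(parts)


def meets_minimum(installed, minimum="5.2"):
    """True if the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_transformers_ok(minimum="5.2"):
    """Check the transformers package in the current environment, if any."""
    try:
        return meets_minimum(metadata.version("transformers"), minimum)
    except metadata.PackageNotFoundError:
        return False


print(meets_minimum("4.57.6"), meets_minimum("5.8.0"))  # False True
```

The 5.2 threshold is the one this card gives for `qwen3_5_moe` recognition; passing the gate is necessary but not sufficient, as the TRT-LLM import cascade in Trap #12 shows.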

For the 2026-05-15 Lynn V4-Distill ship, the production-shipped quantization is the predecessor compressed-tensors v8-RTN ckpt (nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN), which has been verified end-to-end on Spark dev-cu13 nightly and currently serves production traffic in the sglang-35b-fp8 slot. We will re-attempt modelopt_fp4 once the upstream gap closes.

Spark sm_121 reference perf (compressed-tensors v8-RTN, for comparison)

The compressed-tensors v8-RTN predecessor of this checkpoint (same Qwen3.6-35B-A3B base, same NVFP4 algorithm, different serialization format) was perf-tested on NVIDIA Spark GB10 sm_121 via SGLang dev-cu13 docker (data published in nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN model card):

Scenario	NVFP4 v8-RTN + MTP	Note
short single	103.2 chars/s	sm_121 reference
N=4 agg	309.7 chars/s	sm_121 reference
N=8 agg	499.0 chars/s	sm_121 reference
N=16 agg	756.0 chars/s	sm_121 reference

These numbers are NOT for this checkpoint — they are for the v8-RTN predecessor in compressed-tensors format. For modelopt_fp4, Spark smoke currently fails before serving due to Trap #10.
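For readers comparing concurrency levels, the v8-RTN reference figures can be turned into speedup and per-stream throughput with a few lines of arithmetic. The sketch below only re-derives ratios from the four published numbers; it assumes the "short single" row corresponds to a concurrency of 1, which the table does not state explicitly.

```python
# Aggregate throughput in chars/s, copied from the v8-RTN reference table.
# Assumption: "short single" is treated as concurrency N=1.
THROUGHPUT = {1: 103.2, 4: 309.7, 8: 499.0, 16: 756.0}


def scaling_table(throughput):
    """Derive speedup, per-stream rate, and efficiency for each concurrency."""
    base = throughput[1]
    rows = []
    for n in sorted(throughput):
        agg = throughput[n]
        rows.append({
            "concurrency": n,
            "speedup": agg / base,          # aggregate gain over one stream
            "per_stream": agg / n,          # what each request sees
            "efficiency": agg / (base * n),  # 1.0 would be perfect scaling
        })
    return rows


rows = scaling_table(THROUGHPUT)
for row in rows:
    print(
        f"N={row['concurrency']:>2}  "
        f"speedup={row['speedup']:.2f}x  "
        f"per-stream={row['per_stream']:.1f} chars/s  "
        f"efficiency={row['efficiency']:.0%}"
    )
```

Aggregate throughput keeps rising through N=16 while the per-stream rate drops, the usual batching trade-off; none of this applies to the modelopt_fp4 checkpoint until it serves at all.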

Scenario	modelopt_fp4 (this ckpt) on Spark sm_121	Status
short single	N/A	blocked at SGLang weight load
N=4 agg	N/A	blocked at SGLang weight load
N=8 agg	N/A	blocked at SGLang weight load
N=16 agg	N/A	blocked at SGLang weight load

Files

config.json                    # architectures: [Qwen3_5MoeForConditionalGeneration] for SGLang; CausalLM patch was vLLM-only and is not SGLang-compatible
chat_template.jinja
generation_config.json
hf_quant_config.json           # modelopt quantization metadata
model.safetensors              # 24GB single shard, NVFP4 packed
processor_config.json
tokenizer.json
tokenizer_config.json
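The integrity checks mentioned in the status table (self-consistent offsets, tensor count) can be reproduced without loading 24GB of weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header listing every tensor's `data_offsets`. The following is a minimal standard-library sketch, not the script used for this card; the function names are hypothetical and the demo writes a tiny synthetic file.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return (header_dict, data_size) without reading tensor data."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len))
        fh.seek(0, 2)  # jump to end of file to learn the total size
        data_size = fh.tell() - 8 - header_len
    return header, data_size


def check_offsets(header, data_size):
    """True if tensor byte ranges tile the data section with no gap or overlap."""
    spans = sorted(
        tuple(info["data_offsets"])
        for name, info in header.items()
        if name != "__metadata__"
    )
    cursor = 0
    for begin, end in spans:
        if begin != cursor or end < begin:
            return False
        cursor = end
    return cursor == data_size


# Demo: a synthetic two-tensor file with 8 bytes of zeroed data.
demo_header = {
    "a": {"dtype": "U8", "shape": [4], "data_offsets": [0, 4]},
    "b": {"dtype": "BF16", "shape": [2], "data_offsets": [4, 8]},
}
blob = json.dumps(demo_header).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    with open(demo_path, "wb") as fh:
        fh.write(struct.pack("<Q", len(blob)) + blob + bytes(8))
    header, data_size = read_safetensors_header(demo_path)

print(len(header), check_offsets(header, data_size))  # 2 True
```

Listing `header.keys()` on the real `model.safetensors` is also a quick way to see the per-expert scale names that the vLLM fork's loader rejected.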

Recommended deployment (once smoke verifies)

# SGLang stable path (NV-blessed). As of 2026-05-12 the v0.5.9 image cannot load this ckpt (Trap #11); retry on a later stable tag
docker run --gpus all -v /path/to/ckpt:/model lmsysorg/sglang:v0.5.9 \
  python3 -m sglang.launch_server --model /model \
    --quantization modelopt_fp4 \
    --tensor-parallel-size 1 \
    --trust-remote-code

# vLLM ≥ 0.17 path (once flash-attn ecosystem catches up)
vllm serve /path/to/ckpt \
  --quantization modelopt_fp4 \
  --reasoning-parser qwen3

Citation / acknowledgments

Base model: Qwen/Qwen3.6-35B-A3B
FP8 reference for vision weights borrowed: Qwen/Qwen3.6-35B-A3B-FP8
Quantization tool: nvidia-modelopt ≥ 0.42
Reference NVFP4 model (same route): nvidia/Qwen3.5-397B-A17B-NVFP4
Companion zhihu route-postmortem (link pending publish)
Legacy compressed-tensors variant: nerkyor/Qwen3.6-35B-A3B-NVFP4-v8-RTN

Status update history

2026-05-11: PTQ verified on R6000 sm_120; 9 ecosystem traps documented; sm_120/cu130 backend ecosystem gating R6000 serve verification.
2026-05-12: Spark sm_121 + SGLang dev-cu13 smoke attempted. Backend reaches ModelOptModelLoader and detects nvfp4, then fails at SGLang load_merged_column_weight shape assertion. Trap #10 added; checkpoint remains candidate / not serving-verified.
