# Qwen3.5-35B-A3B-heretic-v2-FP8
FP8 block-wise quantization of llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated via Heretic v1.2.0 MPOA+SOMA).
Quantization format matches Qwen/Qwen3.5-35B-A3B-FP8 exactly (same tensor names, scale format, skip list).
## Quantization Details
| Parameter | Value |
|---|---|
| Method | FP8 E4M3 block-wise |
| Block size | 128x128 |
| Scale dtype | BF16 |
| Scale naming | weight_scale_inv |
| Activation scheme | Dynamic |
| Original size | ~66 GB (BF16) |
| Quantized size | ~35 GB |
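A minimal sketch of the block-wise scheme in pure NumPy. Names and storage are illustrative: the real checkpoint stores true FP8 E4M3 tensors plus BF16 per-block scales under the `weight_scale_inv` naming convention, while this sketch only computes the scales and skips E4M3 rounding.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def blockwise_scales(w, block=128):
    """Compute one scale per 128x128 block so each block fits in E4M3 range.

    Illustrative only: exact FP8 storage and scale-tensor semantics follow
    the Qwen FP8 checkpoint convention, not this sketch.
    """
    rows, cols = w.shape
    nbr = (rows + block - 1) // block
    nbc = (cols + block - 1) // block
    scales = np.ones((nbr, nbc), dtype=np.float32)
    q = np.empty_like(w, dtype=np.float32)  # stand-in for the FP8 tensor
    for i in range(nbr):
        for j in range(nbc):
            rs = slice(i * block, (i + 1) * block)
            cs = slice(j * block, (j + 1) * block)
            amax = float(np.abs(w[rs, cs]).max())
            if amax > 0:
                scales[i, j] = amax / FP8_E4M3_MAX
            # Scale the block into E4M3 range; a real kernel would now
            # round each entry to the nearest representable E4M3 value.
            q[rs, cs] = w[rs, cs] / scales[i, j]
    return q, scales

# Dequantization multiplies each block back by its scale:
#   w_hat[rs, cs] = q[rs, cs] * scales[i, j]
```

Because this sketch omits the E4M3 rounding step, the scale/dequantize round-trip is lossless; the real quantization error comes from rounding each scaled entry to an 8-bit float.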
### What's quantized
All linear projections in attention (q/k/v/o_proj), GatedDeltaNet (in_proj_qkv, in_proj_z, out_proj), and MLP experts (gate/up/down_proj for all 256 experts + shared expert).
### What's kept in BF16
Embeddings, lm_head, norms, MoE router gates, GDN precision-sensitive params (conv1d, in_proj_a, in_proj_b), vision tower.
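The split above amounts to a name-based filter over the checkpoint's tensors. A hypothetical version is sketched below; the substring markers are assumptions based on common Qwen tensor naming, not a copy of the actual skip list.

```python
# Hypothetical tensor-name filter mirroring the BF16 skip list above.
# Substring choices are assumptions; verify them against the actual
# checkpoint's tensor names before reuse.
KEEP_BF16_MARKERS = (
    "embed_tokens",   # embeddings
    "lm_head",        # output head
    "norm",           # all norm layers
    ".mlp.gate.",     # MoE router gate (distinct from gate_proj)
    "conv1d",         # GDN precision-sensitive params
    "in_proj_a",
    "in_proj_b",
    "visual",         # vision tower
)

def should_quantize(name: str) -> bool:
    """True if the tensor named `name` should be stored in FP8."""
    if not name.endswith(".weight"):
        return False  # biases, scales, etc. stay as-is
    return not any(marker in name for marker in KEEP_BF16_MARKERS)
```

Note the router-gate marker includes the trailing dot so that expert `gate_proj` weights (which must be quantized) are not caught by it.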
## Usage with SGLang
```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
SGLANG_ENABLE_SPEC_V2=1 \
python -m sglang.launch_server \
  --model-path nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8 \
  --tool-call-parser qwen3_coder \
  --port 30000 --host 0.0.0.0 \
  --mem-fraction-static 0.85 \
  --context-length 32768 \
  --attention-backend triton \
  --reasoning-parser qwen3 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code
```
## Usage with vLLM
```python
from vllm import LLM

model = LLM("nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8", trust_remote_code=True)
```
## Performance
Tested on NVIDIA RTX PRO 6000 Blackwell (98 GB):
- Decode throughput: ~134 tokens/s (single request)
- VRAM usage: ~34 GB model + KV cache
- Default FP8 kernel configs (no device-specific tuning)
## Notes
- **No MTP:** Heretic abliteration strips the MTP weights. Do not use `--speculative-algo NEXTN` (accept rate ~0.25). Restoring MTP by copying the head from the original Qwen model and fine-tuning against heretic hidden states is a potential future improvement.
- **Thinking model:** Supports `<think>` reasoning. Use `"chat_template_kwargs": {"enable_thinking": false}` to disable.
- **Validated** against official Qwen/Qwen3.5-35B-A3B-FP8: 0 missing scale tensors, 0 unexpected tensors, matching quantization config.
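For example, a chat-completions request with thinking disabled might look like the payload below (field placement follows the usual OpenAI-compatible API shape; treat the exact layout as an assumption):

```json
{
  "model": "nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
```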
## Credits
- Base model: Qwen/Qwen3.5-35B-A3B by Qwen Team
- Abliteration: llmfan46/Qwen3.5-35B-A3B-heretic-v2 using Heretic
- FP8 quantization: nivvis
## Model tree

- Base model: Qwen/Qwen3.5-35B-A3B-Base
- Finetuned: Qwen/Qwen3.5-35B-A3B
- Finetuned: llmfan46/Qwen3.5-35B-A3B-heretic-v2
- Quantized: nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8 (this model)