Qwen3.5-35B-A3B-heretic-v2-FP8

FP8 block-wise quantization of llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated via Heretic v1.2.0 MPOA+SOMA).

Quantization format matches Qwen/Qwen3.5-35B-A3B-FP8 exactly (same tensor names, scale format, skip list).

Quantization Details

Parameter           Value
------------------  -------------------
Method              FP8 E4M3 block-wise
Block size          128 x 128
Scale dtype         BF16
Scale naming        weight_scale_inv
Activation scheme   Dynamic
Original size       ~66 GB (BF16)
Quantized size      ~35 GB
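As a rough illustration of what block-wise FP8 with 128x128 blocks and weight_scale_inv scales means, here is a minimal NumPy sketch. The real quantizer's kernels, rounding, and tensor layout differ; the function name and shapes here are illustrative only.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 128       # block size used by this checkpoint (128x128)

def quantize_blockwise(weight, block=BLOCK):
    """Per-block scaling sketch: each block x block tile is divided by a
    scale chosen so its absolute maximum fits the E4M3 range. Returns the
    scaled weight and the per-block scales (stored as the BF16
    weight_scale_inv tensors in this checkpoint). Illustrative only."""
    rows, cols = weight.shape
    n_r, n_c = -(-rows // block), -(-cols // block)  # ceil division
    scales = np.ones((n_r, n_c), dtype=np.float32)
    q = np.empty_like(weight, dtype=np.float32)
    for i in range(n_r):
        for j in range(n_c):
            r = slice(i * block, (i + 1) * block)
            c = slice(j * block, (j + 1) * block)
            amax = np.abs(weight[r, c]).max()
            scales[i, j] = (amax / E4M3_MAX) if amax > 0 else 1.0
            q[r, c] = weight[r, c] / scales[i, j]  # would be cast to FP8 here
    return q, scales

w = np.random.randn(256, 384).astype(np.float32)
q, s = quantize_blockwise(w)
print(s.shape)  # one scale per 128x128 block: (2, 3)
```

At inference time the stored weight is multiplied back by its block's scale, which is why the scale tensor is named weight_scale_inv rather than weight_scale.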

What's quantized

All linear projections in attention (q/k/v/o_proj), GatedDeltaNet (in_proj_qkv, in_proj_z, out_proj), and MLP experts (gate/up/down_proj for all 256 experts + shared expert).

What's kept in BF16

Embeddings, lm_head, norms, MoE router gates, GDN precision-sensitive params (conv1d, in_proj_a, in_proj_b), vision tower.
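The quantize/skip split above can be expressed as a name-based filter. The substrings below are assumptions inferred from the lists in this card, not the checkpoint's actual quantization_config:

```python
# Hypothetical skip list mirroring the "kept in BF16" list above.
SKIP_SUBSTRINGS = (
    "embed_tokens", "lm_head", "norm",
    "gate.",        # MoE router gate (gate_proj experts still quantize)
    "conv1d", "in_proj_a", "in_proj_b",  # GDN precision-sensitive params
    "visual",       # vision tower
)

def should_quantize(tensor_name: str) -> bool:
    """Return True if a tensor would be converted to FP8 under the
    split described above. Illustrative only."""
    return tensor_name.endswith(".weight") and not any(
        s in tensor_name for s in SKIP_SUBSTRINGS
    )

print(should_quantize("model.layers.0.self_attn.q_proj.weight"))  # True
print(should_quantize("model.layers.0.mlp.gate.weight"))          # False (router)
```

Note the trailing dot in "gate.": it skips the router's gate.weight while leaving the experts' gate_proj weights eligible for quantization.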

Usage with SGLang

CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
SGLANG_ENABLE_SPEC_V2=1 \
  python -m sglang.launch_server \
    --model-path nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8 \
    --tool-call-parser qwen3_coder \
    --port 30000 --host 0.0.0.0 \
    --mem-fraction-static 0.85 \
    --context-length 32768 \
    --attention-backend triton \
    --reasoning-parser qwen3 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
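Once the server is up, it serves SGLang's OpenAI-compatible API. A request body like the following (prompt and sampling values are placeholders) would be POSTed to http://localhost:30000/v1/chat/completions:

```python
import json

# Request body for the server launched above. The model field must match
# --model-path; prompt and sampling values are placeholders.
payload = {
    "model": "nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8",
    "messages": [
        {"role": "user", "content": "Summarize FP8 E4M3 in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    # set enable_thinking to False to suppress <think> reasoning traces
    "chat_template_kwargs": {"enable_thinking": True},
}
print(json.dumps(payload, indent=2))
```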

Usage with vLLM

from vllm import LLM, SamplingParams
llm = LLM("nivvis/Qwen3.5-35B-A3B-heretic-v2-FP8", trust_remote_code=True)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.7, max_tokens=256))

Performance

Tested on NVIDIA RTX PRO 6000 Blackwell (98 GB):

  • Decode throughput: ~134 tokens/s (single request)
  • VRAM usage: ~34 GB model + KV cache
  • Default FP8 kernel configs (no device-specific tuning)

Notes

  • No MTP: Heretic abliteration strips the MTP (multi-token prediction) weights, so do not use --speculative-algo NEXTN (acceptance rate drops to ~0.25). Restoring MTP by copying the head from the original Qwen model and fine-tuning it against the heretic model's hidden states is a potential future improvement.
  • Thinking model: Supports <think> reasoning. Use "chat_template_kwargs": {"enable_thinking": false} to disable.
  • Validated against official Qwen/Qwen3.5-35B-A3B-FP8: 0 missing scale tensors, 0 unexpected tensors, matching quantization config.
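The scale-tensor validation described above can be reproduced by diffing the two repos' model.safetensors.index.json files; the helper name below is hypothetical:

```python
import json

def diff_weight_maps(index_a_path, index_b_path):
    """Return (only_in_a, only_in_b): tensor names present in one
    model.safetensors.index.json but not the other. A clean validation
    against the official FP8 checkpoint yields two empty lists."""
    def names(path):
        with open(path) as f:
            return set(json.load(f)["weight_map"])
    a, b = names(index_a_path), names(index_b_path)
    return sorted(a - b), sorted(b - a)
```

The weight_map field of a safetensors index maps every tensor name to its shard file, so a set difference over its keys catches both missing scale tensors and unexpected extras.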

Credits
