Qwen3.5-REAP-212B-A17B-W4A16
INT4 Weight-Quantized Qwen 3.5 REAP MoE (212B total / 17B active params)
A W4A16 quantization of OpenMOSE/Qwen3.5-REAP-212B-A17B, produced with Intel's AutoRound using the standard NeelNanda/pile-10k calibration dataset. It follows the same recipe 0xSero used for the GLM-4.7-REAP-218B-A32B-W4A16 quantization.
Also consider trying https://huggingface.co/OpenMOSE/Qwen3.5-REAP-262B-A17B
Model Details
| Property | Value |
|---|---|
| Base Model | OpenMOSE/Qwen3.5-REAP-212B-A17B |
| Architecture | Mixture-of-Experts (MoE), REAP-pruned |
| Total Parameters | 212B |
| Active Parameters | 17B per forward pass |
| Quantization | INT4 weights, FP16 activations (W4A16) |
| Group Size | 128 |
| Calibration Dataset | NeelNanda/pile-10k |
| Calibration Samples | 64 |
| Sequence Length | 512 |
| Format | AutoRound |
| Quantization Tool | Intel AutoRound |
Quantization Details
AutoRound is Intel's weight-only quantization method. Instead of naively rounding each weight to its nearest 4-bit value, it uses signed gradient descent to learn per-weight rounding directions that minimize output error across calibration samples. Think of it as the difference between rounding every number on your tax return individually and rounding them in a coordinated way that keeps the total as accurate as possible.
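The intuition can be shown with a toy brute-force version of the same idea (illustrative only; AutoRound itself uses signed gradient descent, not exhaustive search, and the sizes here are made up for the demo):

```python
# Contrast round-to-nearest with coordinated rounding: choose each weight's
# rounding direction (down or up) to minimize *output* error on calibration
# inputs, rather than per-weight error.
import itertools
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)            # full-precision weights
x = rng.normal(size=(64, 8))      # calibration activations
scale = np.abs(w).max() / 7       # symmetric 4-bit grid, integers in [-8, 7]

nearest = np.clip(np.round(w / scale), -8, 7) * scale
err_nearest = np.mean((x @ nearest - x @ w) ** 2)

lo = np.floor(w / scale)
best_err = np.inf
for up in itertools.product([0.0, 1.0], repeat=w.size):  # 2^8 rounding choices
    cand = np.clip(lo + np.array(up), -8, 7) * scale
    best_err = min(best_err, np.mean((x @ cand - x @ w) ** 2))

# Round-to-nearest is one of the searched combinations, so the
# coordinated choice can only match or beat it:
assert best_err <= err_nearest
```

The coordinated search always does at least as well as nearest rounding because nearest rounding is one point in its search space; AutoRound's contribution is finding a good point in that space efficiently for billions of weights.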
AutoRound Config:
bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
batch_size: 1
dataset: NeelNanda/pile-10k
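These settings imply a rough storage footprint. A back-of-the-envelope sketch, assuming one FP16 scale and one INT4 zero point per 128-weight group and ignoring the unquantized gate layer and embeddings (the zero-point width is an assumption, so treat the result as an estimate):

```python
# Effective bits per weight under W4 with group_size=128:
# 4-bit weight + (16-bit scale + 4-bit zero point) amortized over 128 weights.
total_params = 212e9
bits_per_weight = 4 + (16 + 4) / 128          # = 4.15625
size_gb = total_params * bits_per_weight / 8 / 1e9
baseline_gb = total_params * 16 / 8 / 1e9     # BF16 baseline

print(f"{size_gb:.0f} GB quantized vs {baseline_gb:.0f} GB in BF16")
# → 110 GB quantized vs 424 GB in BF16
```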
What Gets Quantized (and What Doesn't)
The quantization process handles 542 of the 543 layers in each transformer block. The one layer deliberately skipped is mlp.shared_expert_gate, the MoE router. Router weights decide which experts activate for a given token, and quantizing them would be like blurring the traffic signals at an intersection: the routing decisions should stay precise even while the experts themselves run at reduced precision.
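A minimal sketch of the kind of name-based exclusion involved. This is pure Python over illustrative layer names, not AutoRound's actual configuration API; the names are abbreviated examples, not the real per-block module list:

```python
# Hypothetical module names for one transformer block (illustrative only).
layer_names = [
    "mlp.experts.0.gate_proj", "mlp.experts.0.up_proj", "mlp.experts.0.down_proj",
    "mlp.shared_expert.gate_proj", "mlp.shared_expert_gate",
    "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj",
]

SKIP_SUFFIXES = ("shared_expert_gate",)  # routing weights stay in full precision

to_quantize = [n for n in layer_names if not n.endswith(SKIP_SUFFIXES)]
skipped = [n for n in layer_names if n.endswith(SKIP_SUFFIXES)]

assert skipped == ["mlp.shared_expert_gate"]
```

Note that the suffix match must be exact: mlp.shared_expert.gate_proj (a shared expert's projection) is still quantized, while mlp.shared_expert_gate (the router) is not.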
How It Was Made
The Fun Part
import torch

from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"OpenMOSE/Qwen3.5-REAP-212B-A17B",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"OpenMOSE/Qwen3.5-REAP-212B-A17B",
trust_remote_code=True,
)
ar = AutoRound(
model,
tokenizer=tokenizer,
device="cuda",
nsamples=64,
seqlen=512,
batch_size=1,
)
ar.quantize_and_save("./Qwen3.5-REAP-212B-A17B-W4A16", format="auto_round")
The Less Fun Part
If you're running transformers>=5.0 (which you need for Qwen 3.5's qwen3_5_moe architecture), you'll hit two issues that require monkey-patching before the imports above:
1. transformers 5.x removed pytorch_utils.Conv1D, which AutoRound still references at import time:
import types, torch, transformers

if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils
This is safe — Conv1D is only used by GPT-2 style models. Qwen uses standard nn.Linear layers throughout, so AutoRound will never actually invoke it. We're just satisfying the import.
2. AutoRound detects Qwen 3.5 as a multimodal model and routes to its MLLM quantization path, which crashes because there's no video processor to load. We need to override the detection at both the module level and the local binding:
import auto_round.utils
import auto_round.autoround
auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False
Both patches must come before from auto_round import AutoRound.
Hardware & Runtime
Quantized on 8× NVIDIA H100 80GB HBM3:
| Stage | Time |
|---|---|
| Weight loading | ~51 seconds |
| Calibration caching | ~4 minutes |
| Quantization (60 blocks) | ~4.5 hours |
| Total | ~5 hours |
Peak RAM: ~404GB. Peak VRAM: ~69GB. The model shards comfortably across 8 GPUs with device_map="auto".
Deployment
vLLM
vllm serve Qwen3.5-REAP-212B-A17B-W4A16 \
--tensor-parallel-size 4 \
--trust-remote-code \
--quantization gptq
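Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch using only the standard library; the URL and port assume a default local vLLM launch, and the payload follows the chat-completions convention vLLM implements:

```python
import json
import urllib.request

payload = {
    "model": "Qwen3.5-REAP-212B-A17B-W4A16",
    "messages": [{"role": "user", "content": "Explain REAP pruning in one sentence."}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # default vLLM port (assumption)
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```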
Acknowledgments
- OpenMOSE — Base REAP-pruned Qwen 3.5 model
- Cerebras — REAP methodology (arXiv:2510.13999)
- Intel — AutoRound quantization framework
- 0xSero — Reference recipe from GLM-4.7-REAP-218B-A32B-W4A16
Citation
@article{jones2025reap,
title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
author={Jones and others},
journal={arXiv preprint arXiv:2510.13999},
year={2025}
}
@misc{autoround2024,
title={AutoRound: Advanced Weight Quantization},
author={Intel Corporation},
year={2024},
howpublished={\url{https://github.com/intel/auto-round}}
}
Model tree for atbender/Qwen3.5-REAP-212B-A17B-W4A16
Base model
Qwen/Qwen3.5-397B-A17B