Qwen3.5-REAP-212B-A17B-W4A16
INT4 Weight-Quantized Qwen 3.5 REAP MoE (212B total / 17B active params)
A W4A16 quantization of OpenMOSE/Qwen3.5-REAP-212B-A17B, produced with Intel's AutoRound using the standard NeelNanda/pile-10k calibration dataset. It follows the same recipe 0xSero used for the GLM-4.7-REAP-218B-A32B-W4A16 quantization.
Also consider trying https://huggingface.co/OpenMOSE/Qwen3.5-REAP-262B-A17B
Model Details
| Property | Value |
|---|---|
| Base Model | OpenMOSE/Qwen3.5-REAP-212B-A17B |
| Architecture | Mixture-of-Experts (MoE), REAP-pruned |
| Total Parameters | 212B |
| Active Parameters | 17B per forward pass |
| Quantization | INT4 weights, FP16 activations (W4A16) |
| Group Size | 128 |
| Calibration Dataset | NeelNanda/pile-10k |
| Calibration Samples | 64 |
| Sequence Length | 512 |
| Format | AutoRound |
| Quantization Tool | Intel AutoRound |
Quantization Details
AutoRound is Intel's weight-only quantization method. Instead of naively rounding each weight to its nearest 4-bit value, it uses signed gradient descent to learn per-weight rounding directions that minimize output error across calibration samples. Think of it as the difference between rounding every number on your tax return individually and rounding them in a coordinated way that keeps the total as accurate as possible.
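The intuition can be shown with a toy brute-force version of the same idea (illustrative only; AutoRound itself uses signed gradient descent, not exhaustive search, and the sizes here are made up for the demo):

```python
# Contrast round-to-nearest with coordinated rounding: choose each weight's
# rounding direction (down or up) to minimize *output* error on calibration
# inputs, rather than per-weight error.
import itertools
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)            # full-precision weights
x = rng.normal(size=(64, 8))      # calibration activations
scale = np.abs(w).max() / 7       # symmetric 4-bit grid, integers in [-8, 7]

nearest = np.clip(np.round(w / scale), -8, 7) * scale
err_nearest = np.mean((x @ nearest - x @ w) ** 2)

lo = np.floor(w / scale)
best_err = np.inf
for up in itertools.product([0.0, 1.0], repeat=w.size):  # 2^8 rounding choices
    cand = np.clip(lo + np.array(up), -8, 7) * scale
    best_err = min(best_err, np.mean((x @ cand - x @ w) ** 2))

# Round-to-nearest is one of the searched combinations, so the
# coordinated choice can only match or beat it:
assert best_err <= err_nearest
```

The coordinated search always does at least as well as nearest rounding because nearest rounding is one point in its search space; AutoRound's contribution is finding a good point in that space efficiently for billions of weights.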
AutoRound Config:
bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
batch_size: 1
dataset: NeelNanda/pile-10k
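These settings imply a rough storage footprint. A back-of-the-envelope sketch, assuming one FP16 scale and one INT4 zero point per 128-weight group and ignoring the unquantized gate layer and embeddings (the zero-point width is an assumption, so treat the result as an estimate):

```python
# Effective bits per weight under W4 with group_size=128:
# 4-bit weight + (16-bit scale + 4-bit zero point) amortized over 128 weights.
total_params = 212e9
bits_per_weight = 4 + (16 + 4) / 128          # = 4.15625
size_gb = total_params * bits_per_weight / 8 / 1e9
baseline_gb = total_params * 16 / 8 / 1e9     # BF16 baseline

print(f"{size_gb:.0f} GB quantized vs {baseline_gb:.0f} GB in BF16")
# → 110 GB quantized vs 424 GB in BF16
```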
What Gets Quantized (and What Doesn't)
The quantization process handles 542 of the 543 layers in each transformer block. The one layer deliberately skipped is mlp.shared_expert_gate, the MoE router. Router weights decide which experts activate for a given token, and quantizing them would be like blurring the traffic signals at an intersection: the routing decisions should stay precise even while the experts themselves run at reduced precision.
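A minimal sketch of the kind of name-based exclusion involved. This is pure Python over illustrative layer names, not AutoRound's actual configuration API; the names are abbreviated examples, not the real per-block module list:

```python
# Hypothetical module names for one transformer block (illustrative only).
layer_names = [
    "mlp.experts.0.gate_proj", "mlp.experts.0.up_proj", "mlp.experts.0.down_proj",
    "mlp.shared_expert.gate_proj", "mlp.shared_expert_gate",
    "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj",
]

SKIP_SUFFIXES = ("shared_expert_gate",)  # routing weights stay in full precision

to_quantize = [n for n in layer_names if not n.endswith(SKIP_SUFFIXES)]
skipped = [n for n in layer_names if n.endswith(SKIP_SUFFIXES)]

assert skipped == ["mlp.shared_expert_gate"]
```

Note that the suffix match must be exact: mlp.shared_expert.gate_proj (a shared expert's projection) is still quantized, while mlp.shared_expert_gate (the router) is not.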
How It Was Made
The Fun Part
import torch

from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"OpenMOSE/Qwen3.5-REAP-212B-A17B",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"OpenMOSE/Qwen3.5-REAP-212B-A17B",
trust_remote_code=True,
)
ar = AutoRound(
model,
tokenizer=tokenizer,
device="cuda",
nsamples=64,
seqlen=512,
batch_size=1,
)
ar.quantize_and_save("./Qwen3.5-REAP-212B-A17B-W4A16", format="auto_round")
The Less Fun Part
If you're running transformers>=5.0 (which you need for Qwen 3.5's qwen3_5_moe architecture), you'll hit two issues that require monkey-patching before the imports above:
1. transformers 5.x removed pytorch_utils.Conv1D, which AutoRound still references at import time:
import types, torch, transformers

if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils
This is safe — Conv1D is only used by GPT-2 style models. Qwen uses standard nn.Linear layers throughout, so AutoRound will never actually invoke it. We're just satisfying the import.
2. AutoRound detects Qwen 3.5 as a multimodal model and routes to its MLLM quantization path, which crashes because there's no video processor to load. We need to override the detection at both the module level and the local binding:
import auto_round.utils
import auto_round.autoround
auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False
Both patches must come before from auto_round import AutoRound.
Hardware & Runtime
Quantized on 8× NVIDIA H100 80GB HBM3:
| Stage | Time |
|---|---|
| Weight loading | ~51 seconds |
| Calibration caching | ~4 minutes |
| Quantization (60 blocks) | ~4.5 hours |
| Total | ~5 hours |
Peak RAM: ~404GB. Peak VRAM: ~69GB. The model shards comfortably across 8 GPUs with device_map="auto".
Deployment
vLLM
vllm serve Qwen3.5-REAP-212B-A17B-W4A16 \
--tensor-parallel-size 4 \
--trust-remote-code \
--quantization gptq
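Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch using only the standard library; the URL and port assume a default local vLLM launch, and the payload follows the chat-completions convention vLLM implements:

```python
import json
import urllib.request

payload = {
    "model": "Qwen3.5-REAP-212B-A17B-W4A16",
    "messages": [{"role": "user", "content": "Explain REAP pruning in one sentence."}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # default vLLM port (assumption)
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```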
Acknowledgments
- OpenMOSE — Base REAP-pruned Qwen 3.5 model
- Cerebras — REAP methodology (arXiv:2510.13999)
- Intel — AutoRound quantization framework
- 0xSero — Reference recipe from GLM-4.7-REAP-218B-A32B-W4A16
Citation
@article{jones2025reap,
title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
author={Jones and others},
journal={arXiv preprint arXiv:2510.13999},
year={2025}
}
@misc{autoround2024,
title={AutoRound: Advanced Weight Quantization},
author={Intel Corporation},
year={2024},
howpublished={\url{https://github.com/intel/auto-round}}
}
Model tree for atbender/Qwen3.5-REAP-212B-A17B-W4A16
Base model
Qwen/Qwen3.5-397B-A17B