Qwen3.5-REAP-212B-A17B-W4A16

INT4 Weight-Quantized Qwen 3.5 REAP MoE (212B total / 17B active params)

A W4A16 quantization of OpenMOSE/Qwen3.5-REAP-212B-A17B, produced using Intel's AutoRound with the standard NeelNanda/pile-10k calibration dataset. It follows the same recipe 0xSero used for the GLM-4.7-REAP-218B-A32B-W4A16 quantization.

Also consider trying https://huggingface.co/OpenMOSE/Qwen3.5-REAP-262B-A17B

Model Details

| Property | Value |
|---|---|
| Base Model | OpenMOSE/Qwen3.5-REAP-212B-A17B |
| Architecture | Mixture-of-Experts (MoE), REAP-pruned |
| Total Parameters | 212B |
| Active Parameters | 17B per forward pass |
| Quantization | INT4 weights, FP16 activations (W4A16) |
| Group Size | 128 |
| Calibration Dataset | NeelNanda/pile-10k |
| Calibration Samples | 64 |
| Sequence Length | 512 |
| Format | AutoRound |
| Quantization Tool | Intel AutoRound |

Quantization Details

AutoRound is Intel's weight quantization method that uses signed gradient descent to find optimal rounding decisions — rather than naively rounding each weight to its nearest 4-bit value, it iteratively adjusts rounding directions to minimize output error across calibration samples. Think of it as the difference between rounding every number on your tax return individually versus rounding them in a coordinated way that keeps the total as accurate as possible.
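The intuition can be made concrete with a toy example. This is not AutoRound's actual optimizer (which uses signed gradient descent over real layers); it's a brute-force illustration of why coordinated rounding beats nearest rounding on output error:

```python
import itertools
import math

# Toy illustration of coordinated rounding (not AutoRound itself):
# quantize four weights to a 0.1 grid, comparing nearest rounding
# against the floor/ceil combination that minimizes *output* error
# for one calibration input.
w = [0.14, 0.14, 0.14, 0.14]          # "full-precision" weights
x = [1.0, 1.0, 1.0, 1.0]              # one calibration sample

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

ref = dot(x, w)

# Nearest rounding: every weight snaps down to 0.1, so the output
# drifts from ~0.56 to 0.4.
nearest = [round(wi, 1) for wi in w]
nearest_err = abs(dot(x, nearest) - ref)

# Coordinated rounding: choose floor or ceil per weight so the dot
# product stays close (here, rounding two weights *up* wins).
choices = [(math.floor(wi * 10) / 10, math.ceil(wi * 10) / 10) for wi in w]
best_err = min(abs(dot(x, combo) - ref) for combo in itertools.product(*choices))

assert best_err < nearest_err
```

AutoRound scales this idea to full layers, learning each weight's up/down decision with gradient descent instead of brute force.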

AutoRound Config:
  bits: 4
  group_size: 128
  format: auto_round
  nsamples: 64
  seqlen: 512
  batch_size: 1
  dataset: NeelNanda/pile-10k

What Gets Quantized (and What Doesn't)

The quantization process handles 542 of the 543 quantizable layers in each transformer block. The one layer deliberately skipped is mlp.shared_expert_gate, a routing gate. Gate weights determine how expert contributions are weighted for a given token, and quantizing them would be like compressing the traffic signals at an intersection: you want the routing decisions to stay precise even when the experts themselves run at reduced precision.
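Conceptually, the exclusion is a name-based skip list over the model's modules. A minimal sketch with hypothetical layer names (this is not AutoRound's internal code):

```python
from fnmatch import fnmatch

# Illustrative module names for one MoE block (hypothetical; the real
# model has hundreds of expert projections per block).
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.mlp.shared_expert_gate",
]

# Keep routing/gate weights at full precision.
SKIP_PATTERNS = ["*shared_expert_gate*"]

to_quantize = [
    name for name in layer_names
    if not any(fnmatch(name, pat) for pat in SKIP_PATTERNS)
]
assert "model.layers.0.mlp.shared_expert_gate" not in to_quantize
```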

How It Was Made

The Fun Part

import torch
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "OpenMOSE/Qwen3.5-REAP-212B-A17B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "OpenMOSE/Qwen3.5-REAP-212B-A17B",
    trust_remote_code=True,
)

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    nsamples=64,
    seqlen=512,
    batch_size=1,
)
ar.quantize_and_save("./Qwen3.5-REAP-212B-A17B-W4A16", format="auto_round")

The Less Fun Part

If you're running transformers>=5.0 (which you need for Qwen 3.5's qwen3_5_moe architecture), you'll hit two issues that require monkey-patching before the imports above:

1. transformers 5.x removed pytorch_utils.Conv1D, which AutoRound still references at import time:

import types, torch, transformers

if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils

This is safe — Conv1D is only used by GPT-2 style models. Qwen uses standard nn.Linear layers throughout, so AutoRound will never actually invoke it. We're just satisfying the import.

2. AutoRound detects Qwen 3.5 as a multimodal model and routes to its MLLM quantization path, which crashes because there's no video processor to load. We need to override the detection at both the module level and the local binding:

import auto_round.utils
import auto_round.autoround

auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

Both patches must come before from auto_round import AutoRound.

Hardware & Runtime

Quantized on 8× NVIDIA H100 80GB HBM3:

| Stage | Time |
|---|---|
| Weight loading | ~51 seconds |
| Calibration caching | ~4 minutes |
| Quantization (60 blocks) | ~4.5 hours |
| Total | ~5 hours |

Peak RAM: ~404GB. Peak VRAM: ~69GB. The model shards comfortably across 8 GPUs with device_map="auto".
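As a sanity check on those numbers, here's a back-of-envelope estimate of the quantized weight footprint, assuming one FP16 scale per 128-weight group and ignoring zero points, activations, and KV cache:

```python
# Rough on-disk/in-memory size of the W4A16 weights alone.
total_params = 212e9
bits_per_weight = 4
group_size = 128
scale_bits = 16          # one FP16 scale per group (assumption)

weight_bytes = total_params * bits_per_weight / 8
scale_bytes = (total_params / group_size) * scale_bits / 8
total_gb = (weight_bytes + scale_bytes) / 1e9   # ~109 GB
```

That's roughly a quarter of the ~424GB a BF16 checkpoint would need, which is consistent with the RAM/VRAM peaks above (quantization still has to stage full-precision blocks).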

Deployment

vLLM

vllm serve Qwen3.5-REAP-212B-A17B-W4A16 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --quantization gptq
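Once the server is up, it speaks the OpenAI-compatible chat completions API. A minimal sketch of the request payload (the endpoint and port below are vLLM's defaults; adjust to your deployment):

```python
import json

# Chat-completions payload for the vLLM OpenAI-compatible server,
# assumed to be listening at http://localhost:8000/v1 (vLLM default).
payload = {
    "model": "Qwen3.5-REAP-212B-A17B-W4A16",
    "messages": [
        {"role": "user", "content": "Summarize REAP expert pruning in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# Send with any HTTP client, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
body = json.dumps(payload)
```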

Citation

@article{jones2025reap,
  title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
  author={Jones and others},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}