ExLlamaV3 quantizations of Qwen3-Coder-30B-A3B-Instruct with tensor-level (L3) optimization and boosted attention layers (5-6 bit). Maximum effort was applied to achieve the best possible quantizations, at the expense of time and compute.

Using this measurement.json file and the base quants provided, anyone can produce additional highly optimized quantizations at any reasonable bpw in seconds (see the sketch below). All work was done with ExLlamaV3 v0.0.18.
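Why seconds rather than hours: recompilation only reassembles tensors that were already quantized once. Here is a minimal, hypothetical sketch of the idea — the data layout and names are assumptions for illustration, not exl3's actual on-disk format:

```python
# Hypothetical sketch of "recompiling" a mixed-bpw quant from base quants.
# Assumed layout (not exl3's real format): base_quants[bpw][tensor_name]
# holds an already-quantized tensor, and plan[tensor_name] is the per-tensor
# bitrate chosen by the optimizer from measurement.json data.

def recompile(base_quants, plan):
    """Assemble a mixed-bpw model by copying tensors from base quants."""
    # No re-quantization happens here, which is why this step takes
    # seconds: it is pure selection and copying.
    return {name: base_quants[bpw][name] for name, bpw in plan.items()}
```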

## Optimized

VRAM-targeted quants using exl3's measure.py → optimize.py → recompile.py pipeline.

| Quant          | Size  | bpw  | Target        |
|----------------|-------|------|---------------|
| 3.34bpw-h5-opt | 13 GB | 3.34 | 16 GB @ 128k  |
| 4.90bpw-h6-opt | 18 GB | 4.90 | 24 GB @ 262k  |
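As a rough sanity check on the sizes above, weight storage scales as parameter count × bpw / 8. The snippet below assumes ~30.5B total parameters (inferred from the model name, not stated in this repo); small discrepancies against the table come from the separately quantized head/embeddings and rounding.

```python
# Rough size check: bytes ~= n_params * bpw / 8.
# The 30.5e9 parameter count is an assumption based on the model name.
N_PARAMS = 30.5e9

for name, bpw in [("3.34bpw-h5-opt", 3.34), ("4.90bpw-h6-opt", 4.90)]:
    gb = N_PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{gb:.1f} GB")  # ~12.7 GB and ~18.7 GB, in line with the table
```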

The 4.90bpw quant hit the optimization ceiling: requesting higher targets (5.30, 6.95 bpw) produced identical 4.83bpw pre-boost output, indicating that no further beneficial tensor swaps were available.
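For intuition on why the optimizer plateaus, here is a minimal, hypothetical greedy tensor-swap loop — an illustration of the ceiling, not ExLlamaV3's actual optimize.py. Each tensor can be promoted to a higher bitrate, and the loop repeatedly takes the best error-reduction-per-extra-bit swap until the bit budget is spent or no remaining swap reduces error; in the latter case, raising the budget cannot change the result.

```python
# Hypothetical greedy tensor-swap optimizer (illustration only).
# gain[t][b] would come from measurement.json-style data: the error
# reduction from storing tensor t at b bits; cost[t][b] is its bit cost.

def optimize(tensors, budget_bits, gain, cost):
    """Greedily promote tensors while swaps still reduce error."""
    plan = {t: min(cost[t]) for t in tensors}        # start at cheapest bpw
    spent = sum(cost[t][plan[t]] for t in tensors)
    while True:
        best = None
        for t in tensors:
            for b in cost[t]:
                if b <= plan[t]:
                    continue
                extra = cost[t][b] - cost[t][plan[t]]
                value = gain[t][b] - gain[t][plan[t]]
                if value <= 0 or extra <= 0 or spent + extra > budget_bits:
                    continue
                if best is None or value / extra > best[0]:
                    best = (value / extra, t, b, extra)
        if best is None:
            # Either the budget is exhausted or no remaining swap reduces
            # error. In the second case a larger budget returns the same
            # plan -- the "ceiling" described above.
            return plan
        _, t, b, extra = best
        plan[t], spent = b, spent + extra
```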

## Base

| Quant      | Size  | bpw |
|------------|-------|-----|
| 2.0bpw-h6  | 8 GB  | 2.0 |
| 3.0bpw-h6  | 12 GB | 3.0 |
| 4.0bpw-h6  | 15 GB | 4.0 |
| 5.0bpw-h6  | 19 GB | 5.0 |
| 6.0bpw-h6  | 22 GB | 6.0 |
| 7.0bpw-h6  | 26 GB | 7.0 |