ExLlamaV3 quantizations of Qwen3-Coder-30B-A3B-Instruct with tensor-level (L3) optimization and boosted attention layers (5-6 bit). Maximum effort was applied to produce the best possible quantizations, trading time and compute for quality.
Using this measurement.json file and the base quants provided, anyone can generate additional highly optimized quantizations in seconds at any reasonable bpw. All work was done with ExLlamaV3 v0.0.18.
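As a rough sketch of that workflow, the Python below chains the three pipeline stages via subprocess. The script names come from this card, but every flag (`--model`, `--measurement`, `--bpw`, `--strategy`, `--out`) and the strategy-file intermediate are assumptions made for illustration; check each script's `--help` in your ExLlamaV3 checkout for the real interface.

```python
# Hypothetical driver for the exl3 measure -> optimize -> recompile pipeline.
# Script names are from this card; ALL flags below are assumptions, not the
# real ExLlamaV3 CLI -- consult each script's --help before running anything.
import subprocess

BASE = "Qwen3-Coder-30B-A3B-Instruct-exl3-base"  # directory with the base quants
MEASUREMENT = "measurement.json"                  # provided with this repo
TARGET_BPW = "4.90"
OUT = f"Qwen3-Coder-30B-A3B-Instruct-{TARGET_BPW}bpw-h6-opt"

def run(script: str, *args: str) -> None:
    """Run one pipeline stage and fail fast on error."""
    subprocess.run(["python", script, *args], check=True)

# Stage 1 (slow, done once): measure per-tensor quantization error.
# Skippable here because measurement.json is already provided.
# run("measure.py", "--model", BASE, "--out", MEASUREMENT)

# Stage 2 (seconds): choose per-tensor bitrates that best spend the bpw budget.
run("optimize.py", "--measurement", MEASUREMENT, "--bpw", TARGET_BPW,
    "--out", "strategy.json")

# Stage 3 (seconds): assemble the final quant from pre-quantized base tensors.
run("recompile.py", "--model", BASE, "--strategy", "strategy.json", "--out", OUT)
```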
## Optimized
VRAM-targeted quants using exl3's measure.py → optimize.py → recompile.py pipeline.
| Quant | Size | bpw | Target (VRAM @ context) |
|---|---|---|---|
| 3.34bpw-h5-opt | 13 GB | 3.34 | 16 GB @ 128k |
| 4.90bpw-h6-opt | 18 GB | 4.90 | 24 GB @ 262k |
The 4.90bpw quant hit the optimization ceiling: requesting higher bpw targets (5.30, 6.95) produced identical 4.83bpw pre-boost output, indicating that no further beneficial tensor swaps were available.
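The ceiling is a natural property of greedy per-tensor allocation. The toy sketch below is illustrative only, not ExLlamaV3's actual optimize.py: it repeatedly applies whichever tensor upgrade buys the largest measured error reduction per extra bit, and once no remaining upgrade reduces error, any higher bpw budget returns the same allocation.

```python
# Toy greedy bitrate allocator, illustrating the ceiling effect.
# NOT ExLlamaV3's optimizer; the errors are made-up stand-ins for the
# per-tensor measurements that measurement.json holds in the real pipeline.

# Per tensor: candidate (bpw, measured error) options, ascending bpw.
CANDIDATES = {
    "attn.q_proj": [(3.0, 0.080), (4.0, 0.030), (5.0, 0.029)],
    "attn.k_proj": [(3.0, 0.060), (4.0, 0.026), (5.0, 0.025)],
    "mlp.gate":    [(3.0, 0.120), (4.0, 0.041), (5.0, 0.041)],  # 5-bit adds nothing
}

def optimize(budget_bpw: float) -> dict[str, float]:
    """Greedily upgrade tensors by error reduction per extra bit."""
    idx = {name: 0 for name in CANDIDATES}  # start everything at minimum bpw
    n = len(CANDIDATES)
    avg = sum(opts[0][0] for opts in CANDIDATES.values()) / n

    while True:
        # Find the single upgrade with the best error drop per added bit.
        best_name, best_ratio = None, 0.0
        for name, opts in CANDIDATES.items():
            i = idx[name]
            if i + 1 < len(opts):
                dbits = opts[i + 1][0] - opts[i][0]
                derr = opts[i][1] - opts[i + 1][1]
                if derr > 0 and derr / dbits > best_ratio:
                    best_name, best_ratio = name, derr / dbits

        if best_name is None:
            break  # ceiling: no remaining swap reduces error; budget is irrelevant

        i = idx[best_name]
        dbits = CANDIDATES[best_name][i + 1][0] - CANDIDATES[best_name][i][0]
        if avg + dbits / n > budget_bpw:
            break  # budget exhausted before the ceiling

        idx[best_name] = i + 1
        avg += dbits / n

    return {name: CANDIDATES[name][i][0] for name, i in idx.items()}

# Any budget past the ceiling returns the same allocation:
print(optimize(4.90))  # upgrades everything it usefully can
print(optimize(6.95))  # identical output -- the ceiling described above
```

In the real pipeline the (bpw, error) candidates come from measurement.json, which is why re-running optimize.py at a higher target finishes in seconds and simply saturates at the same result.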
## Base
Base model: Qwen/Qwen3-Coder-30B-A3B-Instruct