These are NOT actual AWQ-quantized models.

#2 by cai-cai - opened

Heads up! Despite the "AWQ" tag in the title, the config.json reveals these models are using standard compressed-tensors (W4A16) rather than the AWQ (Activation-aware Weight Quantization) method. Real AWQ requires an activation calibration process and specific scaling factors, which are missing here. This is misleading for users looking for actual AWQ kernels.

cyankiwi org

AWQ is the algorithm used to optimize this model, whereas compressed-tensors is the format (i.e., weight_packed, weight_scale, weight_zero_point, weight_shape) in which the model is saved after quantization.
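For anyone who wants to verify this themselves, here is a minimal sketch (the repo id is a placeholder, not a real model) that downloads a checkpoint's config.json and prints its quantization_config; compressed-tensors checkpoints report quant_method "compressed-tensors", while AutoAWQ-format checkpoints report "awq":

```python
import json

from huggingface_hub import hf_hub_download

# Placeholder repo id -- substitute the model you want to inspect.
config_path = hf_hub_download("some-org/some-model-AWQ", "config.json")

with open(config_path) as f:
    config = json.load(f)

# A compressed-tensors checkpoint reports quant_method "compressed-tensors"
# (with the scheme, e.g. W4A16, described under config_groups), regardless
# of which algorithm -- AWQ, GPTQ, RTN -- produced the weights.
qc = config.get("quantization_config", {})
print(qc.get("quant_method"))
print(json.dumps(qc, indent=2))
```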

As for the kernels used for inference, vLLM uses the same Marlin kernel for both the compressed-tensors and AutoAWQ formats, just via different code paths.
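As a rough illustration (the model id below is a placeholder), vLLM auto-detects the quantization method from config.json, and the kernel backend it selects is printed in the startup logs, so you can confirm which route it took:

```python
from vllm import LLM, SamplingParams

# Placeholder model id -- substitute the checkpoint you want to test.
# vLLM reads quantization_config from config.json on its own; the startup
# logs show which kernel backend was chosen (Marlin on supported GPUs,
# whether the checkpoint is compressed-tensors or AutoAWQ format).
llm = LLM(model="some-org/some-model-AWQ")

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```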

https://github.com/vllm-project/llm-compressor/blob/main/examples/awq/README.md
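For reference, the linked example boils down to something like the sketch below (the model id and calibration settings are placeholders, and the exact API can shift between llm-compressor versions); the calibration dataset is the activation-aware step that distinguishes AWQ from plain round-to-nearest W4A16:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "some-org/some-base-model"  # placeholder

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# The AWQ modifier runs calibration forward passes and searches for
# per-channel scales that minimize activation-aware quantization error.
recipe = [AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])]

oneshot(
    model=model,
    dataset="open_platypus",        # small calibration set (placeholder choice)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

# The result is saved in compressed-tensors format -- which is exactly why
# an AWQ-optimized model can still show compressed-tensors in config.json.
model.save_pretrained("model-awq-w4a16", save_compressed=True)
tokenizer.save_pretrained("model-awq-w4a16")
```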

This model can be run with hybrid (CPU+GPU) inference using Lvllm: https://github.com/guqiong96/Lvllm/blob/main/README.md

I have used cpatonn's AWQ-4bit variants for about 7 to 8 months now, and they are definitely quantized. I have built a complete sovereign AI infrastructure using these models, with no cloud dependency at all. I have attempted to serve numerous un-quantized models like mistral-small-4-119b or qwen3.5-122b on L40S-180 GPU instances (dual 48 GB cards); the model has to be properly configured in order to shard across multiple GPUs. This is where you come to get guaranteed working models.

Hopefully he has time to quantize the new nemotron-3-nano-omni-30-reasoning model (these names are just getting way too long). I had to use a random quantized model from a reputable user/repo/space (drawais), and it works, including all modalities. It's an any-to-any model. I run all my models via vLLM and docker-compose.
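In case it helps anyone with the multi-GPU sharding part: a minimal sketch (placeholder model id) of tensor-parallel loading through vLLM's Python API, which matches passing --tensor-parallel-size 2 to vllm serve inside a docker-compose service:

```python
from vllm import LLM, SamplingParams

# Placeholder model id -- substitute your quantized checkpoint.
# tensor_parallel_size=2 shards the weights across two GPUs (e.g. a dual
# 48 GB L40S instance); gpu_memory_utilization leaves some headroom for
# the KV cache and CUDA graphs.
llm = LLM(
    model="some-org/some-large-model-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)

print(llm.generate(["Hi"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```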
