Inference Optimization
community
AI & ML interests
None defined yet.
Collections
FP8-dynamic, FP8-block, NVFP4, INT4, and INT8 versions of Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking
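The FP8-dynamic naming follows the llm-compressor convention. As a minimal sketch, assuming llm-compressor and the upstream Qwen repo id (both illustrative here), an FP8-dynamic checkpoint can be produced without any calibration data:

```python
# Hedged sketch: produce an FP8-dynamic checkpoint with llm-compressor.
# MODEL_ID and the ignore patterns are assumptions; MoE router gates are
# typically left unquantized, and exact patterns vary by architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: per-channel FP8 weights, dynamic per-token FP8 activations.
# No calibration dataset is needed for this scheme.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],  # assumption: keep router gates in high precision
)
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-Next-80B-A3B-Instruct-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The INT8 (w8a8) variants are typically produced the same way, swapping in a SmoothQuant plus GPTQ recipe and a small calibration set.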
Collection on FP8 Quantization of Weights, Activations and KV Cache (a calibration sketch follows this list)
- inference-optimization/Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Head • 8B • Updated • 1
- inference-optimization/Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Tensor • 8B • Updated
- inference-optimization/Llama-3.1-8B-Instruct-FP8-dynamic-QKV-Cache-FP8-Per-Head • 8B • Updated
- inference-optimization/Llama-3.1-8B-Instruct-FP8-dynamic-QKV-Cache-FP8-Per-Tensor • 8B • Updated
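The FP8-dynamic-QKV-Cache repos pair FP8 weights/activations with a statically calibrated FP8 KV cache. A minimal sketch, assuming llm-compressor's kv_cache_scheme support; the Per-Tensor variant is shown, while the Per-Head repos use per-attention-head scales instead (where the installed compressed-tensors version supports that strategy). Dataset, split, and sample counts are illustrative assumptions:

```python
# Hedged sketch: FP8 weights/activations plus a calibrated per-tensor FP8
# KV cache, in the style of the repos above.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",      # FP8 weights + dynamic FP8 activations
    ignore=["lm_head"],
    kv_cache_scheme={          # static FP8 K/V scales, one scale per tensor
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
)

# Unlike plain FP8_DYNAMIC, static KV-cache scales need calibration data.
oneshot(
    model=model,
    recipe=recipe,
    dataset="ultrachat_200k",              # assumption: any small chat set works
    splits={"calibration": "train_sft[:512]"},
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = "Llama-3.1-8B-Instruct-FP8-dynamic-QKV-Cache-FP8-Per-Tensor"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

At serve time, vLLM can run the quantized cache by passing kv_cache_dtype="fp8" when loading the checkpoint.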
FP8-dynamic, FP8-block, NVFP4, and INT4 versions of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B (a w4a16 sketch follows this list)
- inference-optimization/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 • Text Generation • 32B • Updated • 6
- inference-optimization/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 • 18B • Updated • 152
- inference-optimization/NVIDIA-Nemotron-3-Nano-30B-A3B-quantized.w4a16 • 6B • Updated • 40
- inference-optimization/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8-dynamic • 32B • Updated • 214
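The quantized.w4a16 repo name matches llm-compressor's GPTQ W4A16 output convention. A minimal sketch, assuming the upstream nvidia repo id and an illustrative calibration set:

```python
# Hedged sketch: GPTQ w4a16 (4-bit weights, 16-bit activations).
# trust_remote_code, dataset, and sample counts are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# W4A16: group-quantized INT4 weights; activations stay in 16-bit.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",   # assumption: any small calibration set works
    max_seq_length=2048,
    num_calibration_samples=256,
)

SAVE_DIR = "NVIDIA-Nemotron-3-Nano-30B-A3B-quantized.w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```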
Collection of Mixed Precision LLaMA and Qwen Models (a serving sketch follows this list)
- inference-optimization/Llama-3.1-8B-Instruct-Mixed-NVFP4-FP8_BLOCK-out_proj-all • 5B • Updated
- inference-optimization/Llama-3.1-8B-Instruct-Mixed-NVFP4-FP8_BLOCK-qkv_proj-all • 5B • Updated
- inference-optimization/Llama-3.1-8B-Instruct-Mixed-NVFP4-FP8_BLOCK-down_proj-all • 6B • Updated
- inference-optimization/Llama-3.1-8B-Instruct-Mixed-NVFP4-FP8_BLOCK-gate_up_proj-all • 7B • Updated
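The repo names suggest per-projection mixed precision: one projection family (out_proj, qkv_proj, down_proj, or gate_up_proj) quantized with one scheme and the remaining linears with the other (NVFP4 vs. FP8-block). The layer-to-scheme mapping is encoded in each repo's compressed-tensors config, so serving needs no extra flags. A minimal sketch, assuming a vLLM build with NVFP4-capable kernels and picking one repo for illustration:

```python
# Hedged sketch: serve a mixed NVFP4 / FP8-block checkpoint with vLLM.
# vLLM reads the compressed-tensors quantization config from the repo;
# NVFP4 kernels require a sufficiently recent GPU and vLLM build.
from vllm import LLM, SamplingParams

llm = LLM(model="inference-optimization/Llama-3.1-8B-Instruct-Mixed-NVFP4-FP8_BLOCK-down_proj-all")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Briefly explain FP8 block quantization."], params)
print(outputs[0].outputs[0].text)
```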
FP8-block, FP8-dynamic, NVFP4, w4a16, and w8a8 quantized versions of ibm-granite/granite-4.0-h-small and ibm-granite/granite-4.0-h-tiny
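For the NVFP4 variants in this collection, a minimal production sketch, assuming llm-compressor's NVFP4 preset scheme, which needs a small calibration pass for the static activation global scales. The repo id and dataset are illustrative, and the granite-4.0-h hybrid architecture may need extra ignore patterns for its non-attention layers:

```python
# Hedged sketch: NVFP4 (FP4 weights/activations with FP8 per-group scales).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ibm-granite/granite-4.0-h-tiny"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",   # assumption: any small calibration set works
    max_seq_length=2048,
    num_calibration_samples=256,
)

SAVE_DIR = "granite-4.0-h-tiny-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```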