DeepSeekV4Flash Quantization Repository

This repository provides scripts and guidelines for quantizing the DeepSeek V4 Flash model, reducing model size and improving inference performance.

🚀 Purpose

  • Reduce model size (BF16 → Q3/Q4/Q5/Q8, etc.)
  • Improve inference speed
  • Enable deployment on limited GPU/CPU resources

๐ŸŒ Languages

  • English (en)
  • Vietnamese (vi)

🧠 Base Model

  • deepseek-ai/DeepSeek-V4-Flash

📦 Contents

  • Model conversion and quantization scripts
  • Usage examples for llama.cpp / GGUF workflows
  • Common quantization configurations (see the quick reference below)
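
For quick orientation, these are the llama.cpp quantization types most often used in practice; the trade-off notes are rough rules of thumb, not measurements for this model:

# Common llama.cpp quantization types (rules of thumb, not benchmarked here):
#   Q3_K_M  - smallest files, most quality loss
#   Q4_K_M  - balanced size/quality, a common default
#   Q5_K_M  - higher quality, larger files
#   Q8_0    - near original quality, largest quantized size

# Example: produce a Q5_K_M variant once the BF16 GGUF exists (see Example Usage below)
./llama-quantize models/DeepSeekV4Flash.gguf models/DeepSeekV4Flash-Q5_K_M.gguf Q5_K_M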

๐Ÿ› ๏ธ Requirements

  • Python >= 3.12
  • A recent build of llama.cpp (with GGUF support; a build sketch follows this list)
  • HuggingFace Transformers (if converting from HF format)
  • Sufficient RAM/VRAM depending on model size
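
A minimal setup sketch, assuming a CMake build of llama.cpp from the upstream repository and the Python requirements it ships for the conversion script (paths and options are illustrative):

# Build llama.cpp (provides llama-quantize, llama-cli, llama-perplexity)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Python dependencies for convert_hf_to_gguf.py (pulls in transformers and related packages)
pip install -r requirements.txt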

โš™๏ธ Example Usage

# convert_hf_to_gguf.py takes the local checkpoint directory as a positional argument
python convert_hf_to_gguf.py DeepSeek-V4-Flash --outfile models/DeepSeekV4Flash.gguf --outtype bf16

./llama-quantize models/DeepSeekV4Flash.gguf models/DeepSeekV4Flash-Q4_K_M.gguf Q4_K_M
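
A quick smoke test of the quantized file with llama-cli (prompt and token count are illustrative):

./llama-cli -m models/DeepSeekV4Flash-Q4_K_M.gguf -p "Hello from DeepSeek V4 Flash" -n 64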

📌 Notes

  • Quantization may require significant system memory depending on model size
  • Some quantization formats may not be compatible with all runtimes or versions
  • Always validate output quality after quantization (see the perplexity check below)
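
One way to sanity-check quality is to compare perplexity before and after quantization with llama.cpp's llama-perplexity tool (the evaluation text file is illustrative; any representative corpus works):

# A small perplexity increase over the BF16 baseline is expected; a large jump signals a problem
./llama-perplexity -m models/DeepSeekV4Flash.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m models/DeepSeekV4Flash-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw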

👤 Author


📄 License

This repository follows the original DeepSeek model license.

  • Base model: Apache 2.0 (DeepSeek)
  • Only conversion scripts included, no weight modification