DeepSeekV4Flash Quantization Repository

This repository provides scripts and guidelines for quantizing the DeepSeek V4 Flash model, reducing model size and improving inference performance.

🚀 Purpose

  • Reduce model size (BF16 → Q3/Q4/Q5/Q8, etc.)
  • Improve inference speed
  • Enable deployment on limited GPU/CPU resources

๐ŸŒ Languages

  • English (en)
  • Vietnamese (vi)

🧠 Base Model

  • deepseek-ai/DeepSeek-V4-Flash

📦 Contents

  • Model conversion and quantization scripts
  • Usage examples for llama.cpp / GGUF workflows
  • Common quantization configurations (see the quick reference below)
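
For quick orientation, these are the llama.cpp quantization types most often used in practice; the trade-off notes are rough rules of thumb, not measurements for this model:

# Common llama.cpp quantization types (rules of thumb, not benchmarked here):
#   Q3_K_M  - smallest files, most quality loss
#   Q4_K_M  - balanced size/quality, a common default
#   Q5_K_M  - higher quality, larger files
#   Q8_0    - near original quality, largest quantized size

# Example: produce a Q5_K_M variant once the BF16 GGUF exists (see Example Usage below)
./llama-quantize models/DeepSeekV4Flash.gguf models/DeepSeekV4Flash-Q5_K_M.gguf Q5_K_M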

๐Ÿ› ๏ธ Requirements

  • Python >= 3.12
  • A recent build of llama.cpp (with GGUF support; a build sketch follows this list)
  • HuggingFace Transformers (if converting from HF format)
  • Sufficient RAM/VRAM depending on model size
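
A minimal setup sketch, assuming a CMake build of llama.cpp from the upstream repository and the Python requirements it ships for the conversion script (paths and options are illustrative):

# Build llama.cpp (provides llama-quantize, llama-cli, llama-perplexity)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Python dependencies for convert_hf_to_gguf.py (pulls in transformers and related packages)
pip install -r requirements.txt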

โš™๏ธ Example Usage

# convert_hf_to_gguf.py takes the local checkpoint directory as a positional argument
python convert_hf_to_gguf.py DeepSeek-V4-Flash --outfile models/DeepSeekV4Flash.gguf --outtype bf16

./llama-quantize models/DeepSeekV4Flash.gguf models/DeepSeekV4Flash-Q4_K_M.gguf Q4_K_M
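
A quick smoke test of the quantized file with llama-cli (prompt and token count are illustrative):

./llama-cli -m models/DeepSeekV4Flash-Q4_K_M.gguf -p "Hello from DeepSeek V4 Flash" -n 64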

📌 Notes

  • Quantization may require significant system memory depending on model size
  • Some quantization formats may not be compatible with all runtimes or versions
  • Always validate output quality after quantization (see the perplexity check below)
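
One way to sanity-check quality is to compare perplexity before and after quantization with llama.cpp's llama-perplexity tool (the evaluation text file is illustrative; any representative corpus works):

# A small perplexity increase over the BF16 baseline is expected; a large jump signals a problem
./llama-perplexity -m models/DeepSeekV4Flash.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m models/DeepSeekV4Flash-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw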

👤 Author


📄 License

This repository follows the original DeepSeek model license.

  • Base model: Apache 2.0 (DeepSeek)
  • Only conversion scripts included, no weight modification