# DeepSeekV4Flash Quantization Repository
This repository provides scripts and guidelines for quantizing the DeepSeek V4 Flash model, enabling reduced model size and optimized inference performance.
## Purpose
- Reduce model size (BF16 → Q3/Q4/Q5/Q8, etc.)
- Improve inference speed
- Enable deployment on limited GPU/CPU resources
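As a rough guide: BF16 stores 16 bits per weight, while a 4-bit preset such as Q4_K_M averages roughly 4.5-5 bits per weight, so a Q4 file is typically around 3-3.5x smaller than the BF16 original.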
## Languages
- English (en)
- Vietnamese (vi)
## Base Model
- [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
## Contents
- Model conversion and quantization scripts
- Usage examples for llama.cpp / GGUF workflows
- Common quantization configurations (see the sketch after this list)
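A minimal sketch of producing several common quantization levels from one BF16 GGUF. The quant type names are standard llama.cpp presets; the file paths are illustrative:

```bash
# Quantize the same BF16 GGUF to several common presets
for q in Q3_K_M Q4_K_M Q5_K_M Q8_0; do
  ./llama-quantize models/DeepSeekV4Flash.gguf "models/DeepSeekV4Flash-${q}.gguf" "$q"
done
```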
## Requirements
- Python >= 3.12
- Latest version of llama.cpp (with GGUF support; see the build sketch after this list)
- HuggingFace Transformers (if converting from HF format)
- Sufficient RAM/VRAM depending on model size
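One common way to satisfy the llama.cpp requirement is building from source with CMake. A minimal sketch, assuming a recent llama.cpp checkout (the requirements file name matches the current llama.cpp repo layout and may differ in older versions):

```bash
# Clone and build llama.cpp (produces llama-quantize, llama-cli, etc. under build/bin)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Python dependencies for the HF-to-GGUF conversion script
pip install -r requirements/requirements-convert_hf_to_gguf.txt
```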
## Example Usage
```bash
# Convert a locally downloaded HF checkpoint to GGUF (the model path is a
# positional argument to convert_hf_to_gguf.py), then quantize to Q4_K_M
# with an explicit output file.
python convert_hf_to_gguf.py ./DeepSeek-V4-Flash --outfile models/DeepSeekV4Flash.gguf
./llama-quantize models/DeepSeekV4Flash.gguf models/DeepSeekV4Flash-Q4_K_M.gguf Q4_K_M
```
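A quick smoke test of the quantized file with llama.cpp's CLI (binary name as in recent llama.cpp builds; the prompt and token count are illustrative):

```bash
# Generate 64 tokens from a short prompt to confirm the quantized model loads and runs
./llama-cli -m models/DeepSeekV4Flash-Q4_K_M.gguf -p "Hello, world" -n 64
```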
## Notes
- Quantization may require significant system memory depending on model size
- Some quantization formats may not be compatible with all runtimes or versions
- Always validate output quality after quantization (see the perplexity sketch below)
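One way to validate quality is llama.cpp's perplexity tool, comparing the quantized model against the unquantized GGUF on the same reference corpus (the dataset path here is illustrative):

```bash
# Lower perplexity is better; run the same command against the BF16 GGUF to compare
./llama-perplexity -m models/DeepSeekV4Flash-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```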
## Author
- Email: tecaprovn@gmail.com
- Telegram: https://t.me/tamndx
## License
This repository follows the original DeepSeek model license.
- Base model: Apache 2.0 (DeepSeek)
- Only conversion scripts are included; no weights are modified