---
license: apache-2.0
base_model: ibm-granite/granite-docling-258M
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
- gguf
- llama.cpp
- quantized
language:
- en
pipeline_tag: image-text-to-text
library_name: gguf
quantized_by: bowserj
---

# granite-docling-258M-GGUF

This repository contains GGUF format quantized versions of [ibm-granite/granite-docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M) for use with [llama.cpp](https://github.com/ggerganov/llama.cpp).

Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. This GGUF version enables fast CPU and GPU inference using llama.cpp, making it ideal for edge deployment and resource-constrained environments.

## Model Summary

- **Original Model**: [ibm-granite/granite-docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M)
- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Architecture**: Idefics3 (SigLIP vision encoder + Granite 165M LLM)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Format**: GGUF (for llama.cpp)
- **Quantizations**: F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M

## Files Included

This repository contains multiple quantization levels of the text model, plus the vision encoder/projector. You need **one text model file** + **the mmproj file** for inference.

### Text Model Files (choose one):

| Filename | Quant | Size | Use Case |
|----------|-------|------|----------|
| granite-docling-258M-Q4_K_M.gguf | Q4_K_M | 133 MB | Smallest, good quality-size balance, recommended for most users |
| granite-docling-258M-Q5_K_M.gguf | Q5_K_M | 139 MB | Better quality, still compact |
| granite-docling-258M-Q6_K.gguf | Q6_K | 164 MB | Higher quality |
| granite-docling-258M-Q8_0.gguf | Q8_0 | 170 MB | Very high quality |
| granite-docling-258M-f16.gguf | F16 | 317 MB | Highest quality, original precision |

### Vision/Projector File (required):

| Filename | Size | Notes |
|----------|------|-------|
| mmproj-granite-docling-258M-f16.gguf | 182 MB | Vision encoder (SigLIP) - required for all quantizations |

**Note**: The mmproj file is kept at F16 precision to maintain vision quality.

## Getting Started with llama.cpp

### Prerequisites

1. Build llama.cpp with multimodal support:

   ```bash
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   cmake -B build -DGGML_CUDA=ON # For CUDA support
   cmake --build build --config Release -j
   ```

2. Download the GGUF files from this repository (see the example below).
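With the Hugging Face CLI you can fetch one text model plus the mmproj file directly. A minimal sketch; the repo id below is an assumption based on this card's `quantized_by` field, so substitute the id of the repository you are actually viewing:

```bash
# Install the Hugging Face CLI if you don't have it
pip install -U "huggingface_hub[cli]"

# Fetch one text model quantization and the required mmproj file
# (repo id assumed -- adjust to this repository's actual id)
huggingface-cli download bowserj/granite-docling-258M-GGUF \
  granite-docling-258M-Q4_K_M.gguf \
  mmproj-granite-docling-258M-f16.gguf \
  --local-dir .
```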
### Basic Usage

```bash
# Using Q4_K_M (recommended for most users)
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "{%- for message in messages -%}{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}{%- if message['content'] is string -%}{{- message['content'] -}}{%- else -%}{%- for part in message['content'] -%}{%- if part['type'] == 'text' -%}{{- part['text'] -}}{%- elif part['type'] == 'image' -%}{{- '' -}}{%- endif -%}{%- endfor -%}{%- endif -%}{{- '<|end_of_text|>\n' -}}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<|start_of_role|>assistant' -}}{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}{{- '<|end_of_role|>' -}}{%- endif -%}" \
  -p "Convert this page to docling." \
  -n 512 \
  --temp 0.1 \
  -ngl 99

# Or use any other quantization (Q5_K_M, Q6_K, Q8_0, F16):
# just replace the -m parameter with your chosen model file.
```

### Simplified Usage (Save Chat Template)

Save the chat template to a file for easier reuse:

```bash
# Save chat template
cat > granite_docling_template.jinja << 'EOF'
{%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
EOF

# Then use it:
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "$(cat granite_docling_template.jinja)" \
  -p "Convert this page to docling." \
  -n 512 \
  -ngl 99
```
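Recent llama.cpp builds can also serve the model over an OpenAI-compatible HTTP API via `llama-server`. A minimal sketch, assuming your build's `llama-server` includes multimodal (mtmd) support and the `--jinja`/`--chat-template-file` options; check `llama-server --help` for your version:

```bash
# Start the server with the text model and the vision projector
./llama.cpp/build/bin/llama-server \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --jinja \
  --chat-template-file granite_docling_template.jinja \
  -ngl 99 \
  --port 8080
```

Then, from another shell, send a page image as a base64 data URI to the chat-completions endpoint:

```bash
# GNU coreutils base64; on macOS use `base64 -i document.png`
IMG_B64=$(base64 -w0 document.png)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}},
      {"type": "text", "text": "Convert this page to docling."}
    ]
  }],
  "max_tokens": 512,
  "temperature": 0.1
}
EOF
```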
### Choosing a Quantization

**Recommended for most users:** Q4_K_M - best balance of size and quality.

| Quantization | Total Size | Quality | Speed | RAM Usage |
|--------------|------------|---------|-------|-----------|
| Q4_K_M | 315 MB | Good | Fastest | ~400 MB |
| Q5_K_M | 321 MB | Better | Fast | ~420 MB |
| Q6_K | 346 MB | High | Medium | ~450 MB |
| Q8_0 | 352 MB | Very High | Medium | ~480 MB |
| F16 | 499 MB | Highest | Slower | ~650 MB |

*Total size = text model + mmproj (182 MB)*

### Example Output

The model outputs DocTags, an XML-like markup that pairs each document element with `<loc_...>` bounding-box tokens for precise layout information (coordinate values elided here):

```
<doctag><section_header_level_1><loc_...><loc_...><loc_...><loc_...>ENERGY BUDGET OF WASP-121 b</section_header_level_1>
<text><loc_...><loc_...><loc_...><loc_...>while the kernel weights are structured as...</text>
...
</doctag>
```

## Supported Instructions

| Description | Instruction |
|-------------|-------------|
| **Full conversion** | Convert this page to docling. |
| **Chart** | Convert chart to table. |
| **Formula** | Convert formula to LaTeX. |
| **Code** | Convert code to text. |
| **Table** | Convert table to OTSL. |
| **OCR region** | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` |

## Performance

Tested on an NVIDIA RTX 4070 Ti SUPER with CUDA:

- **Image encoding**: ~6-8 ms per slice (17 slices total for 512x512 images)
- **Prompt processing**: ~1305 tokens/sec
- **Generation speed**: ~706 tokens/sec
- **Total memory**: ~600 MB GPU (with all layers offloaded)

## Model Architecture

The architecture consists of:

1. **Vision encoder**: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512)
   - 768 hidden dimensions, 12 layers, 12 heads
   - 512x512 image input with 16x16 patches
   - Image splitting: 4x4 grid + global view = 17 frames
2. **Vision-language connector**: Pixel shuffle projector (Idefics3 style)
   - Projects vision features to the LLM embedding space
   - Scale factor: 4
3. **Large language model**: Granite 165M
   - 30 layers, 9 attention heads, 3 KV heads
   - 576 hidden dimensions
   - Context length: 8192 tokens
   - Vocabulary: 100,352 tokens (GPT-2 tokenizer with DocTags extensions)

## Conversion Details

These GGUF files were converted from the original model using llama.cpp's conversion tools:

1. **Vision/Projector conversion** (F16):

   ```bash
   python convert_hf_to_gguf.py granite-docling-258M \
     --mmproj --outtype f16 \
     --outfile mmproj-granite-docling-258M-f16.gguf
   ```

2. **Text model conversion**:
   - Extracted the text model weights from the full VLM
   - Converted to GGUF with F16 precision
   - Preserved all special tokens and tokenizer configuration

3. **Quantization**:

   ```bash
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q4_K_M.gguf Q4_K_M
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q5_K_M.gguf Q5_K_M
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q6_K.gguf Q6_K
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q8_0.gguf Q8_0
   ```

## Use Cases

Granite-Docling excels at:

- 📄 **Document OCR**: Extract text from scanned documents with layout preservation
- 📊 **Table Recognition**: Convert tables to structured formats (OTSL/HTML)
- 🔢 **Equation Recognition**: Extract LaTeX from mathematical formulas
- 💻 **Code Recognition**: Extract code snippets from documents
- 📈 **Chart-to-Table**: Convert charts and graphs to structured data
- 🗂️ **Layout Analysis**: Understand document structure (headers, footers, sections)

## Limitations

- **Not for general image understanding**: For general vision tasks, use the [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800)
- **Document-focused**: Optimized for document pages, not natural images
- **English primary**: Best performance on English documents (experimental support for Japanese, Arabic, and Chinese)
- **Potential hallucination**: Like all smaller VLMs, it may hallucinate in complex scenarios

## Responsible Use

This model is designed for document understanding and should be used responsibly:

- Verify outputs for critical applications
- Be aware of potential biases in document interpretation
- Do not use for autonomous decision-making without human oversight
- Consider using it with [Granite Guardian](https://huggingface.co/collections/ibm-granite/granite-guardian-67b57364c9803552c95b1923) for additional safety

## Citation

If you use this model, please cite:

```bibtex
@misc{granite-docling-2025,
  title={Granite Docling: Efficient Document Conversion with Vision-Language Models},
  author={IBM Research},
  year={2025},
  url={https://huggingface.co/ibm-granite/granite-docling-258M}
}
```

## Resources

- 📚 [Original Model](https://huggingface.co/ibm-granite/granite-docling-258M)
- 🐥 [Docling Library](https://github.com/docling-project/docling)
- 🚀 [llama.cpp](https://github.com/ggerganov/llama.cpp)
- 📖 [Docling Documentation](https://docling-project.github.io/docling/)
- 💡 [Granite Resources](https://ibm.biz/granite-learning-resources)

## Acknowledgments

- Original model by IBM Research
- GGUF conversion using llama.cpp's conversion tools
- Thanks to the llama.cpp team for multimodal support