---
license: apache-2.0
base_model: ibm-granite/granite-docling-258M
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
- gguf
- llama.cpp
- quantized
language:
- en
pipeline_tag: image-text-to-text
library_name: gguf
quantized_by: bowserj
---

# granite-docling-258M-GGUF

This repository contains GGUF format quantized versions of [ibm-granite/granite-docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M) for use with [llama.cpp](https://github.com/ggerganov/llama.cpp).

Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. This GGUF version enables fast CPU and GPU inference using llama.cpp, making it ideal for edge deployment and resource-constrained environments.

## Model Summary

- **Original Model**: [ibm-granite/granite-docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M)
- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Architecture**: Idefics3 (SigLIP vision encoder + Granite 165M LLM)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Format**: GGUF (for llama.cpp)
- **Quantizations**: F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M

## Files Included

This repository contains multiple quantization levels of the text model, plus the vision encoder/projector. You need **one text model file** + **the mmproj file** for inference.

### Text Model Files (choose one):

| Filename | Quant | Size | Use Case |
|----------|-------|------|----------|
| granite-docling-258M-Q4_K_M.gguf | Q4_K_M | 133 MB | Smallest, good quality-size balance, recommended for most users |
| granite-docling-258M-Q5_K_M.gguf | Q5_K_M | 139 MB | Better quality, still compact |
| granite-docling-258M-Q6_K.gguf | Q6_K | 164 MB | Higher quality |
| granite-docling-258M-Q8_0.gguf | Q8_0 | 170 MB | Very high quality |
| granite-docling-258M-f16.gguf | F16 | 317 MB | Highest quality, original precision |

### Vision/Projector File (required):

| Filename | Size | Notes |
|----------|------|-------|
| mmproj-granite-docling-258M-f16.gguf | 182 MB | Vision encoder (SigLIP) - required for all quantizations |

**Note**: The mmproj file is kept at F16 precision to maintain vision quality.

## Getting Started with llama.cpp

### Prerequisites

1. Build llama.cpp with multimodal support:

   ```bash
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   cmake -B build -DGGML_CUDA=ON # For CUDA support
   cmake --build build --config Release -j
   ```

2. Download the GGUF files from this repository (see the example below).
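With the Hugging Face CLI you can fetch one text model plus the mmproj file directly. A minimal sketch; the repo id below is an assumption based on this card's `quantized_by` field, so substitute the id of the repository you are actually viewing:

```bash
# Install the Hugging Face CLI if you don't have it
pip install -U "huggingface_hub[cli]"

# Fetch one text model quantization and the required mmproj file
# (repo id assumed -- adjust to this repository's actual id)
huggingface-cli download bowserj/granite-docling-258M-GGUF \
  granite-docling-258M-Q4_K_M.gguf \
  mmproj-granite-docling-258M-f16.gguf \
  --local-dir .
```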
### Basic Usage

```bash
# Using Q4_K_M (recommended for most users)
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "{%- for message in messages -%}{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}{%- if message['content'] is string -%}{{- message['content'] -}}{%- else -%}{%- for part in message['content'] -%}{%- if part['type'] == 'text' -%}{{- part['text'] -}}{%- elif part['type'] == 'image' -%}{{- '' -}}{%- endif -%}{%- endfor -%}{%- endif -%}{{- '<|end_of_text|>\n' -}}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<|start_of_role|>assistant' -}}{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}{{- '<|end_of_role|>' -}}{%- endif -%}" \
  -p "Convert this page to docling." \
  -n 512 \
  --temp 0.1 \
  -ngl 99

# Or use any other quantization (Q5_K_M, Q6_K, Q8_0, F16):
# just replace the -m parameter with your chosen model file.
```

### Simplified Usage (Save Chat Template)

Save the chat template to a file for easier reuse:

```bash
# Save chat template
cat > granite_docling_template.jinja << 'EOF'
{%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
EOF

# Then use it:
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "$(cat granite_docling_template.jinja)" \
  -p "Convert this page to docling." \
  -n 512 \
  -ngl 99
```
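Recent llama.cpp builds can also serve the model over an OpenAI-compatible HTTP API via `llama-server`. A minimal sketch, assuming your build's `llama-server` includes multimodal (mtmd) support and the `--jinja`/`--chat-template-file` options; check `llama-server --help` for your version:

```bash
# Start the server with the text model and the vision projector
./llama.cpp/build/bin/llama-server \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --jinja \
  --chat-template-file granite_docling_template.jinja \
  -ngl 99 \
  --port 8080
```

Then, from another shell, send a page image as a base64 data URI to the chat-completions endpoint:

```bash
# GNU coreutils base64; on macOS use `base64 -i document.png`
IMG_B64=$(base64 -w0 document.png)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}},
      {"type": "text", "text": "Convert this page to docling."}
    ]
  }],
  "max_tokens": 512,
  "temperature": 0.1
}
EOF
```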
### Choosing a Quantization

**Recommended for most users:** Q4_K_M - best balance of size and quality.

| Quantization | Total Size | Quality | Speed | RAM Usage |
|--------------|------------|---------|-------|-----------|
| Q4_K_M | 315 MB | Good | Fastest | ~400 MB |
| Q5_K_M | 321 MB | Better | Fast | ~420 MB |
| Q6_K | 346 MB | High | Medium | ~450 MB |
| Q8_0 | 352 MB | Very High | Medium | ~480 MB |
| F16 | 499 MB | Highest | Slower | ~650 MB |

*Total size = text model + mmproj (182 MB)*

### Example Output

The model outputs DocTags, an XML-like markup that pairs each document element with `<loc_...>` bounding-box tokens for precise layout information (coordinate values elided here):

```
<doctag><section_header_level_1><loc_...><loc_...><loc_...><loc_...>ENERGY BUDGET OF WASP-121 b</section_header_level_1>
<text><loc_...><loc_...><loc_...><loc_...>while the kernel weights are structured as...</text>
...
</doctag>
```

## Supported Instructions

| Description | Instruction |
|-------------|-------------|
| **Full conversion** | Convert this page to docling. |
| **Chart** | Convert chart to table. |
| **Formula** | Convert formula to LaTeX. |
| **Code** | Convert code to text. |
| **Table** | Convert table to OTSL. |
| **OCR region** | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` |

## Performance

Tested on an NVIDIA RTX 4070 Ti SUPER with CUDA:

- **Image encoding**: ~6-8 ms per slice (17 slices total for 512x512 images)
- **Prompt processing**: ~1305 tokens/sec
- **Generation speed**: ~706 tokens/sec
- **Total memory**: ~600 MB GPU (with all layers offloaded)

## Model Architecture

The architecture consists of:

1. **Vision encoder**: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512)
   - 768 hidden dimensions, 12 layers, 12 heads
   - 512x512 image input with 16x16 patches
   - Image splitting: 4x4 grid + global view = 17 frames
2. **Vision-language connector**: Pixel shuffle projector (Idefics3 style)
   - Projects vision features to the LLM embedding space
   - Scale factor: 4
3. **Large language model**: Granite 165M
   - 30 layers, 9 attention heads, 3 KV heads
   - 576 hidden dimensions
   - Context length: 8192 tokens
   - Vocabulary: 100,352 tokens (GPT-2 tokenizer with DocTags extensions)

## Conversion Details

These GGUF files were converted from the original model using llama.cpp's conversion tools:

1. **Vision/Projector conversion** (F16):

   ```bash
   python convert_hf_to_gguf.py granite-docling-258M \
     --mmproj --outtype f16 \
     --outfile mmproj-granite-docling-258M-f16.gguf
   ```

2. **Text model conversion**:
   - Extracted the text model weights from the full VLM
   - Converted to GGUF with F16 precision
   - Preserved all special tokens and tokenizer configuration

3. **Quantization**:

   ```bash
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q4_K_M.gguf Q4_K_M
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q5_K_M.gguf Q5_K_M
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q6_K.gguf Q6_K
   llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q8_0.gguf Q8_0
   ```

## Use Cases

Granite-Docling excels at:

- 📄 **Document OCR**: Extract text from scanned documents with layout preservation
- 📊 **Table Recognition**: Convert tables to structured formats (OTSL/HTML)
- 🔢 **Equation Recognition**: Extract LaTeX from mathematical formulas
- 💻 **Code Recognition**: Extract code snippets from documents
- 📈 **Chart-to-Table**: Convert charts and graphs to structured data
- 🗂️ **Layout Analysis**: Understand document structure (headers, footers, sections)

## Limitations

- **Not for general image understanding**: For general vision tasks, use the [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800)
- **Document-focused**: Optimized for document pages, not natural images
- **English primary**: Best performance on English documents (experimental support for Japanese, Arabic, and Chinese)
- **Potential hallucination**: Like all smaller VLMs, it may hallucinate in complex scenarios

## Responsible Use

This model is designed for document understanding and should be used responsibly:

- Verify outputs for critical applications
- Be aware of potential biases in document interpretation
- Do not use for autonomous decision-making without human oversight
- Consider using it with [Granite Guardian](https://huggingface.co/collections/ibm-granite/granite-guardian-67b57364c9803552c95b1923) for additional safety

## Citation

If you use this model, please cite:

```bibtex
@misc{granite-docling-2025,
  title={Granite Docling: Efficient Document Conversion with Vision-Language Models},
  author={IBM Research},
  year={2025},
  url={https://huggingface.co/ibm-granite/granite-docling-258M}
}
```

## Resources

- 📚 [Original Model](https://huggingface.co/ibm-granite/granite-docling-258M)
- 🐥 [Docling Library](https://github.com/docling-project/docling)
- 🚀 [llama.cpp](https://github.com/ggerganov/llama.cpp)
- 📖 [Docling Documentation](https://docling-project.github.io/docling/)
- 💡 [Granite Resources](https://ibm.biz/granite-learning-resources)

## Acknowledgments

- Original model by IBM Research
- GGUF conversion using llama.cpp's conversion tools
- Thanks to the llama.cpp team for multimodal support