Upload README.md
README.md
ADDED
|
@@ -0,0 +1,536 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- ds4sd/SynthCodeNet
|
| 5 |
+
- ds4sd/SynthFormulaNet
|
| 6 |
+
- ds4sd/SynthChartNet
|
| 7 |
+
- HuggingFaceM4/DoclingMatix
|
| 8 |
+
tags:
|
| 9 |
+
- text-generation
|
| 10 |
+
- documents
|
| 11 |
+
- code
|
| 12 |
+
- formula
|
| 13 |
+
- chart
|
| 14 |
+
- ocr
|
| 15 |
+
- layout
|
| 16 |
+
- table
|
| 17 |
+
- document-parse
|
| 18 |
+
- docling
|
| 19 |
+
- granite
|
| 20 |
+
- extraction
|
| 21 |
+
- math
|
| 22 |
+
language:
|
| 23 |
+
- en
|
| 24 |
+
pipeline_tag: image-text-to-text
|
| 25 |
+
library_name: transformers
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
# granite-docling-258m
|
| 29 |
+
<div style="display: flex; align-items: center;">
|
| 30 |
+
<img src="https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/granite_docling.png" alt="Granite Docling Logo" style="width: 200px; height: auto; margin-right: 20px;">
|
| 31 |
+
<div>
|
| 32 |
+
<p>Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with <a href="https://docling-project.github.io/docling">DoclingDocuments</a> to ensure full compatibility.</p>
|
| 33 |
+
</div>
|
| 34 |
+
</div>
|
| 35 |
+
|
| 36 |
+
**Model Summary**:
|
| 37 |
+
|
| 38 |
+
Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM. Try out our [Granite-Docling-258M](https://huggingface.co/spaces/ibm-granite/granite-docling-258m-demo) demo today.
|
| 39 |
+
|
| 40 |
+
- **Developed by**: IBM Research
|
| 41 |
+
- **Model type**: Multi-modal model (image+text-to-text)
|
| 42 |
+
- **Language(s)**: English (NLP)
|
| 43 |
+
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
| 44 |
+
- **Release Date**: September 17, 2025
|
| 45 |
+
|
| 46 |
+
Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:
|
| 47 |
+
|
| 48 |
+
- 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
|
| 49 |
+
- 🧩 Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference
|
| 50 |
+
- 🧘 Improved Stability: Tends to avoid infinite loops more effectively
|
| 51 |
+
- 🧮 Enhanced Inline Equations: Better inline math recognition
|
| 52 |
+
- 🧾 Document Element QA: Answer questions about a document’s structure such as the presence and order of document elements
|
| 53 |
+
- 🌍 Japanese, Arabic and Chinese support (_experimental_)
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
## Getting started
|
| 58 |
+
|
| 59 |
+
The easiest way to use this model is through the [🐥Docling](https://github.com/docling-project/docling) library. It will automatically download this model and convert documents to various formats for you.
|
| 60 |
+
|
| 61 |
+
Install the latest version of `docling` through pip, then use the following CLI command:
|
| 62 |
+
|
| 63 |
+
```sh
|
| 64 |
+
# Convert to HTML and Markdown:
|
| 65 |
+
docling --to html --to md --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887" # accepts files, urls or directories
|
| 66 |
+
|
| 67 |
+
# Convert to HTML including layout visualization:
|
| 68 |
+
docling --to html_split_page --show-layout --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887"
|
| 69 |
+
|
| 70 |
+
```
|
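If you prefer to drive the CLI from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of Docling itself: `build_docling_command` and `convert` are hypothetical helper names, and the only flags used are the ones shown in the commands above.

```python
import shlex
import subprocess


def build_docling_command(source, formats=("html", "md"), show_layout=False):
    """Assemble the docling CLI call shown above as an argument list.

    Hypothetical helper: it only reuses flags that appear in this README.
    """
    command = ["docling"]
    for fmt in formats:
        command += ["--to", fmt]
    if show_layout:
        command.append("--show-layout")
    command += ["--pipeline", "vlm", "--vlm-model", "granite_docling", source]
    return command


def convert(source, **kwargs):
    """Run the conversion; raises CalledProcessError if docling exits non-zero."""
    command = build_docling_command(source, **kwargs)
    print("Running:", shlex.join(command))
    subprocess.run(command, check=True)
```

Calling `convert("https://arxiv.org/pdf/2501.17887")` would run the first command above, provided `docling` is installed and on your `PATH`. Passing an argument list instead of a shell string avoids quoting problems with URLs and file names.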
| 71 |
+
|
| 72 |
+
<p align="center">
|
| 73 |
+
<img src="https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/granite_docling_split_page.png" alt="GraniteDocling result in split page view" width="900"/>
|
| 74 |
+
</p>
|
| 75 |
+
|
| 76 |
+
<details>
|
| 77 |
+
<summary>You can also set this model up within the Docling SDK:</summary>
|
| 78 |
+
|
| 79 |
+
```python
|
| 80 |
+
from docling.datamodel import vlm_model_specs
|
| 81 |
+
from docling.datamodel.base_models import InputFormat
|
| 82 |
+
from docling.datamodel.pipeline_options import (
|
| 83 |
+
VlmPipelineOptions,
|
| 84 |
+
)
|
| 85 |
+
from docling.document_converter import DocumentConverter, PdfFormatOption
|
| 86 |
+
from docling.pipeline.vlm_pipeline import VlmPipeline
|
| 87 |
+
|
| 88 |
+
source = "https://arxiv.org/pdf/2501.17887"
|
| 89 |
+
|
| 90 |
+
###### USING SIMPLE DEFAULT VALUES
|
| 91 |
+
# - GraniteDocling model
|
| 92 |
+
# - Using the transformers framework
|
| 93 |
+
|
| 94 |
+
converter = DocumentConverter(
|
| 95 |
+
format_options={
|
| 96 |
+
InputFormat.PDF: PdfFormatOption(
|
| 97 |
+
pipeline_cls=VlmPipeline,
|
| 98 |
+
),
|
| 99 |
+
}
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
doc = converter.convert(source=source).document
|
| 103 |
+
|
| 104 |
+
print(doc.export_to_markdown())
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
###### USING MACOS MPS ACCELERATOR
|
| 108 |
+
# For more options see the compare_vlm_models.py example.
|
| 109 |
+
|
| 110 |
+
pipeline_options = VlmPipelineOptions(
|
| 111 |
+
vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
|
| 112 |
+
)
|
| 113 |
+
|
| 114 |
+
converter = DocumentConverter(
|
| 115 |
+
format_options={
|
| 116 |
+
InputFormat.PDF: PdfFormatOption(
|
| 117 |
+
pipeline_cls=VlmPipeline,
|
| 118 |
+
pipeline_options=pipeline_options,
|
| 119 |
+
),
|
| 120 |
+
}
|
| 121 |
+
)
|
| 122 |
+
|
| 123 |
+
doc = converter.convert(source=source).document
|
| 124 |
+
|
| 125 |
+
print(doc.export_to_markdown())
|
| 126 |
+
```
|
| 127 |
+
</details>
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
Alternatively, you can use bare **transformers**, **vllm**, **onnx** or **mlx-vlm** to perform inference, and [docling-core](https://github.com/docling-project/docling-core) APIs to convert results to a variety of output formats (md, html, etc.):
|
| 131 |
+
|
| 132 |
+
<details>
|
| 133 |
+
<summary>📄 Single page image inference using plain 🤗 transformers 🤖</summary>
|
| 134 |
+
|
| 135 |
+
```python
|
| 136 |
+
# Prerequisites:
|
| 137 |
+
# pip install torch
|
| 138 |
+
# pip install docling_core
|
| 139 |
+
# pip install transformers
|
| 140 |
+
|
| 141 |
+
import torch
|
| 142 |
+
from docling_core.types.doc import DoclingDocument
|
| 143 |
+
from docling_core.types.doc.document import DocTagsDocument
|
| 144 |
+
from transformers import AutoProcessor, AutoModelForVision2Seq
|
| 145 |
+
from transformers.image_utils import load_image
|
| 146 |
+
from pathlib import Path
|
| 147 |
+
|
| 148 |
+
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
| 149 |
+
|
| 150 |
+
# Load images
|
| 151 |
+
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
|
| 152 |
+
|
| 153 |
+
# Initialize processor and model
|
| 154 |
+
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
|
| 155 |
+
model = AutoModelForVision2Seq.from_pretrained(
|
| 156 |
+
"ibm-granite/granite-docling-258M",
|
| 157 |
+
torch_dtype=torch.bfloat16,
|
| 158 |
+
_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "sdpa",
|
| 159 |
+
).to(DEVICE)
|
| 160 |
+
|
| 161 |
+
# Create input messages
|
| 162 |
+
messages = [
|
| 163 |
+
{
|
| 164 |
+
"role": "user",
|
| 165 |
+
"content": [
|
| 166 |
+
{"type": "image"},
|
| 167 |
+
{"type": "text", "text": "Convert this page to docling."}
|
| 168 |
+
]
|
| 169 |
+
},
|
| 170 |
+
]
|
| 171 |
+
|
| 172 |
+
# Prepare inputs
|
| 173 |
+
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
|
| 174 |
+
inputs = processor(text=prompt, images=[image], return_tensors="pt")
|
| 175 |
+
inputs = inputs.to(DEVICE)
|
| 176 |
+
|
| 177 |
+
# Generate outputs
|
| 178 |
+
generated_ids = model.generate(**inputs, max_new_tokens=8192)
|
| 179 |
+
prompt_length = inputs.input_ids.shape[1]
|
| 180 |
+
trimmed_generated_ids = generated_ids[:, prompt_length:]
|
| 181 |
+
doctags = processor.batch_decode(
|
| 182 |
+
trimmed_generated_ids,
|
| 183 |
+
skip_special_tokens=False,
|
| 184 |
+
)[0].lstrip()
|
| 185 |
+
|
| 186 |
+
print(f"DocTags: \n{doctags}\n")
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
# Populate document
|
| 190 |
+
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
|
| 191 |
+
# create a docling document
|
| 192 |
+
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
|
| 193 |
+
print(f"Markdown:\n{doc.export_to_markdown()}\n")
|
| 194 |
+
|
| 195 |
+
## export as any format.
|
| 196 |
+
# Path("out/").mkdir(parents=True, exist_ok=True)
|
| 197 |
+
# HTML:
|
| 198 |
+
# output_path_html = Path("out/") / "example.html"
|
| 199 |
+
# doc.save_as_html(output_path_html)
|
| 200 |
+
# Markdown:
|
| 201 |
+
# output_path_md = Path("out/") / "example.md"
|
| 202 |
+
# doc.save_as_markdown(output_path_md)
|
| 203 |
+
|
| 204 |
+
```
|
| 205 |
+
</details>
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
<details>
|
| 209 |
+
<summary>🚀 Fast Batch Inference with vLLM</summary>
|
| 210 |
+
|
| 211 |
+
```python
|
| 212 |
+
# Prerequisites:
|
| 213 |
+
# pip install vllm
|
| 214 |
+
# pip install docling_core
|
| 215 |
+
# place page images you want to convert into "img/" dir
|
| 216 |
+
|
| 217 |
+
import time
|
| 218 |
+
import os
|
| 219 |
+
from vllm import LLM, SamplingParams
|
| 220 |
+
from transformers import AutoProcessor
|
| 221 |
+
from PIL import Image
|
| 222 |
+
from docling_core.types.doc import DoclingDocument
|
| 223 |
+
from docling_core.types.doc.document import DocTagsDocument
|
| 224 |
+
from pathlib import Path
|
| 225 |
+
|
| 226 |
+
# Configuration
|
| 227 |
+
MODEL_PATH = "ibm-granite/granite-docling-258M"
|
| 228 |
+
IMAGE_DIR = "img/" # Place your page images here
|
| 229 |
+
OUTPUT_DIR = "out/"
|
| 230 |
+
PROMPT_TEXT = "Convert this page to docling."
|
| 231 |
+
|
| 232 |
+
messages = [
|
| 233 |
+
{
|
| 234 |
+
"role": "user",
|
| 235 |
+
"content": [
|
| 236 |
+
{"type": "image"},
|
| 237 |
+
{"type": "text", "text": PROMPT_TEXT},
|
| 238 |
+
],
|
| 239 |
+
},
|
| 240 |
+
]
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
# Ensure output directory exists
|
| 244 |
+
os.makedirs(OUTPUT_DIR, exist_ok=True)
|
| 245 |
+
|
| 246 |
+
# Initialize LLM
|
| 247 |
+
llm = LLM(model=MODEL_PATH, revision="untied", limit_mm_per_prompt={"image": 1})
|
| 248 |
+
processor = AutoProcessor.from_pretrained(MODEL_PATH)
|
| 249 |
+
|
| 250 |
+
sampling_params = SamplingParams(
|
| 251 |
+
temperature=0.0,
|
| 252 |
+
max_tokens=8192,
|
| 253 |
+
skip_special_tokens=False,
|
| 254 |
+
)
|
| 255 |
+
|
| 256 |
+
# Load and prepare all images and prompts up front
|
| 257 |
+
batched_inputs = []
|
| 258 |
+
image_names = []
|
| 259 |
+
|
| 260 |
+
for img_file in sorted(os.listdir(IMAGE_DIR)):
|
| 261 |
+
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
|
| 262 |
+
        img_path = os.path.join(IMAGE_DIR, img_file)
|
| 263 |
+
        with Image.open(img_path) as im:
|
| 264 |
+
            image = im.convert("RGB")
|
| 265 |
+
|
| 266 |
+
        prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
|
| 267 |
+
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
|
| 268 |
+
        image_names.append(os.path.splitext(img_file)[0])
|
| 269 |
+
|
| 270 |
+
# Run batch inference
|
| 271 |
+
start_time = time.time()
|
| 272 |
+
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)
|
| 273 |
+
|
| 274 |
+
# Postprocess all results
|
| 275 |
+
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
|
| 276 |
+
    doctags = output.outputs[0].text
|
| 277 |
+
    output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
|
| 278 |
+
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"
|
| 279 |
+
|
| 280 |
+
    with open(output_path_dt, "w", encoding="utf-8") as f:
|
| 281 |
+
        f.write(doctags)
|
| 282 |
+
|
| 283 |
+
    # Convert to DoclingDocument and save markdown
|
| 284 |
+
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
|
| 285 |
+
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
|
| 286 |
+
    doc.save_as_markdown(output_path_md)
|
| 287 |
+
|
| 288 |
+
print(f"Total time: {time.time() - start_time:.2f} sec")
|
| 289 |
+
|
| 290 |
+
```
|
| 291 |
+
</details>
|
| 292 |
+
|
| 293 |
+
💻 Local inference on Apple Silicon with MLX: [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)
|
| 294 |
+
|
| 295 |
+
## Intended Use
|
| 296 |
+
Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM.
|
| 297 |
+
However, Granite-Docling is **not** intended for general image understanding. For tasks focused solely on image-text input, we recommend using [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800), which are purpose-built and optimized for image-text processing.
|
| 298 |
+
|
| 299 |
+
## Evaluations
|
| 300 |
+
A comprehensive discussion of evaluation methods and findings has already been presented in our previous publication [[citation](https://arxiv.org/pdf/2503.11576)]. As this model is an update, we refer readers to that work for additional details.
|
| 301 |
+
The evaluation can be performed using the [docling-eval](https://github.com/docling-project/docling-eval) framework for the document-related tasks, and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for MMStar and OCRBench.
|
| 302 |
+
|
| 303 |
+
<table>
|
| 304 |
+
<thead>
|
| 305 |
+
<tr><th colspan="5"><b>Layout</b></th></tr>
|
| 306 |
+
<tr>
|
| 307 |
+
<th></th>
|
| 308 |
+
<th>MAP ↑</th>
|
| 309 |
+
<th>F1 ↑</th>
|
| 310 |
+
<th>Precision ↑</th>
|
| 311 |
+
<th>Recall ↑</th>
|
| 312 |
+
</tr>
|
| 313 |
+
</thead>
|
| 314 |
+
<tbody>
|
| 315 |
+
<tr>
|
| 316 |
+
<td><b>smoldocling-256m-preview</b></td>
|
| 317 |
+
<td>0.23</td><td>0.85</td><td>0.9</td><td>0.84</td>
|
| 318 |
+
</tr>
|
| 319 |
+
<tr>
|
| 320 |
+
<td><b>granite-docling-258m</b></td>
|
| 321 |
+
<td><b>0.27</b></td><td><b>0.86</b></td><td><b>0.92</b></td><td><b>0.88</b></td>
|
| 322 |
+
</tr>
|
| 323 |
+
</tbody>
|
| 324 |
+
</table>
|
| 325 |
+
|
| 326 |
+
<table>
|
| 327 |
+
<thead>
|
| 328 |
+
<tr><th colspan="7"><b>Full Page OCR</b></th></tr>
|
| 329 |
+
<tr>
|
| 330 |
+
<th></th>
|
| 331 |
+
<th>Edit-distance ↓</th>
|
| 332 |
+
<th>F1 ↑</th>
|
| 333 |
+
<th>Precision ↑</th>
|
| 334 |
+
<th>Recall ↑</th>
|
| 335 |
+
<th>BLEU ↑</th>
|
| 336 |
+
<th>Meteor ↑</th>
|
| 337 |
+
</tr>
|
| 338 |
+
</thead>
|
| 339 |
+
<tbody>
|
| 340 |
+
<tr>
|
| 341 |
+
<td><b>smoldocling-256m-preview</b></td>
|
| 342 |
+
<td>0.48</td><td>0.80</td><td>0.89</td>
|
| 343 |
+
<td>0.79</td><td>0.58</td><td>0.67</td>
|
| 344 |
+
</tr>
|
| 345 |
+
<tr>
|
| 346 |
+
<td><b>granite-docling-258m</b></td>
|
| 347 |
+
<td><b>0.45</b></td><td><b>0.84</b></td><td><b>0.91</b></td>
|
| 348 |
+
<td><b>0.83</b></td><td><b>0.65</b></td><td><b>0.72</b></td>
|
| 349 |
+
</tr>
|
| 350 |
+
</tbody>
|
| 351 |
+
<thead>
|
| 352 |
+
<tr><th colspan="7"><b>Code Recognition</b></th></tr>
|
| 353 |
+
<tr>
|
| 354 |
+
<th></th>
|
| 355 |
+
<th>Edit-distance ↓</th>
|
| 356 |
+
<th>F1 ↑</th>
|
| 357 |
+
<th>Precision ↑</th>
|
| 358 |
+
<th>Recall ↑</th>
|
| 359 |
+
<th>BLEU ↑</th>
|
| 360 |
+
<th>Meteor ↑</th>
|
| 361 |
+
</tr>
|
| 362 |
+
</thead>
|
| 363 |
+
<tbody>
|
| 364 |
+
<tr>
|
| 365 |
+
<td><b>smoldocling-256m-preview</b></td>
|
| 366 |
+
<td>0.114</td><td>0.915</td><td>0.94</td><td>0.909</td><td>0.875</td><td>0.889</td>
|
| 367 |
+
</tr>
|
| 368 |
+
<tr>
|
| 369 |
+
<td><b>granite-docling-258m</b></td>
|
| 370 |
+
<td><b>0.013</b></td><td><b>0.988</b></td><td><b>0.99</b></td><td><b>0.988</b></td>
|
| 371 |
+
<td><b>0.983</b></td><td><b>0.986</b></td>
|
| 372 |
+
</tr>
|
| 373 |
+
</tbody>
|
| 374 |
+
<thead>
|
| 375 |
+
<tr><th colspan="7"><b>Equation Recognition</b></th></tr>
|
| 376 |
+
<tr>
|
| 377 |
+
<th></th>
|
| 378 |
+
<th>Edit-distance ↓</th>
|
| 379 |
+
<th>F1 ↑</th>
|
| 380 |
+
<th>Precision ↑</th>
|
| 381 |
+
<th>Recall ↑</th>
|
| 382 |
+
<th>BLEU ↑</th>
|
| 383 |
+
<th>Meteor ↑</th>
|
| 384 |
+
</tr>
|
| 385 |
+
</thead>
|
| 386 |
+
<tbody>
|
| 387 |
+
<tr>
|
| 388 |
+
<td><b>smoldocling-256m-preview</b></td>
|
| 389 |
+
<td>0.119</td><td>0.947</td><td>0.959</td><td>0.941</td><td>0.824</td><td>0.878</td>
|
| 390 |
+
</tr>
|
| 391 |
+
<tr>
|
| 392 |
+
<td><b>granite-docling-258m</b></td>
|
| 393 |
+
<td><b>0.073</b></td><td><b>0.968</b></td><td><b>0.968</b></td><td><b>0.969</b></td>
|
| 394 |
+
<td><b>0.893</b></td><td><b>0.927</b></td>
|
| 395 |
+
</tr>
|
| 396 |
+
</tbody>
|
| 397 |
+
</table>
|
| 398 |
+
<table>
|
| 399 |
+
<thead>
|
| 400 |
+
<tr><th colspan="3"><b>Table Recognition (FinTabNet 150dpi)</b></th></tr>
|
| 401 |
+
<tr>
|
| 402 |
+
<th></th>
|
| 403 |
+
<th>TEDS (structure) ↑</th>
|
| 404 |
+
<th>TEDS (w/content) ↑</th>
|
| 405 |
+
</tr>
|
| 406 |
+
</thead>
|
| 407 |
+
<tbody>
|
| 408 |
+
<tr>
|
| 409 |
+
<td><b>smoldocling-256m-preview</b></td>
|
| 410 |
+
<td>0.82</td><td>0.76</td>
|
| 411 |
+
</tr>
|
| 412 |
+
<tr>
|
| 413 |
+
<td><b>granite-docling-258m</b></td>
|
| 414 |
+
<td><b>0.97</b></td><td><b>0.96</b></td>
|
| 415 |
+
</tr>
|
| 416 |
+
</tbody>
|
| 417 |
+
</table>
|
| 418 |
+
<table>
|
| 419 |
+
<thead>
|
| 420 |
+
<tr><th colspan="3"><b>Other Benchmarks</b></th></tr>
|
| 421 |
+
<tr>
|
| 422 |
+
<th></th>
|
| 423 |
+
<th>MMStar ↑</th>
|
| 424 |
+
<th>OCRBench ↑</th>
|
| 425 |
+
</tr>
|
| 426 |
+
</thead>
|
| 427 |
+
<tbody>
|
| 428 |
+
<tr>
|
| 429 |
+
<td><b>smoldocling-256m-preview</b></td>
|
| 430 |
+
<td>0.17</td><td>338</td>
|
| 431 |
+
</tr>
|
| 432 |
+
<tr>
|
| 433 |
+
<td><b>granite-docling-258m</b></td>
|
| 434 |
+
<td><b>0.30</b></td><td><b>500</b></td>
|
| 435 |
+
</tr>
|
| 436 |
+
</tbody>
|
| 437 |
+
</table>
|
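For readers unfamiliar with the "Edit-distance ↓" column, the sketch below shows one common way to compute a length-normalized edit distance between a prediction and a reference string. It is illustrative only: this README does not state which normalization docling-eval applies, so dividing by the longer string's length is an assumption, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization).

    0.0 means the strings are identical; 1.0 means nothing matches.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest
```

Under this definition lower is better, which matches the direction of the arrow in the tables above.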
| 438 |
+
|
| 439 |
+
|
| 440 |
+
|
| 441 |
+
|
| 442 |
+
|
| 443 |
+
|
| 444 |
+
## Supported Instructions
|
| 445 |
+
|
| 446 |
+
<table>
|
| 447 |
+
<tr>
|
| 448 |
+
<th>Description</th>
|
| 449 |
+
<th>Instruction</th>
|
| 450 |
+
<th>Short Instruction</th>
|
| 451 |
+
</tr>
|
| 452 |
+
<tr>
|
| 453 |
+
<td><b>Full conversion</b></td>
|
| 454 |
+
<td>Convert this page to docling.</td>
|
| 455 |
+
<td>-</td>
|
| 456 |
+
</tr>
|
| 457 |
+
<tr>
|
| 458 |
+
<td><b>Chart</b></td>
|
| 459 |
+
<td>Convert chart to table.</td>
|
| 460 |
+
<td><code>&lt;chart&gt;</code></td>
|
| 461 |
+
</tr>
|
| 462 |
+
<tr>
|
| 463 |
+
<td><b>Formula</b></td>
|
| 464 |
+
<td>Convert formula to LaTeX.</td>
|
| 465 |
+
<td><code>&lt;formula&gt;</code></td>
|
| 466 |
+
</tr>
|
| 467 |
+
<tr>
|
| 468 |
+
<td><b>Code</b></td>
|
| 469 |
+
<td>Convert code to text.</td>
|
| 470 |
+
<td><code>&lt;code&gt;</code></td>
|
| 471 |
+
</tr>
|
| 472 |
+
<tr>
|
| 473 |
+
<td><b>Table</b></td>
|
| 474 |
+
<td>Convert table to OTSL. (<a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a>)</td>
|
| 475 |
+
<td><code>&lt;otsl&gt;</code></td>
|
| 476 |
+
</tr>
|
| 477 |
+
<tr>
|
| 478 |
+
<td rowspan="4"><b>Actions and Pipelines</b></td>
|
| 479 |
+
<td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
|
| 480 |
+
<td>-</td>
|
| 481 |
+
</tr>
|
| 482 |
+
<tr>
|
| 483 |
+
<td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
|
| 484 |
+
<td>-</td>
|
| 485 |
+
</tr>
|
| 486 |
+
<tr>
|
| 487 |
+
<td>Find all 'text' elements on the page, retrieve all section headers.</td>
|
| 488 |
+
<td>-</td>
|
| 489 |
+
</tr>
|
| 490 |
+
<tr>
|
| 491 |
+
<td>Detect footer elements on the page.</td>
|
| 492 |
+
<td>-</td>
|
| 493 |
+
</tr>
|
| 494 |
+
</table>
|
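The location-based instructions in the table above take four `loc` tokens. The sketch below shows how a pixel-space bounding box might be turned into such tokens and back. It rests on two assumptions that this README does not confirm: that coordinates are normalized to a 0 to 500 grid, and that the token order is left, top, right, bottom. All helper names are hypothetical.

```python
import re

LOC_GRID = 500  # ASSUMPTION: size of the normalized coordinate grid


def bbox_to_loc_tokens(bbox, page_width, page_height, grid=LOC_GRID):
    """Map a pixel-space (left, top, right, bottom) box to four <loc_N> tokens."""
    left, top, right, bottom = bbox
    scaled = (
        left / page_width,
        top / page_height,
        right / page_width,
        bottom / page_height,
    )
    # Clamp to [0, grid] so boxes touching the page edge stay valid.
    values = [min(grid, max(0, round(v * grid))) for v in scaled]
    return "".join(f"<loc_{v}>" for v in values)


def loc_tokens_to_bbox(text, page_width, page_height, grid=LOC_GRID):
    """Inverse mapping: read the first four <loc_N> tokens back into pixels."""
    values = [int(v) for v in re.findall(r"<loc_(\d+)>", text)[:4]]
    if len(values) != 4:
        raise ValueError("expected four <loc_N> tokens")
    left, top, right, bottom = values
    return (
        left * page_width / grid,
        top * page_height / grid,
        right * page_width / grid,
        bottom * page_height / grid,
    )


def region_ocr_prompt(bbox, page_width, page_height):
    """Build the region OCR instruction listed in the table above."""
    return "OCR the text in a specific location: " + bbox_to_loc_tokens(
        bbox, page_width, page_height
    )
```

The resulting string can be used as the text part of the user message in place of "Convert this page to docling." in the inference examples. Check the grid size and coordinate order against actual model output before relying on them.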
| 495 |
+
|
| 496 |
+
|
| 497 |
+
|
| 498 |
+
## Model Architecture
|
| 499 |
+
|
| 500 |
+
The architecture of granite-docling-258m consists of the following components:
|
| 501 |
+
|
| 502 |
+
(1) Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512).
|
| 503 |
+
|
| 504 |
+
(2) Vision-language connector: pixel shuffle projector (as in idefics3)
|
| 505 |
+
|
| 506 |
+
(3) Large language model: Granite 165M.
|
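To give an intuition for the pixel shuffle connector in component (2), here is a minimal pure-Python sketch of the underlying idea: each block of neighbouring visual tokens is merged into one longer vector, which reduces the number of tokens passed to the language model. This is a simplified illustration, not the model's implementation. The real connector works on tensors, may order features differently, and is followed by a learned projection that is omitted here; the function name and the `ratio` parameter are assumptions.

```python
def pixel_shuffle(grid, ratio):
    """Merge each ratio x ratio block of token vectors into one longer vector.

    grid: list of rows, each row a list of feature vectors (lists of numbers).
    Returns a grid of size (H / ratio) x (W / ratio) whose vectors have
    length D * ratio * ratio.
    """
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid height and width must be divisible by ratio")
    shuffled = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            merged = []
            # Concatenate the block's vectors in row-major order.
            for dy in range(ratio):
                for dx in range(ratio):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        shuffled.append(row)
    return shuffled
```

With `ratio=2`, a 4 x 4 grid of 2-dimensional tokens becomes a 2 x 2 grid of 8-dimensional tokens, so the sequence length drops by a factor of four while no feature values are discarded.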
| 507 |
+
|
| 508 |
+
We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM’s supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
|
| 509 |
+
The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.
|
| 510 |
+
|
| 511 |
+
|
| 512 |
+
**Training Data**: Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.
|
| 513 |
+
|
| 514 |
+
In particular, we incorporate:
|
| 515 |
+
|
| 516 |
+
* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet) — a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
|
| 517 |
+
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet) — a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
|
| 518 |
+
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet) — synthetic chart images annotated with structured table outputs
|
| 519 |
+
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix) — a curated corpus of real-world document pages sampled from diverse domains
|
| 520 |
+
|
| 521 |
+
|
| 522 |
+
**Infrastructure**: We train granite-docling-258m using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
|
| 523 |
+
|
| 524 |
+
**Responsible Use and Limitations**: Some use cases for Vision Language Models can trigger certain risks and ethical considerations, including but not limited to: bias and fairness, misinformation, and autonomous decision-making.
|
| 525 |
+
Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive or unwanted responses to user prompts. Additionally, whether smaller models may exhibit increased susceptibility
|
| 526 |
+
to hallucination in generation scenarios due to their reduced sizes, which could limit their ability to generate coherent and contextually accurate responses, remains uncertain. This aspect is currently an active area of research,
|
| 527 |
+
and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. We urge the community to use granite-docling-258m in a responsible way and avoid any malicious utilization. We recommend using this model only as part of the Docling library.
|
| 528 |
+
More general vision tasks may pose higher inherent risks of triggering unwanted output. To enhance safety, we recommend using granite-docling-258m alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas.
|
| 529 |
+
Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
|
| 530 |
+
|
| 531 |
+
**Resources**
|
| 532 |
+
|
| 533 |
+
- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
|
| 534 |
+
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
|
| 535 |
+
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
|
| 536 |
+
- 🖥️ Learn more about how to use Granite-Docling, explore the Docling library, and see what’s coming next for Docling in the release blog: https://ibm.com/new/announcements/granite-docling-end-to-end-document-conversion
|