IBM Granite-Docling: Next-Gen Document Conversion with DocTags

Abstract / Overview

IBM has introduced Granite-Docling-258M, a compact 258M-parameter vision-language model (VLM) for document conversion. Unlike traditional OCR tools, which emit flat text, Granite-Docling preserves document structure and layout, including complex elements such as tables, code, and equations. Backed by IBM Research and the open-source Docling library, the release targets cost-effective, reliable, and multilingual document understanding.

Conceptual Background

Traditional optical character recognition (OCR) often loses structure during conversion, and pipelines that convert directly to Markdown can drop alignment, formatting, and the relationships between elements. Granite-Docling addresses this with DocTags, a markup format designed specifically for AI-driven document parsing.

Whereas OCR outputs flat text, Granite-Docling outputs structured, machine-readable formats suited to retrieval-augmented generation (RAG) and downstream AI workflows.

Key Features of Granite-Docling

  • Ultra-Compact Architecture: 258M parameters, yet positioned by IBM as competitive with multi-billion-parameter systems on document conversion.

  • DocTags Format: Encodes charts, tables, equations, and captions while maintaining logical order.

  • Multilingual Reach: Experimental support for Arabic, Chinese, and Japanese in addition to Latin-script languages.

  • Cost-Efficiency: Delivers high accuracy at lower compute requirements.

  • Enterprise-Ready Stability: Improved dataset filtering reduces annotation errors and instability.

Step-by-Step Walkthrough

1. Model Evolution

Granite-Docling builds on SmolDocling-256M-preview, enhancing performance with:

  • Granite 3 language backbone.

  • SigLIP2 visual encoder.

  • Improved dataset curation to avoid annotation noise.

2. How DocTags Works

  • Assigns explicit markup to page elements.

  • Preserves hierarchy and reading order.

  • Enables smooth conversion to Markdown, JSON, or HTML.
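
To make the idea of explicit markup concrete, here is a minimal Python sketch that parses a simplified DocTags-style string into typed elements and renders them as Markdown or JSON. The sample markup, the tag names, the `<loc_*>` position tokens, and the helpers `parse_doctags` and `to_markdown` are assumptions made for illustration; they are not the full DocTags specification or Docling's own parser.

```python
import json
import re
from dataclasses import asdict, dataclass

# Illustrative DocTags-style input. The tags and <loc_*> tokens are
# simplified assumptions for this sketch, not the complete format.
SAMPLE = (
    "<doctag>"
    "<section_header_level_1><loc_40><loc_30><loc_460><loc_60>"
    "Quarterly Report</section_header_level_1>"
    "<text><loc_40><loc_80><loc_460><loc_140>"
    "Revenue grew in all regions.</text>"
    "<formula><loc_40><loc_160><loc_460><loc_190>E = mc^2</formula>"
    "</doctag>"
)

@dataclass
class Element:
    kind: str    # element type taken from the tag name
    bbox: tuple  # location tokens, in the order they appear
    text: str    # content with location tokens removed

ELEMENT_RE = re.compile(r"<(?P<tag>[a-z_0-9]+)>(?P<body>.*?)</(?P=tag)>", re.S)
LOC_RE = re.compile(r"<loc_(\d+)>")

def parse_doctags(markup: str) -> list[Element]:
    """Return page elements in the order they appear (the reading order)."""
    inner = re.sub(r"</?doctag>", "", markup)  # drop the outer wrapper
    elements = []
    for match in ELEMENT_RE.finditer(inner):
        body = match.group("body")
        bbox = tuple(int(v) for v in LOC_RE.findall(body))
        text = LOC_RE.sub("", body).strip()
        elements.append(Element(match.group("tag"), bbox, text))
    return elements

def to_markdown(elements: list[Element]) -> str:
    """Render parsed elements as Markdown, one block per element."""
    blocks = []
    for el in elements:
        if el.kind.startswith("section_header_level_"):
            level = int(el.kind.rsplit("_", 1)[1])
            blocks.append("#" * level + " " + el.text)
        elif el.kind == "formula":
            blocks.append(f"$${el.text}$$")
        else:
            blocks.append(el.text)
    return "\n\n".join(blocks)

if __name__ == "__main__":
    parsed = parse_doctags(SAMPLE)
    print(to_markdown(parsed))
    print(json.dumps([asdict(el) for el in parsed], indent=2))
```

Because every element carries its type and position, the same parsed list can feed several output formats without re-running the model.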

3. Integration with Docling Library

  • Granite-Docling works standalone but performs best within the Docling ensemble pipeline.

  • Supports plug-and-play integration with external tools such as vector databases and IBM watsonx.ai.
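
As a rough illustration of the hand-off to a vector database, the sketch below splits converted Markdown into heading-scoped chunks that could then be embedded and indexed. The function name `chunk_markdown`, the `max_chars` limit, and the chunk dictionary layout are hypothetical choices for this example; the embedding and database calls are deliberately left out because they depend on the tools you use.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_markdown(md: str, max_chars: int = 500) -> list[dict]:
    """Split Markdown into chunks that each remember their section heading."""
    chunks = []
    heading = ""
    buffer = []

    def flush():
        # Emit the buffered lines as one chunk under the current heading.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buffer.clear()

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section before switching heading
            heading = match.group(2).strip()
            continue
        # Start a new chunk when the current one would exceed the size limit.
        if sum(len(b) + 1 for b in buffer) + len(line) > max_chars:
            flush()
        buffer.append(line)
    flush()
    return chunks

if __name__ == "__main__":
    sample = "# Intro\nHello world.\n\n## Tables\nRow one.\nRow two."
    for chunk in chunk_markdown(sample):
        print(chunk)
```

Keeping the heading alongside each chunk preserves some of the document hierarchy at retrieval time, which is the main reason to chunk structured output rather than flat OCR text.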

4. Multilingual Expansion

  • Early-stage support for non-Latin scripts.

  • Roadmap includes expansion to more widely used alphabets.

Example Code Snippet

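Below is a sketch of running Granite-Docling through Docling's VLM pipeline (Python). It follows the usage published on the model card; class and preset names can change between Docling releases, so check them against your installed version:

from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Select the Granite-Docling preset
# (Hugging Face model ID: ibm-granite/granite-docling-258M)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
)

# Build a converter that routes PDFs through the VLM pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

# Convert a PDF, then export the structured document as DocTags
# (export_to_markdown() and export_to_html() are also available)
doc = converter.convert("sample.pdf").document
print(doc.export_to_doctags())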

This workflow enables enterprises to transform PDFs into DocTags, then into HTML or structured data for downstream AI.

Use Cases / Scenarios

  • Legal and Compliance: Extracting tables, clauses, and citations from contracts without structural loss.

  • Financial Reports: Accurate parsing of multi-column layouts and equations.

  • Academic Publishing: Converting PDFs with footnotes, figures, and mathematical expressions.

  • AI Training Data Prep: Creating structured datasets for fine-tuning large language models.

Limitations / Considerations

  • Multilingual support is still experimental and not enterprise-ready.

  • DocTags adoption requires integration with IBM Docling or third-party systems.

  • Complex documents with embedded handwriting remain challenging.

Expert Quotes

  • “Granite-Docling ensures that structure is not sacrificed for text. In enterprise workflows, that’s the difference between usable output and wasted compute.” — Abraham Daniels, Sr. Technical Product Manager, IBM Granite

  • “If traditional OCR extracts words, Granite-Docling extracts meaning. That’s a paradigm shift for AI-driven document intelligence.” — Dave Bergmann, Senior AI Writer, IBM

Future Enhancements

IBM’s roadmap includes:

  • Larger Granite-Docling models (512M and 900M parameters).

  • Integration of DocTags into IBM watsonx.ai workflows.

  • Expansion of Docling-eval benchmarking ecosystem.

  • Enhanced multilingual stability.

  • Optimized inference speed for edge deployment.

FAQs

Q1. How does Granite-Docling differ from OCR?
A: OCR extracts plain text, often losing structure. Granite-Docling preserves layout, tables, and contextual relationships via DocTags.

Q2. Is Granite-Docling open source?
A: Yes, it is available on Hugging Face under an Apache 2.0 license.

Q3. Can it handle handwritten content?
A: Current versions are optimized for digital documents; handwriting remains a limitation.

Q4. Does it replace the Docling library?
A: No. Granite-Docling complements Docling pipelines but can also run standalone.

Diagram: Granite-Docling document conversion pipeline (image not reproduced here).

Conclusion

Granite-Docling represents a leap forward in document conversion. By combining compact design, layout-preserving intelligence, and DocTags markup, IBM delivers a model suited for enterprise, research, and multilingual contexts. Positioned as the backbone of the Docling ecosystem, Granite-Docling ensures documents are not just digitized but truly understood.