IBM Research and Hugging Face Introduce SmolDocling: A Compact Vision-Language Model for Document Conversion

IBM Research and Hugging Face have unveiled SmolDocling, an ultra-compact vision-language model designed for end-to-end document conversion. Unlike traditional models that rely on large foundational architectures or complex pipelines, SmolDocling offers a lightweight, efficient solution for processing entire documents while preserving their structure, content, and spatial layout.

Key Features of SmolDocling:

Compact size: With 256M parameters, it rivals models up to 27 times larger while requiring significantly less computational power.

New markup format: Introduces DocTags, a structured, universal representation that captures all document elements, including tables, equations, charts, and code.

Diverse document handling: Goes beyond scientific papers and academic articles to process business reports, patents, and forms.

Public datasets contribution: Introduces new datasets for tasks like chart understanding, equation recognition, and code extraction.

Superior performance: Competes with larger models such as Qwen2.5-VL (7B parameters) and GOT (580M parameters), and outperforms them in OCR accuracy, code recognition, and document layout analysis.
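
To put the size figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the parameter counts quoted in this article (256M, 580M, 7B). The assumption that the "27 times larger" figure refers to the 7B Qwen2.5-VL model, and the assumption of 16-bit weights for the memory estimate, are ours and are not stated in the announcement. The function names are purely illustrative.

```python
# Parameter counts quoted in this article, in millions of parameters.
PARAMS_M = {
    "SmolDocling": 256,
    "GOT": 580,
    "Qwen2.5-VL": 7000,
}

# Assumption (not from the article): weights stored in 16 bits (fp16/bf16).
BYTES_PER_PARAM = 2


def size_ratio(model: str, baseline: str = "SmolDocling") -> float:
    """How many times larger `model` is than `baseline`, by parameter count."""
    return PARAMS_M[model] / PARAMS_M[baseline]


def weight_memory_gb(model: str) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return PARAMS_M[model] * 1e6 * BYTES_PER_PARAM / 1e9


if __name__ == "__main__":
    for name in PARAMS_M:
        print(
            f"{name}: {size_ratio(name):.1f}x SmolDocling, "
            f"~{weight_memory_gb(name):.2f} GB of weights"
        )
```

Under these assumptions, the 7B model comes out at roughly 27 times the parameter count of SmolDocling, which matches the ratio quoted in the feature list. The memory figures cover weights only and ignore activations and any runtime overhead, so they should be read as a lower bound rather than a measured footprint.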

The model is currently available on

©Postnetwork-All rights reserved.