IBM Research and Hugging Face Introduce SmolDocling: A Compact Vision-Language Model for Document Conversion
IBM Research and Hugging Face have unveiled SmolDocling, an ultra-compact vision-language model designed for end-to-end document conversion. Unlike traditional approaches that rely on large foundation models or complex pipelines, SmolDocling is a lightweight, efficient model that processes entire documents while preserving their structure, content, and spatial layout.
Key Features of SmolDocling:
Compact size: With 256M parameters, it rivals models up to 27 times larger while requiring significantly less computational power.
New markup format: Introduces DocTags, a structured, universal representation that captures all document elements, including tables, equations, charts, and code.
Diverse document handling: Goes beyond the usual focus on scientific papers to process business reports, patents, forms, and academic articles.
Public datasets contribution: Introduces new datasets for tasks like chart understanding, equation recognition, and code extraction.
Superior performance: Outperforms larger models such as Qwen2.5-VL (7B) and GOT (580M) in OCR accuracy, code recognition, and document layout analysis.
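To put the size comparison in perspective, the short sketch below computes how much larger each named model is than SmolDocling, using only the parameter counts quoted above. It treats "7B" as exactly 7,000M parameters, which is a simplifying assumption, and the helper name size_ratio is purely illustrative.

```python
# Parameter counts as quoted in the article, in millions.
# Assumption: "7B" is taken as exactly 7,000M for this rough comparison.
PARAMS_M = {"SmolDocling": 256, "GOT": 580, "Qwen2.5-VL": 7000}


def size_ratio(model: str, baseline: str = "SmolDocling") -> float:
    """Return how many times larger `model` is than `baseline`."""
    return PARAMS_M[model] / PARAMS_M[baseline]


for name in ("GOT", "Qwen2.5-VL"):
    print(f"{name}: {size_ratio(name):.1f}x the size of SmolDocling")
```

Under these assumptions the 7B model comes out at about 27 times SmolDocling's size, in line with the "up to 27 times larger" figure, although the article does not say which model that figure refers to.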
The model is currently available on