IBM released Granite 4.0 3B Vision on March 27, a compact vision-language model designed to parse the kinds of documents that consume enormous amounts of human time: forms, tables, charts, and the visual chaos of enterprise PDFs. The model is not the story. The dataset underneath it is.
ChartNet, which IBM researchers will describe in a paper at CVPR 2026, is a multimodal dataset of roughly 1.5 million samples purpose-built for chart and document understanding (IBM's Hugging Face blog says 1.7 million samples; the peer-reviewed CVPR paper cites 1.5 million). Each sample contains five aligned components: the plotting code that generated the chart, the rendered image, the underlying data table, a natural-language summary, and question-answer pairs with reasoning traces. That five-way alignment is what separates it from typical synthetic chart datasets, and it is why the model learns to reason about charts rather than merely describe them.
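To make the five-way alignment concrete, here is a sketch of what one ChartNet-style sample might look like. The field names and contents are invented for illustration; the paper describes the five components but not a public schema.

```python
# Hypothetical ChartNet-style sample. Field names are invented;
# only the five-component structure comes from the paper's description.
sample = {
    # 1. Plotting code that generated the chart (stored as a string)
    "code": "import matplotlib.pyplot as plt\nplt.bar(['Q1', 'Q2'], [4.2, 5.1])",
    # 2. Rendered image (a path or raw bytes in a real dataset)
    "image": "charts/000001.png",
    # 3. Underlying data table
    "table": {"quarter": ["Q1", "Q2"], "revenue_musd": [4.2, 5.1]},
    # 4. Natural-language summary
    "summary": "Revenue grew from 4.2M to 5.1M between Q1 and Q2.",
    # 5. Question-answer pairs with reasoning traces
    "qa": [
        {
            "question": "Which quarter had higher revenue?",
            "reasoning": "Compare the two bars: 5.1 > 4.2, so Q2 is higher.",
            "answer": "Q2",
        }
    ],
}

# The alignment property: every component describes the same underlying
# data, so answers can be supervised against the table and the code.
assert all(k in sample for k in ("code", "image", "table", "summary", "qa"))
```

Because the code, table, summary, and Q&A all derive from the same data, a model trained on such samples can be checked for grounding rather than rewarded for plausible-sounding descriptions.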
The architectural choice that makes it work is called DeepStack Injection, a variant of the DeepStack approach published in 2024. Most vision-language models inject visual information into a language model at a single point, which forces the model to handle both high-level semantics and fine-grained spatial detail simultaneously. DeepStack routes abstract visual features into earlier layers where semantic understanding happens, and feeds high-resolution spatial features into later layers where layout and structure live. The result is a model that reads both what is in a document and where it is, which turns out to matter enormously for table extraction and key-value parsing.
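The routing idea can be illustrated with a toy forward pass. This sketch shows only the layer-routing concept, not Granite's actual implementation: every name is invented, real models operate on tensors, and the specific injection depths here are arbitrary.

```python
# Toy illustration of DeepStack-style multi-layer visual injection:
# rather than concatenating all visual tokens at the input, different
# feature scales join the hidden state at different depths.

NUM_LAYERS = 8  # arbitrary depth for illustration

def encode_image(image):
    """Stand-in vision encoder returning features at two scales."""
    return {
        "semantic": [f"sem({image})"],   # abstract, low-resolution features
        "spatial": [f"spat({image})"],   # fine-grained, high-resolution features
    }

def forward(text_tokens, image):
    feats = encode_image(image)
    hidden = list(text_tokens)
    trace = []  # record (layer, feature_kind) injections for inspection
    for layer in range(NUM_LAYERS):
        if layer == 0:
            hidden += feats["semantic"]          # early: semantics
            trace.append((layer, "semantic"))
        if layer == NUM_LAYERS // 2:
            hidden += feats["spatial"]           # late: layout and structure
            trace.append((layer, "spatial"))
        # ... a real transformer layer would process `hidden` here ...
    return hidden, trace
```

Running `forward(["<doc>"], "form.png")` shows the point: semantic features enter at layer 0 and spatial features at layer 4, so no single layer has to carry both kinds of information.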
The model ships as a LoRA adapter on top of Granite 4.0 Micro rather than as a standalone system, meaning the same deployment can serve both multimodal and text-only workloads, falling back to the base model when vision is not required. That modularity is deliberate: it keeps enterprise deployment practical without requiring a full model swap when document-processing pipelines encounter plain text.
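The serving pattern this enables can be sketched in a few lines. The request shape and function names below are invented; the point is the dispatch logic, where the vision adapter is applied only when a request actually carries images.

```python
# Sketch of the deployment pattern an adapter design enables: one serving
# process over a shared base model, with the vision LoRA applied only for
# multimodal requests. All names here are illustrative, not a real API.

def run_base_model(text):
    # Text-only path: the shared base model handles it unmodified.
    return f"base({text})"

def run_with_vision_adapter(text, images):
    # Multimodal path: in a real stack this would apply LoRA weights on
    # top of the same base rather than loading a second model.
    return f"base+vision_lora({text}, {len(images)} image(s))"

def serve(request):
    images = request.get("images", [])
    if images:
        return run_with_vision_adapter(request["text"], images)
    return run_base_model(request["text"])  # fall back to the base model
```

The design choice is that both branches share one set of base weights in memory, which is what makes mixed text-and-document workloads practical on a single deployment.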
On Chart2Summary, a benchmark evaluated by LLM-as-a-judge, Granite 4.0 3B Vision scored 86.4 percent, the highest of any model evaluated. On Chart2CSV, which tests whether a model can extract the underlying data table from a chart, it scored 62.1 percent, placing second behind Qwen3.5-9B at 63.4 percent, a model more than three times its size. The performance gap between a 3 billion parameter model and a 9 billion parameter model on this task is small enough to be interesting: size is not the only variable.
The more important number is on a different benchmark. On VAREX, a dataset of 1,777 real U.S. government forms ranging from simple flat layouts to complex nested structures, Granite 4.0 3B Vision achieved 85.5 percent exact-match accuracy zero-shot, without any task-specific fine-tuning. That means the model reads real government forms it has never seen before and extracts the correct key-value pairs at a rate that compares favorably to human reviewers. Exact match is a strict metric: the extracted pairs must match ground truth character for character. At that level of accuracy on real forms, the question is not whether the model can help with document processing, but what happens to the human reviewers who currently do it.
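The strictness of the metric is worth seeing in code. This is a minimal scorer in the spirit of the description above, where a predicted field counts only on a character-for-character match; VAREX's exact evaluation protocol may differ, and the example fields are invented.

```python
# Minimal exact-match scorer: a predicted key-value pair counts only if
# the value matches ground truth character for character. Illustrative
# only; the benchmark's actual protocol may normalize differently.

def exact_match_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    if not gold:
        return 1.0
    hits = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return hits / len(gold)

gold = {"name": "Jane Q. Public", "date": "2025-03-27", "amount": "$1,204.00"}
pred = {"name": "Jane Q. Public", "date": "2025-03-27", "amount": "$1204.00"}
score = exact_match_accuracy(pred, gold)  # 2 of 3 fields match exactly
```

Note that the dropped comma in the amount costs the entire field, which is why a high exact-match score on real forms is a demanding result.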
On table extraction, the model led every benchmark evaluated: 92.1 percent on PubTablesV2 cropped tables and 79.3 percent on full-page documents, 64.0 percent on OmniDocBench, and 88.1 percent on TableVQA. All are measured by TEDS, a metric that evaluates both structural and content accuracy.
ChartNet itself was generated using a code-guided synthesis pipeline across 24 chart types and 6 plotting libraries. IBM supplemented the synthetic data with human-annotated and real-world subsets filtered for visual fidelity and semantic accuracy. The five-component alignment means that for any chart, the model sees the code that built it, the rendered output, the raw data, a human-written summary, and structured Q&A. That cross-modal grounding is what lets it go beyond pattern recognition and into actual reasoning about chart content.
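The mechanics of code-guided synthesis can be sketched as a single generation step: draw a random data table first, then derive the plotting code, summary, and Q&A from that same table, so the components are aligned by construction. The chart type, templates, and field names below are invented; the real pipeline spans 24 chart types and 6 plotting libraries.

```python
import random

# Sketch of one code-guided synthesis step. Because every component is
# derived from the same random table, alignment holds by construction.
# All templates here are invented for illustration.

def synthesize_sample(rng):
    labels = ["North", "South", "East", "West"]
    values = [round(rng.uniform(10, 100), 1) for _ in labels]
    table = dict(zip(labels, values))

    # Derive the plotting code from the table (stored as a string;
    # rendering it would produce the image component).
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({labels!r}, {values!r})\n"
        "plt.savefig('chart.png')\n"
    )
    # Derive the summary and Q&A from the same table.
    top = max(table, key=table.get)
    summary = f"{top} has the highest value at {table[top]}."
    qa = [{
        "question": "Which region is highest?",
        "reasoning": f"Compare all bars; {top} = {table[top]} is the maximum.",
        "answer": top,
    }]
    return {"code": code, "image": "chart.png", "table": table,
            "summary": summary, "qa": qa}

sample = synthesize_sample(random.Random(0))
```

Since the answer is computed from the table rather than written independently, the Q&A can never contradict the chart, which is the property that makes synthetic data at this scale trustworthy for training.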
The model is available on Hugging Face under an Apache 2.0 license. It can run standalone on individual images or be integrated with Docling, an open-source document processing library, for end-to-end pipelines handling multi-page PDFs. IBM specifically mentions financial report analysis and research document intelligence as target use cases.
What the benchmark results suggest, even before this model ships into production workflows, is that the bottleneck in enterprise document processing is no longer optical character recognition or layout understanding. The models have gotten good enough at that. The bottleneck is the cost and speed of human review for extracted content. When a model reads a government form at 85.5 percent exact-match accuracy, the economics of document processing change: what used to require a human to verify every field can now require a human to audit a sample. That shift has not landed yet in most enterprise pipelines, but the ChartNet paper, accepted at CVPR 2026, suggests the dataset behind this model will accelerate that timeline.