Baidu Breaks OCR Bottleneck: Dozens of Pages in One Pass

Baidu researchers have developed an OCR model that processes dozens of document pages in a single inference pass – while previous systems topped out at around ten pages. The system, called Unlimited OCR, uses a novel attention mechanism named Reference Sliding Window Attention (R-SWA) to keep memory and processing speed constant, regardless of text volume.

Quick Facts

Unlimited OCR handles dozens of pages in one pass, compared to the previous limit of about ten pages
The innovation's core: R-SWA keeps the KV-cache at a constant size instead of letting it grow linearly
Baidu uses Deepseek OCR as its foundation and pairs it with a Mixture-of-Experts architecture (3 billion parameters, with 500 million active during inference)
Trained on roughly two million document samples – the system currently tops the most important OCR benchmark

The Problem: The KV-Cache Bottleneck

Previous OCR systems hit a technical wall. Language models store the keys and values of all processed tokens in a KV-cache during text generation, a buffer they reference at every later step. With multi-page documents, this cache grows linearly with every generated token. Memory use climbs steadily, and each decoding step gets slower because it has to attend over an ever-longer cache. The practical workaround was crude: process each page separately, reset the cache, move to the next page. That is inefficient and slow.
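
To make the linear growth concrete, here is a minimal back-of-the-envelope sketch. The layer count, head count, head size and 16-bit precision are hypothetical placeholders, not Unlimited OCR's real dimensions; only the shape of the growth matters.

```python
# All model dimensions below are hypothetical placeholders, not Unlimited OCR's real values.
NUM_LAYERS = 12
NUM_KV_HEADS = 8
HEAD_DIM = 64
BYTES_PER_VALUE = 2  # 16-bit floats


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to hold keys and values for num_tokens cached tokens."""
    # The factor 2 covers one key vector plus one value vector per token, layer and head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


def total_attention_reads(reference_tokens: int, steps: int) -> int:
    """Cache entries read over a whole generation when nothing is ever evicted."""
    # At step t the new token attends over the reference tokens plus t earlier outputs.
    return sum(reference_tokens + t for t in range(steps))


if __name__ == "__main__":
    for generated in (1_000, 10_000, 100_000):
        mib = kv_cache_bytes(256 + generated) / 2**20
        print(f"{generated:>7} generated tokens: {mib:8.1f} MiB of cache")
```

Doubling the output doubles the cache. The total number of cache reads over a full generation grows roughly quadratically, which is why per-step latency keeps rising on long documents.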

Human Forgetting as a Model

Baidu solves this with an elegant analogy to human perception. When copying a book, you don't constantly re-read everything you've written. You focus on the source, the last few characters, and what comes next. Older passages fade through a kind of "soft forgetting."

That's exactly what R-SWA does: each newly generated token sees all visual reference tokens and the prompt, but when looking back at already-generated output, it only attends to the last 128 tokens. The part of the KV-cache that holds generated output is therefore capped at 128 entries, so the cache stays at a constant size instead of growing with the output. An additional trick: visual tokens are encoded once and remain unchanged, which keeps them from being blurred by ongoing state updates.
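
The attention pattern described above can be sketched in a few lines. This is an illustration, not Baidu's implementation: the function and class names are hypothetical, the 256 reference tokens are an example value, and it assumes the window covers the 128 previously generated tokens.

```python
from collections import deque

WINDOW = 128  # look-back over generated output, the figure given in the article


def visible_positions(step: int, num_reference: int, window: int = WINDOW) -> list[int]:
    """Sequence positions the output token at `step` (0-based) may attend to.

    Positions 0..num_reference-1 hold the visual reference tokens and the prompt;
    generated tokens follow from position num_reference onwards.
    """
    reference = list(range(num_reference))
    first_recent = max(0, step - window)
    recent = [num_reference + i for i in range(first_recent, step)]
    return reference + recent


class RollingCache:
    """Fixed reference entries plus a bounded window of recent output entries."""

    def __init__(self, reference_entries, window: int = WINDOW):
        self.reference = tuple(reference_entries)  # encoded once, never rewritten
        self.recent = deque(maxlen=window)  # oldest entry drops out automatically

    def append(self, entry) -> None:
        self.recent.append(entry)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


if __name__ == "__main__":
    cache = RollingCache(reference_entries=range(256))
    for token_id in range(5_000):
        cache.append(token_id)
    print(len(cache))  # prints 384: 256 reference entries + 128 recent ones
```

However many tokens are generated, the cache in this sketch never exceeds the reference length plus the window, which is the property that keeps memory and per-step latency flat.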

Aspect	Previous Systems	Unlimited OCR
Pages per pass	~10	Dozens
KV-cache growth	Linear	Constant
Latency across decoding steps	Rising	Flat

Architecture and Training

Unlimited OCR builds on Deepseek OCR. The DeepEncoder compresses a 1024×1024-pixel PDF image down to 256 tokens. The decoder network is a Mixture-of-Experts architecture with three billion parameters, of which only around 500 million are active during inference – saving compute. Training used roughly two million document samples, split 9-to-1 between single-page and multi-page data.
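
The figures in this section can be turned into a quick sanity check. The sketch below uses only the rounded numbers quoted above. It assumes one 1024×1024 image per page, and it reads the 9-to-1 split as nine parts single-page to one part multi-page; the helper names are hypothetical.

```python
TOKENS_PER_IMAGE = 256
IMAGE_SIDE_PX = 1024
TOTAL_PARAMS = 3_000_000_000
ACTIVE_PARAMS = 500_000_000
TRAINING_SAMPLES = 2_000_000


def pixels_per_token() -> int:
    """Pixels summarised by one visual token at the quoted compression."""
    return IMAGE_SIDE_PX * IMAGE_SIDE_PX // TOKENS_PER_IMAGE


def reference_tokens(pages: int) -> int:
    """Visual tokens for a document, assuming one 1024x1024 image per page."""
    return pages * TOKENS_PER_IMAGE


def active_fraction() -> float:
    """Share of decoder parameters used per token in the Mixture-of-Experts model."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


def split_samples(total: int, ratio: tuple[int, int] = (9, 1)) -> tuple[int, int]:
    """Split a sample count by ratio; assumed order is (single-page, multi-page)."""
    first = total * ratio[0] // sum(ratio)
    return first, total - first


if __name__ == "__main__":
    print(pixels_per_token())               # 4096, i.e. a 64x64-pixel area per token
    print(reference_tokens(30))             # 7680 visual tokens for a 30-page document
    print(f"{active_fraction():.1%}")       # 16.7%
    print(split_samples(TRAINING_SAMPLES))  # (1800000, 200000)
```

Under these assumptions, a 30-page document contributes fewer than 8,000 fixed visual tokens, and roughly one sixth of the decoder's parameters are active for each generated token.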

What This Means for You

This matters especially for German enterprises handling document processing – insurance, government, logistics, financial services. A system processing dozens of pages in one pass could dramatically speed up batch processing and reduce memory demands. Key questions remain: How well does Unlimited OCR handle German-language documents and specialized formats (forms, tables)? When will it become publicly available? Baidu has demonstrated a technical edge here – German and European teams should watch closely.

Sources

The Decoder

Editorially owned by Ideal Syka. Sources and method: Newsroom & method. Tips and corrections: ai@i6eal.de.

All analyses are based on i6eal's own measurements or on clearly labelled sources. Figures are snapshots and may change; corrections are disclosed transparently.