Baidu researchers have developed an OCR model that processes dozens of document pages in a single inference pass, whereas previous systems topped out at around ten pages. The system, called Unlimited OCR, uses a novel attention mechanism named Reference Sliding Window Attention (R-SWA) to keep memory use and decoding speed constant, regardless of how much text is produced.
Quick Facts
- Unlimited OCR handles dozens of pages in one pass, compared to the previous limit of about ten pages
- The innovation's core: R-SWA keeps the KV-cache at a constant size instead of letting it grow linearly
- Baidu uses Deepseek OCR as its foundation and pairs it with a Mixture-of-Experts architecture (3 billion parameters, with 500 million active during inference)
- The system was trained on roughly two million document samples and currently tops the most important OCR benchmark
The Problem: The KV-Cache Bottleneck
Previous OCR systems hit a technical wall. Language models store all processed tokens in a KV-cache during text generation, a buffer they reference at every later step. With multi-page documents, this cache grows linearly with every generated token. Memory use climbs steadily, and each decoding step gets slower because it has to attend over a longer history. The practical workaround was crude: process each page separately, reset the cache, move to the next page. That is inefficient and slow.
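The growth pattern can be made concrete with a little arithmetic. The sketch below is illustrative only: the 256 visual tokens per page match the encoder figure quoted in this article, while the 1,000 output tokens per page and the function names are assumptions. It shows the cache size growing linearly with page count, while the total number of cache reads over a whole run grows roughly quadratically.

```python
# Assumed figures for illustration only. 256 visual tokens per page matches the
# encoder figure quoted in this article; the output length per page is a guess.
VISUAL_TOKENS_PER_PAGE = 256
OUTPUT_TOKENS_PER_PAGE = 1000


def full_cache_entries(pages: int) -> int:
    """Cache entries after transcribing `pages` pages with no eviction."""
    return pages * (VISUAL_TOKENS_PER_PAGE + OUTPUT_TOKENS_PER_PAGE)


def cumulative_attention_reads(pages: int) -> int:
    """Total cache entries read over all decoding steps of one run."""
    reference = pages * VISUAL_TOKENS_PER_PAGE
    generated = pages * OUTPUT_TOKENS_PER_PAGE
    # Step t (1-based) reads every reference token plus the t tokens generated so far.
    return reference * generated + generated * (generated + 1) // 2


for pages in (1, 10, 40):
    print(pages, full_cache_entries(pages), cumulative_attention_reads(pages))
```

Under these assumptions, ten pages need ten times the cache of one page but roughly a hundred times the attention reads, which is why latency keeps rising as the document gets longer.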
Human Forgetting as a Model
Baidu solves this with an elegant analogy to human perception. When copying a book, you don't constantly re-read everything you've written. You focus on the source, the last few characters, and what comes next. Older passages fade through a kind of "soft forgetting."
That's exactly what R-SWA does: each newly generated token sees all visual reference tokens and the prompt, but when looking back at already-generated output it attends only to the last 128 tokens. The KV-cache therefore stays at a fixed size during decoding (the reference tokens plus a 128-token window) instead of growing with the output. An additional trick: visual tokens are encoded once and remain unchanged, so they are not blurred by ongoing state changes.
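The attention pattern described here can be sketched as a visibility rule. This is a minimal illustration, not Baidu's implementation: the function name `rswa_visible`, the index layout (reference tokens first, generated tokens after) and the choice to count the window over strictly earlier output tokens are all assumptions.

```python
WINDOW = 128  # look-back over generated output, as stated in the article


def rswa_visible(step: int, num_reference: int, window: int = WINDOW) -> list[int]:
    """Positions visible to the generated token at 0-based `step`.

    Layout assumed here: indices 0..num_reference-1 hold the visual and prompt
    tokens, and generated token i sits at index num_reference + i.
    """
    # Every reference token is always visible.
    reference = list(range(num_reference))
    # Only the most recent `window` earlier output tokens are visible.
    first = max(0, step - window)
    recent = [num_reference + i for i in range(first, step)]
    return reference + recent


# The visible set stops growing once `window` output tokens exist.
for step in (0, 50, 128, 1000, 100_000):
    print(step, len(rswa_visible(step, num_reference=256)))
```

With 256 reference tokens, the number of visible positions rises from 256 to 384 over the first 128 steps and then stays there, no matter how long the transcription runs.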
| Aspect | Previous Systems | Unlimited OCR |
|---|---|---|
| Pages per pass | ~10 | Dozens |
| KV-cache growth | Linear | Constant |
| Latency across decoding steps | Rising | Flat |
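The "Constant" and "Flat" entries in the table follow from bounding what the cache holds. Below is a rough sketch using a fixed reference block plus a ring buffer. The class name `BoundedKVCache` and the use of plain integers as stand-ins for key/value entries are hypothetical; a real implementation would store tensors per layer and head.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: fixed reference entries plus a sliding window of recent output."""

    def __init__(self, reference, window: int = 128):
        # Reference entries are stored once and never rewritten.
        self.reference = tuple(reference)
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.recent = deque(maxlen=window)

    def append(self, entry) -> None:
        self.recent.append(entry)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


cache = BoundedKVCache(reference=range(256))
peak = 0
for step in range(50_000):
    cache.append(step)  # stand-in for the new token's key/value pair
    peak = max(peak, len(cache))

print(peak, len(cache), cache.recent[0])
```

Because each decoding step reads at most 384 entries in this toy setup, the work per step does not depend on how many tokens have already been generated.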
Architecture and Training
Unlimited OCR builds on Deepseek OCR. The DeepEncoder compresses a 1024×1024-pixel PDF image down to 256 tokens. The decoder network is a Mixture-of-Experts architecture with three billion parameters, of which only around 500 million are active during inference – saving compute. Training used roughly two million document samples, split 9-to-1 between single-page and multi-page data.
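The figures in this paragraph can be turned into a few ratios. The sketch only restates the article's rounded numbers; it assumes the 9-to-1 split is ordered as written (nine parts single-page, one part multi-page), and the variable names are illustrative.

```python
# Encoder compression: pixels represented by each visual token.
PIXELS = 1024 * 1024
VISION_TOKENS = 256
pixels_per_token = PIXELS // VISION_TOKENS  # equivalent to a 64x64 pixel area

# Mixture-of-Experts: share of decoder parameters active per token.
TOTAL_PARAMS = 3_000_000_000
ACTIVE_PARAMS = 500_000_000
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

# Training mix, assuming 9 parts single-page to 1 part multi-page.
SAMPLES = 2_000_000
single_page = SAMPLES * 9 // 10
multi_page = SAMPLES - single_page

print(pixels_per_token, round(active_share, 3), single_page, multi_page)
```

So each token stands in for about 4,096 pixels, roughly one sixth of the decoder's parameters are active per token, and multi-page documents make up only about 200,000 of the training samples.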
What This Means for You
This matters especially for German enterprises handling document processing – insurance, government, logistics, financial services. A system processing dozens of pages in one pass could dramatically speed up batch processing and reduce memory demands. Key questions remain: How well does Unlimited OCR handle German-language documents and specialized formats (forms, tables)? When will it become publicly available? Baidu has demonstrated a technical edge here – German and European teams should watch closely.
Sources
Editorially owned by Ideal Syka. Sources and method: Newsroom & method. Tips and corrections: ai@i6eal.de.




