Baidu Breaks OCR Bottleneck: Dozens of Pages in One Pass

Chinese researchers have built a document recognition system that shatters the previous ten-page limit. A modified attention mechanism keeps memory usage constant regardless of document length.

Baidu researchers have developed an OCR model that processes dozens of document pages in a single inference pass, whereas previous systems topped out at around ten pages. The system, called Unlimited OCR, uses a novel attention mechanism named Reference Sliding Window Attention (R-SWA) to keep memory use and decoding speed constant, regardless of how much text is generated.

Quick Facts

  • Unlimited OCR handles dozens of pages in one pass, compared to the previous limit of about ten pages
  • The innovation's core: R-SWA keeps the KV-cache at constant size instead of growing linearly
  • Baidu uses DeepSeek OCR as its foundation and pairs it with a Mixture-of-Experts decoder (3 billion parameters, of which around 500 million are active during inference)
  • Trained on roughly two million document samples; the system currently tops the most important OCR benchmark

The Problem: The KV-Cache Bottleneck

Previous OCR systems hit a technical wall. Language models store all processed tokens in a KV-cache during text generation, a buffer they read back at every later step. With multi-page documents, this cache grows linearly with every generated token. Memory use climbs steadily, and each decoding step gets slower because it has to attend over an ever longer history. The practical workaround was crude: process each page separately, reset the cache, and move to the next page, which is inefficient and slow.
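To make the growth concrete, here is a minimal sketch of how many cache entries a conventional full-attention decoder would hold after a given number of pages. The 256 visual tokens per page is the encoder figure reported in this article; the 1,000 output tokens per page and both function names are hypothetical values chosen purely for illustration.

```python
VISUAL_TOKENS_PER_PAGE = 256   # encoder output per page, as reported in this article
TEXT_TOKENS_PER_PAGE = 1000    # hypothetical: assumed transcription length per page


def full_cache_entries(pages: int) -> int:
    """KV-cache entries held after decoding `pages` pages with full attention."""
    return pages * (VISUAL_TOKENS_PER_PAGE + TEXT_TOKENS_PER_PAGE)


def entries_seen_by_last_token(pages: int) -> int:
    """Entries the final generated token attends to: everything before it."""
    return full_cache_entries(pages) - 1


if __name__ == "__main__":
    for pages in (1, 10, 30):
        print(pages, full_cache_entries(pages))
```

Under these assumptions the cache holds 1,256 entries after one page and 37,680 after thirty. The growth is linear in the page count, and so is the work each new token has to do.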

Human Forgetting as a Model

Baidu solves this with an elegant analogy to human perception. When copying a book, you don't constantly re-read everything you've written. You focus on the source, the last few characters, and what comes next. Older passages fade through a kind of "soft forgetting."

That's exactly what R-SWA does: each newly generated token sees all visual reference tokens and the prompt, but when looking back at already-generated output, it attends only to the last 128 tokens. The KV-cache therefore stays at a fixed size (the reference tokens plus a 128-token window) instead of growing. An additional trick: visual tokens are encoded once and then left unchanged, so they are not degraded by later state updates.
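The visibility rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Baidu's implementation: `visible_positions` and `BoundedCache` are hypothetical names, and the position layout (reference tokens first, generated tokens after) is an assumption.

```python
from collections import deque

WINDOW = 128  # look-back over generated tokens, as described in the article


def visible_positions(num_reference: int, num_generated: int,
                      window: int = WINDOW) -> list[int]:
    """Positions a newly generated token may attend to.

    Assumed layout: positions 0..num_reference-1 hold the reference tokens
    (visual tokens plus prompt); generated output tokens follow.
    """
    total = num_reference + num_generated
    first_recent = max(num_reference, total - window)
    return list(range(num_reference)) + list(range(first_recent, total))


class BoundedCache:
    """Hypothetical cache: fixed reference entries plus a ring buffer."""

    def __init__(self, reference, window: int = WINDOW):
        self.reference = tuple(reference)    # stored once, never rewritten
        self.recent = deque(maxlen=window)   # oldest generated entry drops out

    def append(self, entry) -> None:
        self.recent.append(entry)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


if __name__ == "__main__":
    cache = BoundedCache(range(256))
    for token in range(5000):
        cache.append(token)
    print(len(cache))  # reference entries plus at most WINDOW recent ones
```

With 256 reference tokens, the cache in this sketch never exceeds 384 entries, no matter how many output tokens are appended.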

Aspect                        | Previous Systems | Unlimited OCR
Pages per pass                | ~10              | Dozens
KV-cache growth               | Linear           | Constant
Latency across decoding steps | Rising           | Flat

Architecture and Training

Unlimited OCR builds on DeepSeek OCR. The DeepEncoder compresses a 1024×1024-pixel image of a PDF page down to 256 tokens. The decoder network is a Mixture-of-Experts architecture with three billion parameters, of which only around 500 million are active during inference, which saves compute. Training used roughly two million document samples, split 9-to-1 between single-page and multi-page data.
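The figures in this section can be put into proportion with some quick arithmetic. The sketch below only recombines the numbers quoted above; the function names are hypothetical, and reading the 9-to-1 split as 90 percent single-page and 10 percent multi-page is an assumption.

```python
def pixels_per_token(side_px: int = 1024, tokens: int = 256) -> int:
    """Average number of image pixels represented by one visual token."""
    return (side_px * side_px) // tokens


def active_fraction(active: float = 500e6, total: float = 3e9) -> float:
    """Share of decoder parameters active for each token during inference."""
    return active / total


def split_samples(total: int = 2_000_000, single_parts: int = 9,
                  multi_parts: int = 1) -> tuple[int, int]:
    """Approximate (single-page, multi-page) sample counts for a 9-to-1 split."""
    single = total * single_parts // (single_parts + multi_parts)
    return single, total - single


if __name__ == "__main__":
    print(pixels_per_token())            # pixels per visual token
    print(round(active_fraction(), 3))   # active parameter share
    print(split_samples())               # (single-page, multi-page)
```

On these numbers, each visual token stands for 4,096 pixels on average (the equivalent of a 64×64 patch), about one sixth of the decoder's parameters are active per token, and roughly 200,000 of the training samples are multi-page documents.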

What This Means for You

This matters especially for German enterprises with heavy document-processing workloads: insurance, government, logistics, financial services. A system that processes dozens of pages in one pass could speed up batch processing considerably and reduce memory demands. Key questions remain: How well does Unlimited OCR handle German-language documents and specialized formats such as forms and tables? When will it become publicly available? Baidu has demonstrated a technical edge here, and German and European teams should watch closely.

Sources

Editorially owned by Ideal Syka. Sources and method: Newsroom & method. Tips and corrections: ai@i6eal.de.

All analyses are based on i6eal's own measurements or on clearly labelled sources. Figures are snapshots and may change; corrections are disclosed transparently.