Deepseek's OCR system compresses image-based text so AI can handle much longer documents

Chinese AI company Deepseek has built an OCR system that compresses image-based text documents for language models, aiming to let AI handle much longer contexts without running into memory limits.

The main idea is that processing text as an image can use less compute than working with the digital text itself. According to Deepseek's technical paper, the system can compress text by up to a factor of ten while keeping 97 percent of the original information.

DeepSeek-OCR extracts structured chart data from financial reports and renders it in Markdown.

Deepseek OCR's deep parsing mode can convert financial charts into structured data, automatically generating Markdown tables and graphs. | Image: Deepseek

The system has two core parts: DeepEncoder, which handles image processing, and a text generator built on Deepseek3B-MoE with 570 million active parameters. DeepEncoder itself uses 380 million parameters to analyze each image and produce a compressed version.

Block diagram of DeepSeek-OCR with SAM-ViTDet, 16× convolution compressor, CLIP ViT-300M, and DeepSeek-3B MoE decoder.

DeepEncoder joins Meta's 80-million-parameter SAM (Segment Anything Model) for image segmentation with OpenAI's 300-million-parameter CLIP, which links images and text. A 16x compressor sits between them, drastically cutting the number of image tokens. A 1,024 by 1,024 pixel image starts with 4,096 tokens. SAM processes them, and the compressor reduces this to just 256 tokens, which are then passed to the compute-intensive CLIP model.
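The token arithmetic in this pipeline can be checked with a short sketch. It assumes the image is cut into square 16 by 16 pixel patches, a value inferred from the 4,096 tokens quoted for a 1,024 by 1,024 image rather than stated in the article; the function name and parameters are illustrative only.

```python
def vision_token_counts(image_side: int = 1024,
                        patch_side: int = 16,   # assumption: inferred from 4,096 tokens at 1,024 px
                        compression: int = 16   # the 16x compressor described in the article
                        ) -> tuple[int, int]:
    """Return (patch tokens before compression, tokens after compression)."""
    patches_per_side = image_side // patch_side
    patch_tokens = patches_per_side ** 2
    compressed_tokens = patch_tokens // compression
    return patch_tokens, compressed_tokens


before, after = vision_token_counts()
print(before, after)  # 4096 256
```

Only the 256 compressed tokens reach CLIP in this picture, which is why the expensive part of the encoder stays cheap even for large inputs.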

Deepseek OCR can work with a range of image resolutions. At lower resolutions, it needs only 64 "vision tokens" per image; at higher resolutions, up to 400. By comparison, conventional OCR systems often require thousands of tokens for the same task.

DeepSeek OCR converts Chinese geometry problems into Markdown, extracts figures as vector graphs, and renders them anew.

In OmniDocBench tests, Deepseek OCR outperformed GOT-OCR 2.0 while using just 100 vision tokens per page, compared with GOT-OCR's 256. With fewer than 800 tokens, it also beat MinerU 2.0, which requires more than 6,000 tokens per page.

Edit distances of OCR models (English/Chinese) on OmniDocBench: DeepSeek-OCR Gundam-M†200dpi achieves the best results.

Token requirements depend on the document. Simple presentations use 64 tokens. Books and reports need about 100. Complex newspapers require Deepseek's "Gundam mode" with up to 800 tokens.
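One way to reason about these budgets is through the compression ratio, meaning the number of text tokens on a page divided by the number of vision tokens used to represent it. The sketch below is a hypothetical heuristic, not Deepseek's selection logic: it picks the smallest budget mentioned in the article that keeps the ratio at or below the tenfold figure quoted from the paper. The function names and the idea of choosing a budget this way are assumptions made for illustration.

```python
# Vision-token budgets mentioned in the article; 800 is the stated upper end of "Gundam mode".
BUDGETS = [64, 100, 256, 400, 800]

# Article: up to tenfold compression while keeping about 97 percent of the information.
MAX_RATIO = 10.0


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens


def smallest_budget(text_tokens: int, max_ratio: float = MAX_RATIO):
    """Hypothetical heuristic: smallest budget whose ratio stays within max_ratio."""
    for budget in BUDGETS:
        if compression_ratio(text_tokens, budget) <= max_ratio:
            return budget
    return None  # no listed budget keeps the ratio within the limit


print(smallest_budget(600))   # 64   (600 / 64 = 9.375)
print(smallest_budget(900))   # 100  (900 / 100 = 9.0)
print(smallest_budget(9000))  # None (9000 / 800 = 11.25)
```

A page with more text simply needs more vision tokens to stay inside the same ratio, which matches the article's progression from slides to books to dense newspapers.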

Four DeepSeek OCR modes – Resize 64/100, Padding 256/400×R, Multi-page n·100/256+256/400, Sliding n·100/256+256/400×R

The system supports a wide range of document types, from plain text to diagrams, chemical formulas, and geometric figures. It works in about 100 languages, can either preserve the original formatting or output plain text, and can still provide general image descriptions.

For training, the researchers used 30 million PDF pages in roughly 100 languages, including 25 million in Chinese and English, along with 10 million synthetic diagrams, 5 million chemical formulas, and 1 million geometric figures.

Processes 33 million pages per day

In real-world use, Deepseek OCR can process over 200,000 pages per day on a single Nvidia A100 GPU. With 20 servers, each running eight A100s, throughput jumps to 33 million pages daily.
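The cluster figure follows from simple multiplication, as the sketch below shows. It assumes throughput scales linearly with the number of GPUs, which the article implies but does not state, and it treats "over 200,000" as a lower bound per GPU; the variable and function names are illustrative.

```python
PAGES_PER_GPU_PER_DAY = 200_000  # article: "over 200,000" on one A100, so a lower bound
SERVERS = 20
GPUS_PER_SERVER = 8
REPORTED_CLUSTER_PAGES = 33_000_000  # article's daily figure for the full cluster


def cluster_pages_per_day(pages_per_gpu: int, servers: int, gpus_per_server: int) -> int:
    """Daily pages for the whole cluster, assuming linear scaling across GPUs."""
    return pages_per_gpu * servers * gpus_per_server


total_gpus = SERVERS * GPUS_PER_SERVER
lower_bound = cluster_pages_per_day(PAGES_PER_GPU_PER_DAY, SERVERS, GPUS_PER_SERVER)
implied_per_gpu = REPORTED_CLUSTER_PAGES / total_gpus

print(total_gpus)       # 160
print(lower_bound)      # 32000000
print(implied_per_gpu)  # 206250.0
```

The 33 million figure therefore corresponds to roughly 206,000 pages per GPU per day, consistent with the "over 200,000" quoted for a single A100.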

Loss of information over time, distance, and resolution.

This kind of throughput could help build training datasets for other AI models. Modern language models need massive amounts of text, and Deepseek OCR can extract it from documents. Both the code and model weights are publicly available.
