MLLMs for OCR and Document AI: Beyond Traditional Text Recognition
Multimodal LLMs are replacing traditional OCR pipelines for document understanding. They read layouts, understand context, and extract structured data from messy real-world documents.
View all mllms depths βDepth ladder for this topic:
Traditional OCR converts pixels to text. Thatβs it. It doesnβt understand what itβs reading. A receipt, a contract, and a handwritten note all produce flat text streams with no structure. Multimodal LLMs change this fundamentally β they understand documents the way humans do.
The Shift from OCR to Document Understanding
Traditional OCR pipeline:
Image β Text detection β Character recognition β Raw text β Post-processing β Structured data
MLLM approach:
Image β Model β Structured data (directly)
The MLLM sees the document holistically. It understands that a number next to βTotal:β is a price. It knows that text in a box at the top is likely a header. It can read handwriting, interpret tables, and handle rotated or skewed text β all in one pass.
What MLLMs Can Do That OCR Canβt
Layout Understanding
Traditional OCR reads left-to-right, top-to-bottom. It doesnβt understand columns, sidebars, or complex layouts. Feed it a newspaper page and you get interleaved text from multiple articles.
MLLMs understand spatial relationships:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2048,
messages=[{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "data": invoice_b64, "media_type": "image/png"}},
{"type": "text", "text": """Extract all fields from this invoice as JSON:
{vendor, invoice_number, date, due_date,
line_items: [{description, qty, unit_price, total}],
subtotal, tax, total, payment_terms}"""}
]
}]
)
The model handles varied invoice layouts β different vendors, different formats, different languages β without any format-specific rules.
Contextual Interpretation
OCR tells you a document contains βNet 30.β An MLLM understands this means payment is due in 30 days and can extract it as {"payment_terms": "Net 30", "payment_due_days": 30}.
Handwriting and Degraded Text
MLLMs handle messy handwriting, faded text, stamps overlapping text, and coffee stains surprisingly well. They use context to infer illegible characters β the same way humans do.
Multi-Language Documents
A single document might contain English text, Chinese characters, and Arabic numbers. MLLMs handle this without switching OCR engines or language models.
Best Models for Document AI
| Model | Strengths | Best For |
|---|---|---|
| GPT-4o | Excellent general document understanding | Varied document types |
| Claude 3.5 Sonnet/Opus | Strong structured extraction, good with tables | Invoice/form processing |
| Gemini 2.0 Flash | Fast, good quality, handles long documents | High-volume processing |
| Qwen2-VL 72B | Strong OCR, competitive quality, open weights | Self-hosted pipelines |
| GOT-OCR | Specialized for OCR, very accurate | When pure text extraction is enough |
Production Architecture
For Low Volume (<1000 docs/day)
Direct API calls work fine:
async def process_document(image_bytes: bytes, doc_type: str) -> dict:
prompt = EXTRACTION_PROMPTS[doc_type] # Type-specific extraction prompt
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "data": base64.b64encode(image_bytes).decode()}},
{"type": "text", "text": prompt}
]
}]
)
result = json.loads(response.content[0].text)
return validate_extraction(result, doc_type)
For High Volume (>1000 docs/day)
Add a pipeline with classification β routing β extraction:
Document β Classify (fast model) β Route to specialized prompt β Extract (best model for type) β Validate β Store
β
[invoice prompt]
[receipt prompt]
[contract prompt]
[form prompt]
Use a cheaper model (GPT-4o-mini, Gemini Flash) for classification, and a stronger model for extraction of complex documents.
Hybrid Approach
For structured documents with known layouts (government forms, standardized invoices), traditional OCR + template matching is faster and cheaper. Use MLLMs as a fallback for documents that donβt match known templates.
def process_document(image):
# Try template matching first (fast, cheap)
result = template_ocr(image)
if result.confidence > 0.95:
return result
# Fall back to MLLM (slower, more capable)
return mllm_extract(image)
Accuracy and Validation
MLLMs make mistakes β just different mistakes than OCR. Common issues:
- Hallucinated fields: The model invents data that isnβt in the document. Always validate extracted values against the source.
- Number transposition: β1,234β becomes β1,243β. Critical for financial documents.
- Date format ambiguity: β03/04/2026β β March 4 or April 3? Specify the expected format in your prompt.
Validation strategies:
def validate_invoice(extracted: dict, image: bytes) -> dict:
# Arithmetic check: do line items sum to subtotal?
computed_subtotal = sum(item['total'] for item in extracted['line_items'])
if abs(computed_subtotal - extracted['subtotal']) > 0.01:
extracted['_warnings'].append('Line item totals don\'t match subtotal')
# Cross-reference check: verify total = subtotal + tax
if abs(extracted['subtotal'] + extracted['tax'] - extracted['total']) > 0.01:
extracted['_warnings'].append('Total doesn\'t match subtotal + tax')
# Confidence: ask the model to verify its own extraction
verification = verify_with_second_pass(extracted, image)
return extracted
Cost Comparison
For a typical invoice (one page, ~500 tokens of extracted data):
| Method | Cost per Document | Accuracy |
|---|---|---|
| Traditional OCR (Tesseract) | ~$0.001 | 85-92% |
| Cloud OCR (Google/AWS) | $0.005-0.01 | 92-96% |
| GPT-4o-mini | ~$0.005 | 93-97% |
| Claude Sonnet | ~$0.01 | 95-98% |
| GPT-4o | ~$0.02 | 96-99% |
The MLLM approach costs more per document but requires dramatically less engineering effort β no templates, no format-specific rules, no OCR post-processing logic.
When to Use What
Use traditional OCR when: documents are clean and structured, volume is very high, cost is the primary concern, and you have engineering resources for post-processing.
Use MLLMs when: document formats vary, layout is complex, you need structured extraction (not just text), documents include handwriting or mixed languages, or you want to ship fast without building format-specific parsers.
Use hybrid when: you have a mix of standardized and varied documents, cost matters at scale, and you want high accuracy with reasonable cost.
Simplify
β Multimodal LLM Safety: Alignment Challenges Across Modalities
Go deeper
MLLMs in the Wild: Real-World Visual Understanding Beyond Benchmarks β
Related reads
Stay ahead of the AI curve
Weekly insights on AI β explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.