
MLLMs for OCR and Document AI: Beyond Traditional Text Recognition

Multimodal LLMs are replacing traditional OCR pipelines for document understanding. They read layouts, understand context, and extract structured data from messy real-world documents.


Traditional OCR converts pixels to text. That's it. It doesn't understand what it's reading. A receipt, a contract, and a handwritten note all produce flat text streams with no structure. Multimodal LLMs change this fundamentally: they read a document's layout and content together, closer to the way a human reader does.

The Shift from OCR to Document Understanding

Traditional OCR pipeline:

Image → Text detection → Character recognition → Raw text → Post-processing → Structured data

MLLM approach:

Image → Model → Structured data (directly)

The MLLM sees the document holistically. It understands that a number next to "Total:" is a price. It knows that text in a box at the top is likely a header. It can read handwriting, interpret tables, and handle rotated or skewed text, all in one pass.

What MLLMs Can Do That OCR Can't

Layout Understanding

Traditional OCR reads left-to-right, top-to-bottom. It doesn't understand columns, sidebars, or complex layouts. Feed it a newspaper page and you get interleaved text from multiple articles.

MLLMs understand spatial relationships:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# invoice_b64 is the base64-encoded PNG of the invoice image
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "data": invoice_b64, "media_type": "image/png"}},
            {"type": "text", "text": """Extract all fields from this invoice as JSON:
            {vendor, invoice_number, date, due_date, 
             line_items: [{description, qty, unit_price, total}],
             subtotal, tax, total, payment_terms}"""}
        ]
    }]
)

The model handles varied invoice layouts (different vendors, different formats, different languages) without any format-specific rules.

Contextual Interpretation

OCR tells you a document contains "Net 30." An MLLM understands this means payment is due in 30 days and can extract it as {"payment_terms": "Net 30", "payment_due_days": 30}.
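A derived field like payment_due_days is easy to cross-check with ordinary code. The sketch below is one way to do that, assuming terms follow the common "Net N" wording; the function names are illustrative and not part of any library.

```python
import re
from typing import Optional


def payment_due_days(terms: str) -> Optional[int]:
    """Return N for a 'Net N' payment term, or None if no such term is present."""
    match = re.search(r"\bnet\s*(\d{1,3})\b", terms, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def normalize_payment_terms(terms: str) -> dict:
    """Build the same two-field shape the model is asked to extract."""
    return {
        "payment_terms": terms.strip(),
        "payment_due_days": payment_due_days(terms),
    }
```

If the model's payment_due_days disagrees with the value computed from its own payment_terms string, that is a cheap signal to flag the document for review.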

Handwriting and Degraded Text

MLLMs handle messy handwriting, faded text, stamps overlapping text, and coffee stains surprisingly well. They use context to infer illegible characters, the same way humans do.

Multi-Language Documents

A single document might contain English text, Chinese characters, and Arabic numbers. MLLMs handle this without switching OCR engines or language models.

Best Models for Document AI

| Model | Strengths | Best For |
|---|---|---|
| GPT-4o | Excellent general document understanding | Varied document types |
| Claude 3.5 Sonnet/Opus | Strong structured extraction, good with tables | Invoice/form processing |
| Gemini 2.0 Flash | Fast, good quality, handles long documents | High-volume processing |
| Qwen2-VL 72B | Strong OCR, competitive quality, open weights | Self-hosted pipelines |
| GOT-OCR | Specialized for OCR, very accurate | When pure text extraction is enough |

Production Architecture

For Low Volume (<1000 docs/day)

Direct API calls work fine:

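import base64
import json

import anthropic

client = anthropic.AsyncAnthropic()  # async client, needed for `await` below

# EXTRACTION_PROMPTS and validate_extraction are defined elsewhere in your codebase
async def process_document(image_bytes: bytes, doc_type: str, media_type: str = "image/png") -> dict:
    prompt = EXTRACTION_PROMPTS[doc_type]  # Type-specific extraction prompt
    
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                # media_type is required for base64 image sources
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": base64.b64encode(image_bytes).decode()}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    
    # Raises json.JSONDecodeError if the reply contains anything besides JSON
    result = json.loads(response.content[0].text)
    return validate_extraction(result, doc_type)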
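The json.loads call above assumes the reply is pure JSON. In practice a model may wrap the object in a sentence or a Markdown code fence. A minimal sketch of a more forgiving parser follows; parse_json_reply is a hypothetical helper, and the outermost-braces fallback is a heuristic rather than a full parser.

```python
import json


def parse_json_reply(text: str) -> dict:
    """Parse a model reply that should contain exactly one JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            raise ValueError("no JSON object found in model reply")
        # May still raise json.JSONDecodeError, which is a ValueError subclass.
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

Failing loudly with ValueError keeps malformed replies out of downstream validation, so the caller can retry or route the document to manual review.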

For High Volume (>1000 docs/day)

Add a pipeline with classification → routing → extraction:

Document → Classify (fast model) → Route to specialized prompt → Extract (best model for type) → Validate → Store
                                         ↓
                                   [invoice prompt]
                                   [receipt prompt]
                                   [contract prompt]
                                   [form prompt]

Use a cheaper model (GPT-4o-mini, Gemini Flash) for classification, and a stronger model for extraction of complex documents.
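The classify, route, and extract steps can be sketched as a small function that takes the two model calls as parameters. Everything here is an assumption for illustration: the prompt texts, the fallback prompt, and the callable signatures. In a real pipeline, classify would wrap the cheaper model and extract the stronger one.

```python
from typing import Callable, Dict

# Illustrative prompts; real ones would be longer and type-specific.
EXTRACTION_PROMPTS: Dict[str, str] = {
    "invoice": "Extract vendor, invoice_number, date, line_items, total as JSON.",
    "receipt": "Extract merchant, date, items, total as JSON.",
    "contract": "Extract parties, effective_date, term, governing_law as JSON.",
    "form": "Extract every labeled field and its value as JSON.",
}
FALLBACK_PROMPT = "Extract every labeled field in this document as JSON."


def route_and_extract(
    image_bytes: bytes,
    classify: Callable[[bytes], str],       # cheap model: image -> doc type label
    extract: Callable[[bytes, str], dict],  # strong model: image + prompt -> fields
    prompts: Dict[str, str] = EXTRACTION_PROMPTS,
) -> dict:
    """Classify the document, pick the matching prompt, then extract."""
    doc_type = classify(image_bytes)
    known = doc_type in prompts
    prompt = prompts[doc_type] if known else FALLBACK_PROMPT
    fields = extract(image_bytes, prompt)
    return {"doc_type": doc_type if known else "unknown", "fields": fields}
```

Passing the model calls in as callables keeps the routing logic testable without network access and makes it easy to swap models per document type.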

Hybrid Approach

For structured documents with known layouts (government forms, standardized invoices), traditional OCR + template matching is faster and cheaper. Use MLLMs as a fallback for documents that don't match known templates.

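def process_document(image):
    # template_ocr and mllm_extract are placeholders for your own implementations
    # Try template matching first (fast, cheap)
    result = template_ocr(image)
    if result.confidence > 0.95:  # illustrative threshold; tune on your own data
        return result
    
    # Fall back to MLLM (slower, more capable)
    return mllm_extract(image)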

Accuracy and Validation

MLLMs make mistakes too, just different ones than OCR. Common issues:

  • Hallucinated fields: The model invents data that isn't in the document. Always validate extracted values against the source.
  • Number transposition: "1,234" becomes "1,243". Critical for financial documents.
  • Date format ambiguity: is "03/04/2026" March 4 or April 3? Specify the expected format in your prompt.
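The date ambiguity is easy to demonstrate and to detect in code. The sketch below parses the same string under both conventions and flags strings where the two readings differ; the function names are illustrative.

```python
from datetime import date, datetime


def parse_date(text: str, day_first: bool) -> date:
    """Parse 'NN/NN/YYYY' as day/month/year or month/day/year."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


def is_ambiguous(text: str) -> bool:
    """True when both readings are valid calendar dates and they differ."""
    readings = set()
    for day_first in (True, False):
        try:
            readings.add(parse_date(text, day_first))
        except ValueError:
            pass  # this reading is not a valid date
    return len(readings) > 1
```

For "03/04/2026" the two readings are 2026-04-03 and 2026-03-04, so is_ambiguous returns True. For "25/04/2026" only the day-first reading is valid, so it returns False. Asking the model for ISO 8601 output (YYYY-MM-DD) in the prompt sidesteps the problem on the output side, though the source document's convention still has to be known or inferred.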

Validation strategies:

def validate_invoice(extracted: dict, image: bytes) -> dict:
    # Create the warnings list if the extraction did not include one
    warnings = extracted.setdefault('_warnings', [])
    
    # Arithmetic check: do line items sum to subtotal?
    computed_subtotal = sum(item['total'] for item in extracted['line_items'])
    if abs(computed_subtotal - extracted['subtotal']) > 0.01:
        warnings.append("Line item totals don't match subtotal")
    
    # Cross-reference check: verify total = subtotal + tax
    if abs(extracted['subtotal'] + extracted['tax'] - extracted['total']) > 0.01:
        warnings.append("Total doesn't match subtotal + tax")
    
    # Confidence: ask the model to verify its own extraction
    # (verify_with_second_pass is a helper you supply; keep its result for review)
    extracted['_verification'] = verify_with_second_pass(extracted, image)
    
    return extracted
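Arithmetic checks catch inconsistent numbers but not invented text fields. One inexpensive guard against hallucinated fields, assuming you also have plain OCR text for the page (for example from the template pass in the hybrid setup), is to check that each extracted string actually appears in that text. This is a sketch with hypothetical names, and it only checks string-valued fields; numbers need format-aware comparison.

```python
import re
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and drop punctuation, whitespace, and underscores."""
    return re.sub(r"[\W_]+", "", text.lower())


def ungrounded_fields(extracted: dict, ocr_text: str) -> List[str]:
    """Return keys whose string value cannot be found in the OCR text."""
    haystack = _normalize(ocr_text)
    missing = []
    for key, value in extracted.items():
        if not isinstance(value, str):
            continue  # skip numbers, lists, and nested objects
        needle = _normalize(value)
        if needle and needle not in haystack:
            missing.append(key)
    return missing
```

A field flagged here is not necessarily wrong, since OCR may have misread the region, but it is a reasonable candidate for a second look.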

Cost Comparison

For a typical invoice (one page, ~500 tokens of extracted data):

| Method | Cost per Document | Accuracy |
|---|---|---|
| Traditional OCR (Tesseract) | ~$0.001 | 85-92% |
| Cloud OCR (Google/AWS) | $0.005-0.01 | 92-96% |
| GPT-4o-mini | ~$0.005 | 93-97% |
| Claude Sonnet | ~$0.01 | 95-98% |
| GPT-4o | ~$0.02 | 96-99% |

The MLLM approach costs more per document but requires dramatically less engineering effort: no templates, no format-specific rules, no OCR post-processing logic.
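To see what the per-document figures mean at volume, here is a small worked calculation. It uses the approximate costs from the table above; the 30-day month and the 20% fallback rate are assumptions chosen only for illustration.

```python
# Approximate per-document costs in USD, taken from the table above.
COST_PER_DOC_USD = {
    "Traditional OCR (Tesseract)": 0.001,
    "GPT-4o-mini": 0.005,
    "Claude Sonnet": 0.01,
    "GPT-4o": 0.02,
}


def monthly_cost(cost_per_doc: float, docs_per_day: int, days: int = 30) -> float:
    """Total spend for a month of processing, rounded to cents."""
    return round(cost_per_doc * docs_per_day * days, 2)


def hybrid_cost_per_doc(cheap: float, expensive: float, fallback_rate: float) -> float:
    """Expected cost when every document gets the cheap pass first and a
    fraction (fallback_rate) is then re-processed by the expensive model."""
    return cheap + fallback_rate * expensive
```

At 1,000 documents per day, Claude Sonnet at about $0.01 per document comes to roughly $300 per month, against roughly $30 for Tesseract. A hybrid that falls back to the MLLM for 20% of documents costs about $0.001 + 0.2 × $0.01 = $0.003 per document.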

When to Use What

Use traditional OCR when: documents are clean and structured, volume is very high, cost is the primary concern, and you have engineering resources for post-processing.

Use MLLMs when: document formats vary, layout is complex, you need structured extraction (not just text), documents include handwriting or mixed languages, or you want to ship fast without building format-specific parsers.

Use hybrid when: you have a mix of standardized and varied documents, cost matters at scale, and you want high accuracy with reasonable cost.
