Building Multimodal AI Applications: Patterns and Pitfalls

Multimodal AI applications accept inputs that aren’t just text. Images, audio, video, documents, structured data — and often combinations. Building these applications well requires understanding not just the models, but the architectural patterns that make them reliable, efficient, and maintainable.

This is a practical guide for engineers and product teams building multimodal systems.

What “multimodal” means architecturally

At the system level, multimodal means your application handles input routing and format normalization across multiple modalities. The model itself handles the cross-modal understanding.

The key architectural insight: multimodal capability lives at the model level, but multimodal reliability lives at the system level. The model may be able to process images, but your system must:

Validate and normalize image inputs
Handle format variations (JPEG, PNG, WebP, HEIC, TIFF)
Enforce size limits and preprocessing
Handle the cases where the image is malformed, blank, or irrelevant
Manage cost and latency implications of image tokens

The model is a capable component. The system is responsible for using it correctly.

The modality matrix

Different input modalities have different handling requirements:

Modality	Primary format	Token cost	Latency impact	Key preprocessing
Text	UTF-8 string	Low	Low	Sanitization, length limits
Images	Base64 / URL	Medium (varies by size)	Medium	Resize, format normalize
Documents (PDFs)	Text extract or image	Medium-High	Medium	OCR, structure preserve
Audio	Base64 WAV/MP3	High (if transcribed)	High	Transcription, noise reduction
Video	Frames + audio	Very High	High	Frame sampling, chunking

Design your modality handling with these cost and latency characteristics in mind. A user uploading a 50MB video shouldn’t receive the same synchronous handling as a user typing a question.

Core architectural patterns

Pattern 1: Unified multimodal context

The simplest pattern: normalize all inputs to a unified context that gets passed to a multimodal model in a single call.

[Text input]  ─┐
[Image input] ─┤→ [Context builder] → [Multimodal LLM] → [Response]
[Document]    ─┘

When to use: Simple applications with well-bounded input types and lengths. Works when all inputs fit within context window limits and the combined cost per request is acceptable.

When it breaks: When inputs are too large to include together. When some modalities need preprocessing the model can’t do natively (video → frame extraction). When cost is a constraint and most requests are text-only (you’re paying for multimodal capability you don’t use on most requests).

Pattern 2: Modality-routing pipeline

Route different input types through specialized processing before aggregation:

[Input] → [Modality detector]
              ↓
     ┌────────┼────────┐
  [Text]  [Images]  [Audio]
     │        │        │
  [Direct] [Resize] [Transcribe]
     └────────┼────────┘
              ↓
        [LLM context]
              ↓
          [Response]

When to use: When different modalities need significantly different preprocessing. When you want to optimize cost by only invoking expensive modality processing when that modality is actually present.

Key benefit: You can add new modality handlers independently without restructuring the whole pipeline.

Pattern 3: Pre-processing + retrieval

Expensive modalities (long documents, video) are pre-processed offline into indexed, retrievable chunks. At query time, the user’s question retrieves relevant chunks, which are then passed to the LLM.

[Offline ingestion]
Documents/Video → Process → Embed → Vector store

[Query time]
User question → Retrieve relevant chunks → LLM → Response

When to use: When your documents or video content is relatively static and will be queried many times. When content is too large to fit in context. When you need semantic search across a large corpus of multimodal content.

Examples: Knowledge base with image-rich documentation. Video library with semantic search. Product catalog with images and specs.

Pattern 4: Sequential multimodal agents

An agent that can dynamically invoke different modality-processing tools as needed:

User: "What's in this image, and find me similar products?"
           ↓
    [LLM with tool access]
       ↓          ↓
[Analyze image] [Search catalog]
       ↓          ↓
    [Combine results] → [Response]

When to use: Complex tasks that require dynamic combinations of modalities. When you don’t know at request time which processing will be needed. When the task requires iteration (analyze image → search → compare → answer).

Complexity warning: Agent patterns are powerful but harder to debug and test. Start with simpler patterns and graduate to agents when simpler patterns hit real limits.

Input handling: the unglamorous foundation

The difference between a demo and a production application is mostly input handling. Real users send:

Images that are corrupted, blank, or in unexpected formats
PDFs that are scanned at 72 DPI, password-protected, or malformed
Audio files in MP3, OGG, FLAC, and other formats
Images that are 10MB when your limit is 1MB
Images that contain no relevant content whatsoever

Image handling

from PIL import Image
import io

def normalize_image(raw_bytes: bytes, max_dim: int = 1568) -> bytes:
    """Normalize image for LLM processing."""
    img = Image.open(io.BytesIO(raw_bytes))
    
    # Convert to RGB (handles RGBA, L, etc.)
    if img.mode != 'RGB':
        img = img.convert('RGB')
    
    # Resize if too large (many APIs recommend <1568px on longest side)
    if max(img.size) > max_dim:
        ratio = max_dim / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
    
    # Convert to JPEG for consistent format and smaller size
    output = io.BytesIO()
    img.save(output, format='JPEG', quality=85)
    return output.getvalue()

Document handling

For PDFs:

First attempt: text extraction (pdfplumber, pymupdf) — fast, accurate for text-native PDFs
Fallback: render to images and use vision model — handles scanned documents, preserves layout
Detect which path is appropriate: if extracted text is very short relative to page count, assume it’s scanned and use vision path

Input validation

Validate before processing:

File size limits (define and enforce)
Format validation (not just extension — check magic bytes)
Content safety checks before passing to expensive models
Dimension limits for images

Graceful degradation

When an input modality fails, your application should degrade gracefully:

try:
    image_context = process_image(image_bytes)
except ImageProcessingError as e:
    logger.warning(f"Image processing failed: {e}")
    image_context = "[Image could not be processed]"
    # Continue with available context rather than failing the whole request

Prompt design for multimodal

Multimodal prompts differ from text-only in important ways.

Make modality explicit: Don’t assume the model knows what to do with each input. “The user has provided an image of a product and a text description. Analyze both to…” is clearer than mixing modalities without context.

Reference modalities specifically: “Based on the image provided…” or “The receipt in the image shows…” grounds the model in the specific input.

Handle missing or poor quality inputs: Include instructions for when modalities are unavailable or low quality. “If the image is unclear or does not contain the expected content, note this and respond based on the text alone.”

Coordinate cross-modal analysis: When asking the model to synthesize across modalities, explicitly request that synthesis. “Compare what the user says in the transcript with what’s visible in the screenshots to identify any discrepancies.”

Cost management in multimodal applications

Multimodal inputs, especially images, add significant token cost. Vision tokens can be expensive:

OpenAI: images cost $0.00765 per tile (170×170 pixels) in high-detail mode
A standard 1024×1024 image → ~765 image tokens → ~$0.006 per image in input
Claude: images priced based on pixel count; a 1:1 ratio image is ~1,000 tokens

At scale, image costs dominate. Mitigations:

Right-size images before sending. Most vision tasks don’t benefit from full-resolution images. A 512×512 image works as well as a 2048×2048 for most scene description tasks.

Cache image processing results. If the same document is processed multiple times, cache the extracted content rather than re-running vision processing.

Route by modality. For applications where most requests are text-only, use a lighter (non-vision) model for text requests and only route to multimodal models when an image/document is present.

Reject clearly irrelevant images early. A quick content classifier that determines whether an image is relevant before passing it to an expensive model pays for itself on applications with high rates of irrelevant or accidental image uploads.

Testing multimodal applications

Testing is harder when inputs are images and audio. Standard approaches:

Build a fixture library. Collect representative inputs: common document formats, typical image types, edge cases (blank images, corrupted files, adversarial inputs). Version these fixtures alongside your code.

Automated quality checks. For structured extraction tasks, you can validate outputs against expected schemas and value ranges automatically. For open-ended descriptions, LLM-as-judge patterns work better than exact matching.

Cross-modal consistency tests. If your app processes both image and text describing the same content, they should produce consistent outputs. Deliberate consistency tests catch regressions.

Human evaluation for quality bar. At some point, automated metrics aren’t enough. Regular (weekly or monthly) manual review of a sample of outputs gives you calibration that automated evals miss.

The maturity model

For teams building multimodal applications:

Level 1: Accept images alongside text, pass to multimodal model, return response. No special image handling.

Level 2: Format normalization, size limits, error handling. Basic cost monitoring.

Level 3: Modality routing, pre-processing pipelines, document handling, application caching.

Level 4: Retrieval-augmented multimodal (semantic search across indexed images/documents), quality evals, cost optimization per modality.

Most production applications should be at Level 3. Level 2 is the minimum for anything beyond a demo. Level 4 is for applications where multimodal capability is core to the product value.