Building Multimodal AI Applications: Patterns and Pitfalls
How to architect applications that process multiple input types — text, images, audio, documents. The patterns that work, the tradeoffs to navigate, and the failure modes to anticipate.
Multimodal AI applications accept inputs that aren’t just text. Images, audio, video, documents, structured data — and often combinations. Building these applications well requires understanding not just the models, but the architectural patterns that make them reliable, efficient, and maintainable.
This is a practical guide for engineers and product teams building multimodal systems.
What “multimodal” means architecturally
At the system level, multimodal means your application handles input routing and format normalization across multiple modalities. The model itself handles the cross-modal understanding.
The key architectural insight: multimodal capability lives at the model level, but multimodal reliability lives at the system level. The model may be able to process images, but your system must:
- Validate and normalize image inputs
- Handle format variations (JPEG, PNG, WebP, HEIC, TIFF)
- Enforce size limits and preprocessing
- Handle the cases where the image is malformed, blank, or irrelevant
- Manage cost and latency implications of image tokens
The model is a capable component. The system is responsible for using it correctly.
The modality matrix
Different input modalities have different handling requirements:
| Modality | Primary format | Token cost | Latency impact | Key preprocessing |
|---|---|---|---|---|
| Text | UTF-8 string | Low | Low | Sanitization, length limits |
| Images | Base64 / URL | Medium (varies by size) | Medium | Resize, format normalize |
| Documents (PDFs) | Text extract or image | Medium-High | Medium | OCR, structure preserve |
| Audio | Base64 WAV/MP3 | Low if transcribed to text, high if sent as native audio | High | Transcription, noise reduction |
| Video | Frames + audio | Very High | High | Frame sampling, chunking |
Design your modality handling with these cost and latency characteristics in mind. A user uploading a 50MB video shouldn’t receive the same synchronous handling as a user typing a question.
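One way to act on the matrix is to decide up front whether a request is handled inline or queued. The following is a minimal sketch: the size thresholds, the `ALWAYS_ASYNC` set, and the function name `choose_handling` are illustrative assumptions, not recommended values.

```python
# Hypothetical per-modality size limits for synchronous handling.
# Tune these against your own latency budget.
SYNC_SIZE_LIMIT_BYTES = {
    "text": 1_000_000,
    "image": 5_000_000,
    "document": 10_000_000,
}

# Assumption: audio and video are always queued because preprocessing is slow.
ALWAYS_ASYNC = {"audio", "video"}


def choose_handling(modality: str, size_bytes: int) -> str:
    """Return "sync" for cheap inputs and "async" for inputs that should be queued."""
    if modality in ALWAYS_ASYNC:
        return "async"
    limit = SYNC_SIZE_LIMIT_BYTES.get(modality)
    if limit is None:
        raise ValueError(f"Unsupported modality: {modality}")
    return "sync" if size_bytes <= limit else "async"
```

With these assumed limits, a typed question stays synchronous while a 50MB video is queued for background processing.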
Core architectural patterns
Pattern 1: Unified multimodal context
The simplest pattern: normalize all inputs to a unified context that gets passed to a multimodal model in a single call.
[Text input]  ─┐
[Image input] ─┤→ [Context builder] → [Multimodal LLM] → [Response]
[Document]    ─┘
When to use: Simple applications with well-bounded input types and lengths. Works when all inputs fit within context window limits and the combined cost per request is acceptable.
When it breaks: When inputs are too large to include together. When some modalities need preprocessing the model can’t do natively (video → frame extraction). When cost is a constraint and most requests are text-only (you’re paying for multimodal capability you don’t use on most requests).
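A context builder for this pattern might look like the sketch below. The part schema (`type`, `media_type`, `data`, `text`) is a hypothetical, provider-neutral shape chosen for illustration; real APIs each define their own message format, so map these parts onto whichever one you use.

```python
import base64
from typing import Optional


def build_context(
    text: str,
    image_bytes: Optional[bytes] = None,
    document_text: Optional[str] = None,
) -> list[dict]:
    """Assemble one ordered list of content parts for a single model call."""
    parts: list[dict] = []
    if image_bytes is not None:
        parts.append({
            "type": "image",
            # Assumption: the image was already normalized to JPEG upstream.
            "media_type": "image/jpeg",
            "data": base64.b64encode(image_bytes).decode("ascii"),
        })
    if document_text:
        parts.append({"type": "text", "text": f"Document contents:\n{document_text}"})
    # The user's question goes last so it follows the material it refers to.
    parts.append({"type": "text", "text": text})
    return parts
```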
Pattern 2: Modality-routing pipeline
Route different input types through specialized processing before aggregation:
[Input] → [Modality detector]
                   ↓
          ┌────────┼────────┐
       [Text]  [Images]  [Audio]
          │        │        │
      [Direct] [Resize] [Transcribe]
          └────────┼────────┘
                   ↓
             [LLM context]
                   ↓
              [Response]
When to use: When different modalities need significantly different preprocessing. When you want to optimize cost by only invoking expensive modality processing when that modality is actually present.
Key benefit: You can add new modality handlers independently without restructuring the whole pipeline.
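The independence of handlers can be sketched with a small registry. Everything here (`HANDLERS`, `register`, `route`, and the placeholder handler bodies) is a hypothetical illustration; real handlers would resize images or call a transcription service.

```python
from typing import Callable

Handler = Callable[[bytes], str]
HANDLERS: dict[str, Handler] = {}


def register(modality: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a handler to the registry under a modality name."""
    def decorator(fn: Handler) -> Handler:
        HANDLERS[modality] = fn
        return fn
    return decorator


@register("text")
def handle_text(payload: bytes) -> str:
    return payload.decode("utf-8", errors="replace")


@register("image")
def handle_image(payload: bytes) -> str:
    # Placeholder: a real handler would resize and encode the image.
    return f"[image, {len(payload)} bytes]"


def route(inputs: list[tuple[str, bytes]]) -> list[str]:
    """Run each (modality, payload) pair through its handler, preserving order."""
    context = []
    for modality, payload in inputs:
        handler = HANDLERS.get(modality)
        if handler is None:
            context.append(f"[unsupported modality: {modality}]")
            continue
        context.append(handler(payload))
    return context
```

Adding audio support then means registering one more function; `route` itself does not change.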
Pattern 3: Pre-processing + retrieval
Expensive modalities (long documents, video) are pre-processed offline into indexed, retrievable chunks. At query time, the user’s question retrieves relevant chunks, which are then passed to the LLM.
[Offline ingestion]
Documents/Video → Process → Embed → Vector store
[Query time]
User question → Retrieve relevant chunks → LLM → Response
When to use: When your documents or video content is relatively static and will be queried many times. When content is too large to fit in context. When you need semantic search across a large corpus of multimodal content.
Examples: Knowledge base with image-rich documentation. Video library with semantic search. Product catalog with images and specs.
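The two phases can be sketched in a few lines. This is a toy under stated assumptions: `embed` uses word counts as a stand-in for a real embedding model, and `ChunkIndex` is an in-memory list standing in for a vector store. Only the shape of the flow (index offline, rank at query time) carries over to a real system.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ChunkIndex:
    """In-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        # Offline ingestion: embed once, store alongside the chunk text.
        self._items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Query time: embed the question and return the k closest chunks.
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]
```

The chunks returned by `search` are what gets passed to the LLM, so the model never sees the full corpus.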
Pattern 4: Sequential multimodal agents
An agent that can dynamically invoke different modality-processing tools as needed:
User: "What's in this image, and find me similar products?"
                ↓
     [LLM with tool access]
       ↓                ↓
[Analyze image]  [Search catalog]
       ↓                ↓
     [Combine results] → [Response]
When to use: Complex tasks that require dynamic combinations of modalities. When you don’t know at request time which processing will be needed. When the task requires iteration (analyze image → search → compare → answer).
Complexity warning: Agent patterns are powerful but harder to debug and test. Start with simpler patterns and graduate to agents when simpler patterns hit real limits.
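To make the control flow concrete, here is a minimal sketch of a tool loop with a step budget. The tools return canned strings and `scripted_planner` is a hard-coded stand-in for the model's tool choice; all names are hypothetical. The point is the loop structure and the `max_steps` guard, which keeps a misbehaving agent from running indefinitely.

```python
from typing import Callable, Optional


# Hypothetical tools; real ones would call a vision model and a catalog search API.
def analyze_image(arg: str) -> str:
    return "red running shoe"


def search_catalog(arg: str) -> str:
    return f"3 products matching '{arg}'"


TOOLS: dict[str, Callable[[str], str]] = {
    "analyze_image": analyze_image,
    "search_catalog": search_catalog,
}

History = list[tuple[str, str]]


def scripted_planner(history: History) -> Optional[tuple[str, str]]:
    """Stand-in for the LLM: returns (tool, argument), or None when finished."""
    if not history:
        return ("analyze_image", "upload.jpg")
    if len(history) == 1:
        # Feed the image description into the catalog search.
        return ("search_catalog", history[0][1])
    return None


def run_agent(planner: Callable[[History], Optional[tuple[str, str]]],
              max_steps: int = 5) -> History:
    """Call tools until the planner stops or the step budget runs out."""
    history: History = []
    for _ in range(max_steps):
        action = planner(history)
        if action is None:
            break
        tool, arg = action
        history.append((tool, TOOLS[tool](arg)))
    return history
```

Recording every `(tool, result)` pair in `history` also gives you a trace to inspect when a run goes wrong.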
Input handling: the unglamorous foundation
The difference between a demo and a production application is mostly input handling. Real users send:
- Images that are corrupted, blank, or in unexpected formats
- PDFs that are scanned at 72 DPI, password-protected, or malformed
- Audio files in MP3, OGG, FLAC, and other formats
- Images that are 10MB when your limit is 1MB
- Images that contain no relevant content whatsoever
Image handling
import io

from PIL import Image


def normalize_image(raw_bytes: bytes, max_dim: int = 1568) -> bytes:
    """Normalize image for LLM processing."""
    img = Image.open(io.BytesIO(raw_bytes))
    # Convert to RGB (handles RGBA, L, etc.); JPEG cannot store an alpha channel
    if img.mode != 'RGB':
        img = img.convert('RGB')
    # Resize if too large (some vision APIs recommend at most 1568px on the longest side)
    if max(img.size) > max_dim:
        ratio = max_dim / max(img.size)
        # Clamp to 1px so extreme aspect ratios never produce a zero dimension
        new_size = (max(1, int(img.width * ratio)), max(1, int(img.height * ratio)))
        img = img.resize(new_size, Image.LANCZOS)
    # Convert to JPEG for consistent format and smaller size
    output = io.BytesIO()
    img.save(output, format='JPEG', quality=85)
    return output.getvalue()
Document handling
For PDFs:
- First attempt: text extraction (pdfplumber, pymupdf). Fast and accurate for text-native PDFs.
- Fallback: render pages to images and use a vision model. Handles scanned documents and preserves layout.
- Detect which path is appropriate: if extracted text is very short relative to page count, assume the document is scanned and use the vision path.
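The detection step can be sketched as a pure function over the per-page text your extractor returns. The 100-characters-per-page threshold and the name `choose_pdf_path` are assumptions for illustration; calibrate the threshold on your own documents.

```python
from typing import Optional


def choose_pdf_path(page_texts: list[Optional[str]], min_chars_per_page: int = 100) -> str:
    """Return "text" when extraction looks usable, otherwise "vision".

    page_texts holds one extracted string per page. None entries are
    tolerated in case the extractor returns None for an empty page.
    """
    if not page_texts:
        return "vision"
    total_chars = sum(len((t or "").strip()) for t in page_texts)
    avg_chars = total_chars / len(page_texts)
    return "text" if avg_chars >= min_chars_per_page else "vision"
```

Keeping the decision separate from the extraction library makes it easy to unit-test against fixture PDFs of both kinds.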
Input validation
Validate before processing:
- File size limits (define and enforce)
- Format validation (not just extension — check magic bytes)
- Content safety checks before passing to expensive models
- Dimension limits for images
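A minimal sketch of the size and magic-byte checks follows. The file signatures are the standard ones for each format; the 10 MB limit and the function names are assumptions. Content safety and dimension checks are left out here.

```python
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB limit

# Leading bytes that identify each format, regardless of file extension.
SIGNATURES = {
    "image/jpeg": b"\xff\xd8\xff",
    "image/png": b"\x89PNG\r\n\x1a\n",
    "application/pdf": b"%PDF-",
}


def sniff_type(data: bytes) -> Optional[str]:
    """Detect the MIME type from magic bytes, or return None if unrecognized."""
    for mime, signature in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_upload(data: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Return the detected MIME type, or raise ValueError if the upload is rejected."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(f"upload is {len(data)} bytes, limit is {max_bytes}")
    mime = sniff_type(data)
    if mime is None:
        raise ValueError("unrecognized file signature")
    return mime
```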
Graceful degradation
When an input modality fails, your application should degrade gracefully:
import logging

logger = logging.getLogger(__name__)

# process_image and ImageProcessingError are application-defined
try:
    image_context = process_image(image_bytes)
except ImageProcessingError as e:
    logger.warning("Image processing failed: %s", e)
    # Continue with available context rather than failing the whole request
    image_context = "[Image could not be processed]"
Prompt design for multimodal
Multimodal prompts differ from text-only in important ways.
Make modality explicit: Don’t assume the model knows what to do with each input. “The user has provided an image of a product and a text description. Analyze both to…” is clearer than mixing modalities without context.
Reference modalities specifically: “Based on the image provided…” or “The receipt in the image shows…” grounds the model in the specific input.
Handle missing or poor quality inputs: Include instructions for when modalities are unavailable or low quality. “If the image is unclear or does not contain the expected content, note this and respond based on the text alone.”
Coordinate cross-modal analysis: When asking the model to synthesize across modalities, explicitly request that synthesis. “Compare what the user says in the transcript with what’s visible in the screenshots to identify any discrepancies.”
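These guidelines can be folded into a small prompt builder so the instructions always match the inputs actually present. This is a sketch: the function name, parameters, and exact wording are illustrative assumptions to adapt to your task.

```python
def build_prompt(task: str, has_image: bool, has_transcript: bool) -> str:
    """Compose a prompt that names each modality and covers its failure cases."""
    provided = []
    if has_image:
        provided.append("an image")
    if has_transcript:
        provided.append("an audio transcript")

    lines = []
    # Make modality explicit.
    if provided:
        lines.append("The user has provided " + " and ".join(provided) + ".")
    else:
        lines.append("The user has provided text only.")
    lines.append(task)
    # Handle missing or poor-quality inputs.
    if has_image:
        lines.append(
            "If the image is unclear or does not contain the expected content, "
            "note this and respond based on the other inputs alone."
        )
    # Coordinate cross-modal analysis only when both modalities exist.
    if has_image and has_transcript:
        lines.append(
            "Compare the transcript with what is visible in the image "
            "and list any discrepancies."
        )
    return "\n".join(lines)
```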
Cost management in multimodal applications
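Multimodal inputs, especially images, add significant token cost. How vision tokens are counted depends on the provider, and prices change, so check current documentation before budgeting:
- OpenAI (GPT-4 Turbo and GPT-4o, high-detail mode): 85 base tokens plus 170 tokens per 512×512 tile, counted after the image is downscaled
- A standard 1024×1024 image → downscaled to 768×768 → 4 tiles → 765 image tokens, or about $0.0077 per image at $10 per million input tokens
- Claude: images are priced by pixel count, roughly (width × height) / 750 tokens, so a 1000×1000 image is about 1,330 tokens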
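For budgeting, the tile-based and pixel-based schemes above can be estimated with a few lines of arithmetic. Treat this as a rough sketch: the constants (85, 170, 512, 768, 2048, 750) reflect provider documentation at one point in time and should be treated as assumptions, as should the choice not to upscale small images. The function names are hypothetical.

```python
import math


def tile_based_tokens(width: int, height: int) -> int:
    """Estimate tokens as 85 base + 170 per 512x512 tile, after downscaling."""
    # Step 1: fit the image inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768 px.
    # Assumption: images already smaller than this are not upscaled.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def pixel_based_tokens(width: int, height: int) -> int:
    """Estimate tokens as (width * height) / 750, rounded up."""
    return math.ceil(width * height / 750)


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at a given input price."""
    return tokens * usd_per_million_tokens / 1_000_000
```

Multiplying the per-image estimate by expected daily image volume gives a quick check on whether image costs will dominate your bill.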
At scale, image costs dominate. Mitigations:
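Right-size images before sending. Most vision tasks don’t benefit from full-resolution images. A 512×512 image often works about as well as a 2048×2048 one for general scene description, though text-heavy inputs such as receipts and screenshots may need more resolution to stay legible.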
Cache image processing results. If the same document is processed multiple times, cache the extracted content rather than re-running vision processing.
Route by modality. For applications where most requests are text-only, use a lighter (non-vision) model for text requests and only route to multimodal models when an image/document is present.
Reject clearly irrelevant images early. A quick content classifier that determines whether an image is relevant before passing it to an expensive model pays for itself on applications with high rates of irrelevant or accidental image uploads.
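The caching mitigation can be sketched by keying results on a hash of the input bytes, so the same file is only processed once no matter what it is named. `ContentCache` is a hypothetical in-memory example; a production system would more likely back it with a shared store and an expiry policy.

```python
import hashlib
from typing import Callable


class ContentCache:
    """Cache expensive extraction results, keyed by a SHA-256 of the raw bytes."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_compute(self, data: bytes, compute: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: run the expensive step (for example, a vision model call).
        result = compute(data)
        self._store[key] = result
        return result
```

Tracking `hits` gives a simple measure of how much vision spend the cache is saving.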
Testing multimodal applications
Testing is harder when inputs are images and audio. Standard approaches:
Build a fixture library. Collect representative inputs: common document formats, typical image types, edge cases (blank images, corrupted files, adversarial inputs). Version these fixtures alongside your code.
Automated quality checks. For structured extraction tasks, you can validate outputs against expected schemas and value ranges automatically. For open-ended descriptions, LLM-as-judge patterns work better than exact matching.
Cross-modal consistency tests. If your app processes both image and text describing the same content, they should produce consistent outputs. Deliberate consistency tests catch regressions.
Human evaluation for quality bar. At some point, automated metrics aren’t enough. Regular (weekly or monthly) manual review of a sample of outputs gives you calibration that automated evals miss.
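For the automated checks on structured extraction, a schema and value-range validator can be as small as the sketch below. The receipt fields, the allowed range, and the name `check_receipt` are hypothetical; substitute the schema your extraction task actually produces.

```python
def check_receipt(output: dict) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    problems = []
    # Expected fields and their types (hypothetical receipt schema).
    expected = {"merchant": str, "total": (int, float), "currency": str}
    for field, expected_type in expected.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Value-range check: catches hallucinated or misread totals.
    total = output.get("total")
    if isinstance(total, (int, float)) and not (0 < total < 100_000):
        problems.append("total out of range")
    return problems
```

Running this over every image in the fixture library on each change turns extraction regressions into failing tests.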
The maturity model
For teams building multimodal applications:
Level 1: Accept images alongside text, pass to multimodal model, return response. No special image handling.
Level 2: Format normalization, size limits, error handling. Basic cost monitoring.
Level 3: Modality routing, pre-processing pipelines, document handling, application-level caching.
Level 4: Retrieval-augmented multimodal (semantic search across indexed images/documents), quality evals, cost optimization per modality.
Most production applications should be at Level 3. Level 2 is the minimum for anything beyond a demo. Level 4 is for applications where multimodal capability is core to the product value.