MLLMs for Code and Visual Reasoning: When Models Read Diagrams, Screenshots, and Whiteboards
Multimodal LLMs can now look at a screenshot, diagram, or whiteboard sketch and generate working code or structured analysis. Here's what works, what doesn't, and how to build with it.
The most surprising capability of modern multimodal LLMs isn't understanding photographs. It's understanding structured visual information. Diagrams, wireframes, screenshots, architecture sketches, handwritten notes. Feed a whiteboard photo to Claude or GPT-4o and ask for working code. It works more often than you'd expect.
What MLLMs Can See
Modern vision-language models process images at resolutions up to 2048x2048 (sometimes higher with tiling) and can identify:
- UI elements: Buttons, text fields, dropdowns, navigation, layout structure
- Diagrams: Flowcharts, architecture diagrams, sequence diagrams, entity-relationship diagrams
- Handwriting: Legible handwritten notes, equations, pseudocode
- Code: Screenshots of code editors, terminal output, error messages
- Data visualization: Charts, graphs, tables in screenshots
The accuracy depends heavily on image quality, clarity, and how conventionally the content is structured.
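Because image quality matters this much, a cheap resolution check before sending anything to a model can save a wasted call. The following is a minimal sketch using only the standard library: it reads the width and height from a PNG header. The function names and the `MIN_WIDTH` threshold are illustrative assumptions, not values from any provider's documentation.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_WIDTH = 1280  # assumed threshold for illustration; tune for your own content


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR stores width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_probably_too_small(data: bytes, min_width: int = MIN_WIDTH) -> bool:
    """Flag screenshots narrower than the chosen minimum width."""
    width, _ = png_dimensions(data)
    return width < min_width
```

Only the first 24 bytes of the file are needed, so the check can run before the full image is base64-encoded.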
Screenshot to Code
The most immediately useful application: turning a screenshot or mockup into working code.
How It Works
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def screenshot_to_code(image_path: str, framework: str = "react") -> str:
    # Read the screenshot and base64-encode it for the API request
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    prompt = (
        f"Convert this UI screenshot into a working {framework} "
        "component. Use Tailwind CSS for styling. Match the layout, "
        "colors, and spacing as closely as possible. Include all "
        "visible text content."
    )

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    # The image block comes first, followed by the instruction
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",  # assumes a PNG screenshot
                        "data": image_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    # The generated component is returned as the first text block
    return response.content[0].text
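The function above hard-codes `image/png` as the media type, which is wrong for a JPEG or WebP mockup. One way to handle that, sketched here under the assumption that the file extension is trustworthy, is a small lookup table. The helper name `guess_media_type` and the set of extensions are illustrative; check your provider's documentation for the formats it accepts.

```python
from pathlib import Path

# Assumed set of image formats for illustration; adjust to what your provider accepts.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def guess_media_type(image_path: str) -> str:
    """Map a file extension to an image media type, or raise if unknown."""
    suffix = Path(image_path).suffix.lower()
    try:
        return MEDIA_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported image extension: {suffix!r}") from None
```

The returned string could then replace the literal `"image/png"` in the request's `source` block.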
What Works Well
- Marketing pages and landing pages: relatively flat layouts with clear visual hierarchy
- Form interfaces: the model recognizes input fields, labels, and buttons reliably
- Dashboard layouts: cards, grids, sidebar navigation
- Mobile app screens: standard mobile UI patterns
What Struggles
- Pixel-perfect reproduction: the model captures layout and intent, not exact pixels
- Complex interactivity: hover states, animations, drag-and-drop
- Custom components: the model defaults to standard UI patterns; highly custom designs get approximated
- Dense data tables: small text in screenshots may be misread
Production Tips
- High-resolution screenshots. The difference between a 720p and 1440p screenshot is dramatic for code quality.
- Annotate when possible. Red circles, arrows, or text annotations help the model focus.
- Specify the tech stack explicitly. "React with Tailwind" produces better results than "make a website."
- Iterate. The first generation is a starting point. Paste the code back with the screenshot and ask for corrections.
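The iteration tip can be sketched as a follow-up message that pairs the original screenshot with the previous attempt. This is a rough illustration that reuses the message shape from the example earlier in this article. The helper name `build_refinement_message` and the prompt wording are assumptions.

```python
def build_refinement_message(
    image_b64: str,
    previous_code: str,
    feedback: str,
    media_type: str = "image/png",
) -> dict:
    """Build a follow-up user message pairing the screenshot with the last attempt."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_b64,
                },
            },
            {
                "type": "text",
                "text": (
                    "Here is the code from the previous attempt:\n\n"
                    f"{previous_code}\n\n"
                    "Compare it against the screenshot and fix the following: "
                    f"{feedback}"
                ),
            },
        ],
    }
```

The returned dict can be passed as an element of the `messages` list on the next request. Keeping the screenshot in every round gives the model the original to compare against.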
Diagram Understanding
Architecture Diagrams
MLLMs can parse architecture diagrams and generate:
- Text descriptions of the system
- Infrastructure-as-code (Terraform, CloudFormation)
- API specifications
- Sequence diagrams in Mermaid/PlantUML
Input: Photo of whiteboard with boxes labeled "API Gateway",
"Auth Service", "User DB", arrows showing request flow
Prompt: "Describe this architecture and generate a Mermaid
sequence diagram for the main request flow."
The model handles standard diagramming conventions (boxes for services, arrows for data flow, cylinders for databases) well. Unusual or artistic diagram styles confuse it.
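Asking the model for raw Mermaid text works, but one alternative is to ask it for a structured list of interactions and render the diagram yourself, which makes the output easier to check. The sketch below assumes the model has already returned `(source, target, label)` tuples. The function name and the arrow labels in the usage example are hypothetical.

```python
def to_mermaid_sequence(steps: list[tuple[str, str, str]]) -> str:
    """Render (source, target, label) steps as Mermaid sequence diagram text."""
    aliases: dict[str, str] = {}
    for source, target, _ in steps:
        for name in (source, target):
            # Short aliases keep names with spaces usable as participant ids.
            aliases.setdefault(name, f"P{len(aliases) + 1}")

    lines = ["sequenceDiagram"]
    for name, alias in aliases.items():
        lines.append(f"    participant {alias} as {name}")
    for source, target, label in steps:
        lines.append(f"    {aliases[source]}->>{aliases[target]}: {label}")
    return "\n".join(lines)


# Illustrative labels; the whiteboard in the example only shows the boxes and arrows.
diagram = to_mermaid_sequence([
    ("API Gateway", "Auth Service", "validate token"),
    ("Auth Service", "User DB", "look up user"),
])
```

Participants appear in the order they are first mentioned, so the rendered diagram follows the request flow from left to right.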
Flowcharts to Code
Flowcharts map naturally to code. The model can follow decision diamonds, process boxes, and flow arrows to generate:
# From a flowchart image showing order processing logic:
def process_order(order):
    if not validate_inventory(order):
        return notify_customer("out_of_stock")
    if order.total > 1000:
        if not fraud_check(order):
            return flag_for_review(order)
    payment = process_payment(order)
    if payment.success:
        fulfill_order(order)
        send_confirmation(order)
    else:
        return retry_payment(order, max_attempts=3)
The accuracy is highest for simple, well-drawn flowcharts and degrades with complexity.
Entity-Relationship Diagrams
ERDs to SQL schema is a natural fit:
Input: ER diagram photo showing Users, Orders, Products with relationships
Output:
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10, 2) NOT NULL
);
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10, 2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW()
);
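A generated schema is worth a quick sanity check before it goes anywhere near a database. One simple check, sketched here with regular expressions, is that every `REFERENCES` clause points at a table the same DDL creates. The function name is illustrative, and the patterns assume unquoted table names without a schema prefix.

```python
import re

CREATE_PATTERN = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)", re.IGNORECASE
)
REFERENCES_PATTERN = re.compile(r"REFERENCES\s+(\w+)", re.IGNORECASE)


def undefined_references(ddl: str) -> set[str]:
    """Return tables named in REFERENCES clauses that the DDL never creates."""
    created = {name.lower() for name in CREATE_PATTERN.findall(ddl)}
    referenced = {name.lower() for name in REFERENCES_PATTERN.findall(ddl)}
    return referenced - created
```

An empty result means every foreign key target exists in the script. A non-empty result often indicates a box or relationship line that the model misread in the diagram.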
Error Message Analysis
An underappreciated use case: screenshot an error message (terminal output, browser console, stack trace) and ask the MLLM to diagnose it.
This works better than pasting text for several reasons:
- Terminal screenshots capture colors and formatting that indicate severity
- Browser devtools screenshots show the network tab, console, and DOM simultaneously
- Stack traces in IDEs include syntax highlighting that helps identify the relevant frames
Prompt: "Here's a screenshot of my terminal after running the deploy
script. Identify the root cause of the failure and suggest a fix."
Whiteboard to Specification
Perhaps the highest-leverage use case for teams: photograph a whiteboard after a planning session and generate a structured specification.
Prompt: "This whiteboard photo is from our sprint planning.
Extract all the user stories, acceptance criteria, and
technical notes. Format as a structured document I can
paste into our project tracker."
The model handles legible handwriting surprisingly well. Messy handwriting or overlapping notes still cause errors, but for reasonably neat whiteboard content, extraction accuracy is 80-90%.
Building Reliable Visual Reasoning Pipelines
For production use, add verification layers:
async def visual_reasoning_pipeline(image: bytes, task: str) -> dict:
    # `mllm` is a placeholder for whatever async model client wrapper you use
    # Step 1: Describe what the model sees
    description = await mllm.describe(image)
    # Step 2: Generate the output
    output = await mllm.generate(image, task)
    # Step 3: Self-verify
    verification = await mllm.verify(
        image, output,
        prompt="Compare this output against the original image. "
               "List any discrepancies."
    )
    # Step 4: Flag low-confidence results
    result = {"description": description, "output": output}
    if verification.has_discrepancies:
        result["confidence"] = "low"
        result["issues"] = verification.discrepancies
    return result
The self-verification step catches many errors. The model is often better at identifying mistakes in generated output than at getting it right the first time.
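The four steps above can be exercised end to end without a real model by swapping in a stub client. The sketch below is one way to do that. `StubMLLM`, `Verification`, `run_pipeline`, and the canned answers are all hypothetical stand-ins, and the `"high"` default confidence is an assumption added for the example.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Verification:
    """Result of the self-verification step."""
    discrepancies: list[str] = field(default_factory=list)

    @property
    def has_discrepancies(self) -> bool:
        return bool(self.discrepancies)


class StubMLLM:
    """Hypothetical stand-in for a real model client; returns canned answers."""

    async def describe(self, image: bytes) -> str:
        return "login form with two fields and a button"

    async def generate(self, image: bytes, task: str) -> str:
        return "<form>...</form>"

    async def verify(self, image: bytes, output: str, prompt: str) -> Verification:
        return Verification(discrepancies=["submit button label differs"])


async def run_pipeline(mllm: StubMLLM, image: bytes, task: str) -> dict:
    description = await mllm.describe(image)
    output = await mllm.generate(image, task)
    verification = await mllm.verify(
        image, output,
        prompt="Compare this output against the original image. "
               "List any discrepancies.",
    )
    return {
        "description": description,
        "output": output,
        # "high" is an assumed default for results that pass verification
        "confidence": "low" if verification.has_discrepancies else "high",
        "issues": verification.discrepancies,
    }


result = asyncio.run(run_pipeline(StubMLLM(), b"fake-image-bytes", "convert to HTML"))
```

Passing the client in as a parameter makes the flow easy to test: the same function runs against the stub in unit tests and against a real client in production.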
Limitations
- Spatial reasoning is approximate. The model understands relative positioning (left of, above, inside) but struggles with precise spatial relationships.
- Small text is unreliable. Anything below ~10px in a screenshot may be misread or ignored.
- Complex diagrams with many elements. Beyond ~20-30 distinct elements, the model starts dropping or confusing items.
- Handwriting quality matters enormously. Neat printing: 90%+ accuracy. Cursive or messy: 50-70%.
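These limits can be turned into a rough pre-flight check that warns before an image is sent. The sketch below hard-codes the approximate thresholds from the list above. `ImageProfile` and `preflight_warnings` are hypothetical names, and how you measure element count or text size is left to your own pipeline.

```python
from dataclasses import dataclass


@dataclass
class ImageProfile:
    """Rough facts about an input image, gathered however your pipeline can."""
    element_count: int       # distinct boxes, widgets, or nodes
    smallest_text_px: float  # height of the smallest text, in pixels
    handwriting: str         # "none", "neat", or "messy"


def preflight_warnings(profile: ImageProfile) -> list[str]:
    """Turn the rough limits listed above into human-readable warnings."""
    warnings = []
    if profile.smallest_text_px < 10:
        warnings.append("text below ~10px may be misread or ignored")
    if profile.element_count > 30:
        warnings.append("more than ~30 elements; items may be dropped or confused")
    if profile.handwriting == "messy":
        warnings.append("messy handwriting; expect transcription errors")
    return warnings
```

For example, `preflight_warnings(ImageProfile(45, 8, "messy"))` returns all three warnings, while a small, neatly printed diagram returns an empty list.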
The Practical Impact
Visual reasoning turns MLLMs from text tools into general-purpose understanding tools. The ability to point a model at a screenshot, diagram, or whiteboard and get structured output is genuinely new and genuinely useful.
The teams getting the most value are treating it as a first-draft tool: generate, verify, refine. The model does in seconds what would take minutes or hours of manual transcription or coding. The human provides the judgment that the model can't.