Benchmarking Multimodal LLMs: What to Measure and How

Evaluating multimodal LLMs is harder than evaluating text-only models. The model needs to understand images, text, sometimes video and audio, and reason across modalities. Standard NLP benchmarks don’t capture this. Here’s how to evaluate multimodal models meaningfully.

Why Standard Benchmarks Aren’t Enough

A model can score 90% on a vision-language benchmark and still fail at your specific use case. Reasons:

Benchmark contamination: Popular benchmarks may be in training data
Task mismatch: Benchmarks test general capabilities; you need specific ones
Modality gaps: Most benchmarks test image+text; fewer cover video, audio, or multi-image reasoning
Real-world messiness: Benchmarks use clean images; your inputs may be blurry phone photos

Use public benchmarks for initial filtering, then build custom evaluations for your use case.

The Major Public Benchmarks

Visual Question Answering

VQAv2: Classic benchmark. “What color is the car?” type questions. Saturated — top models score >85%. Useful as a baseline sanity check, not for differentiation.

OK-VQA: Requires outside knowledge beyond what’s in the image. “What year was this building constructed?” Tests the model’s ability to combine visual recognition with world knowledge.

TextVQA: Questions about text appearing in images (signs, documents, screens). Critical for document understanding use cases.

Visual Reasoning

MMMU: Multi-discipline multimodal understanding. College-level questions with diagrams, charts, and images from subjects like physics, chemistry, and art. The gold standard for testing expert-level multimodal reasoning.

MathVista: Mathematical reasoning with visual inputs — geometry problems, chart interpretation, scientific figures. Exposes models that can talk about images but can’t reason about them.

MMBench: Comprehensive evaluation across 20 ability dimensions including spatial reasoning, attribute recognition, and action prediction. Good for profiling a model’s strengths and weaknesses.

Real-World Tasks

RealWorldQA: Questions about real-world images — navigation, cooking, shopping, maintenance. Tests practical visual understanding rather than academic reasoning.

SEED-Bench: 19K questions across 12 evaluation dimensions including spatial understanding, instance interaction, and visual reasoning. One of the more comprehensive benchmarks.

MME: Perception and cognition benchmark with both yes/no questions and open-ended evaluation.

Video Understanding

Video-MME: Multi-modal evaluation for video models. Tests temporal understanding, event sequencing, and video-text alignment.

MVBench: 20 temporal understanding tasks from action recognition to event ordering.

Building Your Own Evaluation

Public benchmarks tell you about general capability. Custom evaluations tell you about your use case.

Step 1: Define Your Task

Be specific. Not “can the model understand images” but:

“Can the model extract line items from scanned receipts with >95% accuracy?”
“Can the model identify defects in circuit board photos within 500ms?”
“Can the model describe product images accurately enough for accessibility text?”

Step 2: Collect Evaluation Data

Aim for 100-500 examples that represent your production distribution:

# Example evaluation dataset structure
eval_examples = [
    {
        "id": "receipt_001",
        "image": "eval_images/receipt_001.jpg",
        "prompt": "Extract all line items with prices from this receipt.",
        "expected": {
            "items": [
                {"name": "Coffee", "price": 4.50},
                {"name": "Muffin", "price": 3.25},
            ],
            "total": 7.75
        },
        "difficulty": "easy",  # clean, well-lit receipt
    },
    {
        "id": "receipt_002",
        "image": "eval_images/receipt_002.jpg",
        "prompt": "Extract all line items with prices from this receipt.",
        "expected": { ... },
        "difficulty": "hard",  # crumpled, faded, partial
    },
]

Include edge cases:

Poor image quality (blur, low light, occlusion)
Unusual layouts or formats
Ambiguous cases where reasonable answers differ
Adversarial examples (misleading visual content)

Step 3: Choose Metrics

For extraction tasks:

Exact match accuracy
Field-level precision and recall
Edit distance for text extraction

For generation tasks (descriptions, answers):

Human evaluation (gold standard but expensive)
LLM-as-judge (use a strong model to rate outputs)
ROUGE/BLEU for text similarity (weak but cheap)
Task-specific metrics (factual accuracy, completeness)

For classification tasks:

Accuracy, precision, recall, F1
Confusion matrix to identify systematic errors
Per-class performance (models often fail on rare classes)

Step 4: Run Evaluations Systematically

import json
from datetime import datetime

def run_evaluation(model_name, eval_set, prompt_template):
    results = []
    for example in eval_set:
        response = call_model(model_name, example["image"], 
                            prompt_template.format(**example))
        score = compute_metrics(response, example["expected"])
        results.append({
            "id": example["id"],
            "model": model_name,
            "score": score,
            "difficulty": example["difficulty"],
            "response": response,
            "timestamp": datetime.now().isoformat(),
        })
    
    return {
        "model": model_name,
        "overall_score": sum(r["score"] for r in results) / len(results),
        "by_difficulty": aggregate_by(results, "difficulty"),
        "failures": [r for r in results if r["score"] < 0.5],
        "results": results,
    }

Step 5: Compare and Iterate

Run the same evaluation across multiple models and prompt variations. Track results over time — model updates can change performance.

Key comparisons:

Model A vs Model B on your task
Same model, different prompting strategies
Performance by difficulty level
Cost per correct answer (price × tokens / accuracy)

Evaluation Pitfalls

Leaking Answers in the Prompt

If your prompt contains hints about the expected answer, you’re testing prompt following, not visual understanding. Keep prompts neutral.

Over-Relying on Automated Metrics

For generation tasks, automated metrics correlate weakly with actual quality. Budget for human evaluation on at least a subset of your data.

Ignoring Latency and Cost

A model that’s 5% more accurate but 10x slower and 5x more expensive might be the wrong choice. Include operational metrics in your evaluation:

Model	Accuracy	P95 Latency	Cost/1K queries
GPT-4o	92.3%	3.2s	$8.50
Claude Sonnet	91.1%	2.8s	$6.20
Gemini 2.5 Pro	93.0%	4.1s	$7.80
Llama 3.3 90B	87.5%	1.9s	$2.10

Testing Only the Happy Path

If all your evaluation images are clean and well-lit, you’ll be surprised when production images aren’t. Include degraded, edge-case, and adversarial examples in every evaluation.

Continuous Evaluation

Don’t evaluate once and forget. Build evaluation into your pipeline:

Pre-deployment: Full evaluation on the test set before any model or prompt change
Shadow evaluation: Log production inputs and evaluate new models offline
Periodic re-evaluation: Weekly or monthly runs to catch drift (model updates, data changes)
User feedback loop: Track corrections and failures to expand your evaluation set

The models improve regularly. So should your evaluation.

Benchmarking Multimodal LLMs: What to Measure and How

Benchmarking Multimodal LLMs: What to Measure and How

Why Standard Benchmarks Aren’t Enough

The Major Public Benchmarks

Visual Question Answering

Visual Reasoning

Real-World Tasks

Video Understanding

Building Your Own Evaluation

Step 1: Define Your Task

Step 2: Collect Evaluation Data

Step 3: Choose Metrics

Step 4: Run Evaluations Systematically

Step 5: Compare and Iterate

Evaluation Pitfalls

Leaking Answers in the Prompt

Over-Relying on Automated Metrics

Ignoring Latency and Cost

Testing Only the Happy Path

Continuous Evaluation

Related reads

Stay ahead of the AI curve