Evaluating Language Models: Metrics, Benchmarks, and What Actually Matters
BLEU, ROUGE, perplexity, MMLU — the metrics used to evaluate language models are often misunderstood. This guide explains what each measures, when to use it, and why leaderboard scores don't tell the whole story.
View all nlp depths →Depth ladder for this topic:
“Our model scores 89.2 on MMLU.” Cool. But what does that mean for your application? Model evaluation is full of numbers that look precise but require context to interpret. Let’s build that context.
Automatic metrics for text generation
Perplexity
Measures how “surprised” the model is by a text. Lower perplexity = the model predicts the text better. Formally, it’s the exponentiated average negative log-likelihood of the tokens.
Good for: Comparing models trained on similar data. Tracking training progress. Bad for: Comparing models across different tokenizers or vocabularies. Judging output quality (a model can have low perplexity but generate boring, repetitive text).
BLEU (Bilingual Evaluation Understudy)
Counts n-gram overlaps between generated text and reference text. Originally designed for machine translation.
Good for: Machine translation, where there’s a clear reference translation. Bad for: Open-ended generation, summarization, or any task where many valid outputs exist. A perfectly good summary might share few n-grams with the reference and score poorly.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Similar to BLEU but focuses on recall (what fraction of reference n-grams appear in the generated text). ROUGE-L measures the longest common subsequence.
Good for: Summarization evaluation, where you want to check that key content from the reference is captured. Bad for: Same limitations as BLEU — penalizes valid paraphrases.
BERTScore
Uses BERT embeddings to compute semantic similarity between generated and reference text, rather than exact n-gram matches. Captures paraphrases better than BLEU/ROUGE.
Good for: Any generation task where meaning matters more than exact wording. Bad for: Still requires reference texts. Still a proxy for human judgment.
Benchmark suites
MMLU (Massive Multitask Language Understanding)
57 multiple-choice subjects from elementary math to professional law. Tests breadth of knowledge and reasoning.
What it tells you: How well the model performs on academic/knowledge tasks across domains. What it doesn’t: How well the model follows instructions, handles conversation, or produces useful outputs in practice.
HumanEval / MBPP
Code generation benchmarks. Given a function signature and docstring, generate the implementation. Evaluated by running test cases.
What it tells you: Raw coding ability on isolated function-level tasks. What it doesn’t: Ability to work with existing codebases, debug, or handle ambiguous requirements.
MT-Bench / Chatbot Arena
Multi-turn conversation benchmarks. MT-Bench uses GPT-4 as a judge to score responses. Chatbot Arena uses human preferences via blind comparisons.
What they tell you: Conversational quality as perceived by humans (Arena) or a strong model (MT-Bench). What they don’t: Performance on your specific use case.
GPQA, ARC, HellaSwag
Reasoning benchmarks testing graduate-level science questions (GPQA), grade-school science (ARC), and commonsense reasoning (HellaSwag).
The gap between benchmarks and reality
Benchmarks are useful for tracking progress and comparing models at a high level. But they have systemic limitations:
Benchmark contamination. Models may have seen benchmark questions during training. Scores improve without genuine capability improvement.
Narrow evaluation. Multiple-choice benchmarks test recognition, not generation. A model might identify the right answer from options but fail to produce it from scratch.
Missing dimensions. Benchmarks rarely test safety, instruction following, tool use, or long-context performance — capabilities that matter enormously in practice.
Overfitting to format. Models are increasingly optimized for benchmark formats. They may score well on MMLU-style multiple choice but struggle with the same knowledge in a different format.
What to do instead
For production applications, build task-specific evaluations:
- Collect representative inputs from your actual use case
- Define what “good” looks like for each input (reference answers, rubrics, or pass/fail criteria)
- Use LLM-as-judge for scalable evaluation — have a strong model score outputs against your rubric
- Include human evaluation for a subset — automated metrics calibrate against human judgment
- Track over time — Run your eval suite on every model change, prompt update, or provider switch
eval_rubric = """
Score the response on:
1. Accuracy (0-5): Is the information correct?
2. Completeness (0-5): Does it address all parts of the question?
3. Clarity (0-5): Is it well-written and easy to understand?
4. Relevance (0-5): Does it stay on topic?
"""
The bottom line
Public benchmarks answer: “Is this model generally capable?” Your custom evals answer: “Does this model work for my use case?” You need both, but the second matters more. A model that ranks #5 on MMLU might be #1 for your specific application. Test with your data, evaluate on your criteria, and trust your measurements over leaderboard positions.
Simplify
← Keyword Extraction and Topic Modeling: Making Sense of Large Text Collections
Go deeper
NLP for Legal Documents: Contract Analysis with AI →
Related reads
Stay ahead of the AI curve
Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.