Evaluating Language Models: Metrics, Benchmarks, and What Actually Matters

“Our model scores 89.2 on MMLU.” Cool. But what does that mean for your application? Model evaluation is full of numbers that look precise but require context to interpret. Let’s build that context.

Automatic metrics for text generation

Perplexity

Measures how “surprised” the model is by a text. Lower perplexity = the model predicts the text better. Formally, it’s the exponentiated average negative log-likelihood of the tokens.

Good for: Comparing models trained on similar data. Tracking training progress. Bad for: Comparing models across different tokenizers or vocabularies. Judging output quality (a model can have low perplexity but generate boring, repetitive text).

BLEU (Bilingual Evaluation Understudy)

Counts n-gram overlaps between generated text and reference text. Originally designed for machine translation.

Good for: Machine translation, where there’s a clear reference translation. Bad for: Open-ended generation, summarization, or any task where many valid outputs exist. A perfectly good summary might share few n-grams with the reference and score poorly.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Similar to BLEU but focuses on recall (what fraction of reference n-grams appear in the generated text). ROUGE-L measures the longest common subsequence.

Good for: Summarization evaluation, where you want to check that key content from the reference is captured. Bad for: Same limitations as BLEU — penalizes valid paraphrases.

BERTScore

Uses BERT embeddings to compute semantic similarity between generated and reference text, rather than exact n-gram matches. Captures paraphrases better than BLEU/ROUGE.

Good for: Any generation task where meaning matters more than exact wording. Bad for: Still requires reference texts. Still a proxy for human judgment.

Benchmark suites

MMLU (Massive Multitask Language Understanding)

57 multiple-choice subjects from elementary math to professional law. Tests breadth of knowledge and reasoning.

What it tells you: How well the model performs on academic/knowledge tasks across domains. What it doesn’t: How well the model follows instructions, handles conversation, or produces useful outputs in practice.

HumanEval / MBPP

Code generation benchmarks. Given a function signature and docstring, generate the implementation. Evaluated by running test cases.

What it tells you: Raw coding ability on isolated function-level tasks. What it doesn’t: Ability to work with existing codebases, debug, or handle ambiguous requirements.

MT-Bench / Chatbot Arena

Multi-turn conversation benchmarks. MT-Bench uses GPT-4 as a judge to score responses. Chatbot Arena uses human preferences via blind comparisons.

What they tell you: Conversational quality as perceived by humans (Arena) or a strong model (MT-Bench). What they don’t: Performance on your specific use case.

GPQA, ARC, HellaSwag

Reasoning benchmarks testing graduate-level science questions (GPQA), grade-school science (ARC), and commonsense reasoning (HellaSwag).

The gap between benchmarks and reality

Benchmarks are useful for tracking progress and comparing models at a high level. But they have systemic limitations:

Benchmark contamination. Models may have seen benchmark questions during training. Scores improve without genuine capability improvement.

Narrow evaluation. Multiple-choice benchmarks test recognition, not generation. A model might identify the right answer from options but fail to produce it from scratch.

Missing dimensions. Benchmarks rarely test safety, instruction following, tool use, or long-context performance — capabilities that matter enormously in practice.

Overfitting to format. Models are increasingly optimized for benchmark formats. They may score well on MMLU-style multiple choice but struggle with the same knowledge in a different format.

What to do instead

For production applications, build task-specific evaluations:

Collect representative inputs from your actual use case
Define what “good” looks like for each input (reference answers, rubrics, or pass/fail criteria)
Use LLM-as-judge for scalable evaluation — have a strong model score outputs against your rubric
Include human evaluation for a subset — automated metrics calibrate against human judgment
Track over time — Run your eval suite on every model change, prompt update, or provider switch

eval_rubric = """
Score the response on:
1. Accuracy (0-5): Is the information correct?
2. Completeness (0-5): Does it address all parts of the question?
3. Clarity (0-5): Is it well-written and easy to understand?
4. Relevance (0-5): Does it stay on topic?
"""

The bottom line

Public benchmarks answer: “Is this model generally capable?” Your custom evals answer: “Does this model work for my use case?” You need both, but the second matters more. A model that ranks #5 on MMLU might be #1 for your specific application. Test with your data, evaluate on your criteria, and trust your measurements over leaderboard positions.