Evaluating Text Generation: Metrics, Methods, and What Actually Works
How do you measure whether generated text is good? BLEU and ROUGE have known flaws. LLM-as-judge is promising but imperfect. This guide covers the full evaluation landscape for text generation.
Evaluating generated text is one of the hardest problems in NLP. Unlike classification (accuracy is clear) or regression (MSE is obvious), text quality is subjective, multidimensional, and context-dependent. A summary can be factually correct but boring. A translation can be fluent but miss the meaning. Here’s how the field navigates this.
Traditional Metrics (and Their Limits)
BLEU (Bilingual Evaluation Understudy)
Originally designed for machine translation. Measures n-gram overlap between generated text and reference text.
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat']

# The default 4-gram BLEU collapses toward 0 here (the candidate shares
# no 4-grams with the reference), so use bigram weights for a sentence
# this short:
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))  # ~0.71
Problems:
- Penalizes valid paraphrases (“the feline rested on the rug” scores 0)
- Doesn’t measure fluency or coherence
- Requires reference text, which may not exist or may not be unique
- Correlates poorly with human judgment for longer text
Still useful for: Machine translation benchmarks, tracking relative improvements during training.
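The paraphrase problem is easy to see in a from-scratch sketch of BLEU's modified n-gram precision (brevity penalty omitted, since all candidates here match the reference length; this is an illustrative toy, not the official implementation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    # Clip each candidate n-gram's count by its count in the reference
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(reference, candidate, max_n=4):
    ps = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:  # one zero n-gram precision zeroes the geometric mean
        return 0.0
    return math.exp(sum(math.log(p) for p in ps) / max_n)

ref = "the cat sat on the mat".split()
print(bleu(ref, "the cat is on the mat".split(), max_n=2))        # ~0.71
print(bleu(ref, "the feline rested on the rug".split(), max_n=2))  # ~0.32
print(bleu(ref, "the feline rested on the rug".split()))           # 0.0
```

The last line is the core complaint: a perfectly valid paraphrase scores 0 under standard 4-gram BLEU because nothing overlaps at the trigram level or above.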
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Measures overlap between generated and reference summaries. ROUGE-L (longest common subsequence) is most common.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
scores = scorer.score(
    "The president signed the bill into law on Tuesday",  # reference (target)
    "On Tuesday the president signed the legislation",    # candidate (prediction)
)
# rougeL: precision≈0.57, recall≈0.44, fmeasure=0.50
Problems: Same as BLEU — it’s a surface-level text overlap metric. A generated summary could have high ROUGE by copying sentences from the source without actually summarizing.
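ROUGE-L reduces to a longest-common-subsequence computation; here is a minimal sketch (no stemming, whitespace tokenization):

```python
def lcs_len(a, b):
    # Classic dynamic-programming LCS table
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_len(ref, cand)
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

print(rouge_l("The president signed the bill into law on Tuesday",
              "On Tuesday the president signed the legislation"))  # 0.5
```

Note that `rouge_l(s, s) == 1.0` for any sentence: verbatim copying scores perfectly, which is exactly why a high ROUGE score does not prove real summarization happened.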
BERTScore
Uses BERT embeddings to compare semantic similarity rather than surface-level overlap:
from bert_score import score

P, R, F1 = score(
    ["The cat sat on the mat"],
    ["A feline rested on the rug"],
    lang="en",
)
# F1 ≈ 0.92 (vs. a near-zero BLEU). Note: raw BERTScore values cluster
# near 1.0; pass rescale_with_baseline=True for a more interpretable scale.
Better because: Captures paraphrases and semantic equivalence. Still limited: Doesn’t evaluate factual correctness, coherence, or task-specific quality.
LLM-as-Judge
The most significant evaluation development of the past two years. Use a strong LLM to evaluate generated text:
def evaluate_summary(source: str, summary: str) -> dict:
    prompt = f"""Evaluate this summary on a 1-5 scale for each criterion:

SOURCE: {source}

SUMMARY: {summary}

Criteria:
1. Faithfulness: Does the summary only contain information from the source?
2. Coverage: Does it capture the key points?
3. Coherence: Is it well-organized and readable?
4. Conciseness: Is it appropriately brief without losing meaning?

Return JSON: {{"faithfulness": n, "coverage": n, "coherence": n, "conciseness": n, "reasoning": "..."}}"""
    return call_llm(prompt, model="gpt-4o")
Why It Works
- Correlates highly with human judgment (often >0.8 Spearman correlation)
- Evaluates nuanced quality dimensions that metrics can’t
- Scales without human annotators
- Can be customized to specific evaluation criteria
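To verify that correlation claim on your own task, score a small human-labeled set with the judge and compute Spearman correlation between the two rankings. A tie-free sketch (the score lists below are made up for illustration):

```python
def spearman(a, b):
    """Spearman rank correlation; assumes no tied values, for simplicity."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for pos, i in enumerate(order):
            r[i] = pos
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human_scores = [4, 2, 5, 1, 3]  # hypothetical human ratings
judge_scores = [5, 2, 4, 1, 3]  # hypothetical LLM-judge ratings
print(spearman(human_scores, judge_scores))  # 0.9
```

Real 1-5 ratings are full of ties, so in practice use `scipy.stats.spearmanr`, which handles them with average ranks.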
Why It’s Imperfect
Position bias: LLM judges tend to prefer the first option in pairwise comparisons. Mitigate by randomizing order and averaging.
Self-preference bias: Models tend to rate their own outputs higher than other models’ outputs. Use a different model for judging than for generating.
Verbosity bias: Longer responses tend to get higher scores regardless of quality. Explicitly instruct the judge not to reward length.
Inconsistency: The same judge can give different scores to the same text across runs. Average multiple evaluations or use lower temperature.
Best Practices
def robust_llm_evaluation(text_a, text_b, criteria):
    scores = []
    # Run both orderings to mitigate position bias
    for ordering in [(text_a, text_b), (text_b, text_a)]:
        score = judge_model.evaluate(*ordering, criteria=criteria)
        scores.append(score)
    # Average across orderings
    return aggregate(scores)
Task-Specific Evaluation
Summarization
- Faithfulness (factual consistency with source): Most important. Use NLI-based metrics or LLM-as-judge.
- Coverage: Does the summary capture all key points? Compare against multiple reference summaries.
- Compression ratio: Is the summary actually shorter? (Surprisingly, some models produce “summaries” longer than the input.)
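The compression check is trivial to automate. A word-level sketch (the 0.5 threshold is an illustrative choice, not a standard):

```python
def compression_ratio(source: str, summary: str) -> float:
    return len(summary.split()) / max(len(source.split()), 1)

def is_compressive(source: str, summary: str, max_ratio: float = 0.5) -> bool:
    # Flag "summaries" that fail to meaningfully shrink the input
    return compression_ratio(source, summary) <= max_ratio

doc = "The quarterly report showed revenue grew twelve percent while costs held flat."
print(is_compressive(doc, "Revenue grew 12%; costs flat."))  # True
print(is_compressive(doc, doc + " In summary, things went well."))  # False
```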
Translation
- COMET: Learned metric trained on human quality judgments. State-of-the-art correlation with human evaluation.
- chrF: Character-level F-score. Better than BLEU for morphologically rich languages.
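chrF itself is simple enough to sketch from scratch: character n-gram precision and recall averaged over orders 1-6, combined with a recall-weighted F-beta (β=2). This is a rough approximation of the sacreBLEU implementation, not a replacement for it:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(reference: str, candidate: str, max_n: int = 6, beta: float = 2.0) -> float:
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref_ng, cand_ng = char_ngrams(reference, n), char_ngrams(candidate, n)
        if not ref_ng or not cand_ng:
            continue  # strings shorter than n contribute nothing
        overlap = sum((ref_ng & cand_ng).values())  # clipped n-gram matches
        precisions.append(overlap / sum(cand_ng.values()))
        recalls.append(overlap / sum(ref_ng.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if p + r else 0.0

print(chrf("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Because it matches characters rather than whole words, chrF gives partial credit when inflected forms share a stem, which is why it holds up better than BLEU on morphologically rich languages.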
Creative Writing
Hardest to evaluate. No reference text exists. Options:
- Human evaluation panels with rubrics
- LLM-as-judge with creativity-specific criteria
- Reader engagement metrics (if deployed)
Code Generation
The easiest: does the code pass the test cases?
def evaluate_code(generated_code: str, test_cases: list) -> dict:
    passed = 0
    for test in test_cases:
        try:
            # NOTE: exec-ing untrusted model output is unsafe; sandbox it in production
            exec(generated_code + "\n" + test['assertion'])
            passed += 1
        except Exception:
            pass  # a failed assertion or runtime error counts as a fail
    return {"pass_rate": passed / len(test_cases)}
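Code benchmarks typically report pass@k: the probability that at least one of k sampled completions passes. The standard unbiased estimator (popularized by the Codex paper), given n samples of which c passed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c successes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))
```

Computing pass@k this way, rather than naively averaging over random subsets, keeps the estimate stable even with modest sample counts.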
Building an Evaluation Pipeline
A production-grade evaluation pipeline combines multiple approaches:
class TextEvaluationPipeline:
    def evaluate(self, generated: str, reference: str = None,
                 source: str = None, target_range: tuple = None):
        results = {}

        # Automated metrics (fast, cheap)
        if reference:
            results['bertscore'] = compute_bertscore(generated, reference)
            results['rouge'] = compute_rouge(generated, reference)

        # LLM-as-judge (slower, more nuanced)
        results['llm_judge'] = llm_evaluate(
            generated, source=source,
            criteria=['faithfulness', 'coherence', 'helpfulness'],
        )

        # Task-specific checks
        results['length_appropriate'] = check_length(generated, target_range)
        results['no_hallucinations'] = check_factuality(generated, source)

        return results
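In practice a pipeline like this feeds a simple gate in CI: each metric gets a minimum threshold, and a regression on any of them blocks the release. A sketch (metric names and thresholds are illustrative):

```python
def release_gate(results: dict, thresholds: dict) -> tuple:
    """Return (passed, failures) comparing metric results to minimum thresholds."""
    failures = {k: results.get(k, 0.0) for k, v in thresholds.items()
                if results.get(k, 0.0) < v}
    return (not failures, failures)

ok, failed = release_gate(
    {"bertscore": 0.91, "llm_judge_faithfulness": 3.6},
    {"bertscore": 0.85, "llm_judge_faithfulness": 4.0},
)
print(ok, failed)  # False {'llm_judge_faithfulness': 3.6}
```

Returning the failing metrics, not just a boolean, makes the gate's output actionable when a run is blocked.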
The Uncomfortable Truth
No single metric captures text quality. The best evaluation combines:
- Automated metrics for fast feedback during development
- LLM-as-judge for nuanced quality assessment at scale
- Human evaluation for final validation and calibrating your automated methods
The teams that build the best text generation systems are the ones that invest the most in evaluation — not in model architecture.