
Prompting With Evaluation Rubrics

One of the best prompting upgrades is telling the model what β€œgood” means. Here's how to use evaluation rubrics to produce stronger outputs and more consistent review.

Many prompt failures are not really prompt failures. They are specification failures.

You asked for a summary, got a summary, and still disliked the result. Why? Usually because you never defined what β€œgood” meant. The model filled in that gap with generic assumptions.

An evaluation rubric fixes that.

What a rubric does

A rubric is a short set of criteria the model should optimize against when producing or revising output.

Instead of saying:

Write a strong executive summary.

you say:

Write a 150-word executive summary. Good output should:

  1. State the decision in the first sentence
  2. Include one concrete number
  3. Name the main risk
  4. Avoid jargon and filler

That small change usually improves quality immediately.
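
If it helps to see that wired into code, here is a minimal sketch of sending the rubric-driven prompt through the OpenAI Python SDK. The model name and the report placeholder are illustrative assumptions, not requirements; any chat-capable model and client would work the same way.

  # A minimal sketch, assuming the OpenAI Python SDK (v1) with an
  # OPENAI_API_KEY set in the environment.
  from openai import OpenAI

  client = OpenAI()

  prompt = """Write a 150-word executive summary of the report below.
  Good output should:
  1. State the decision in the first sentence
  2. Include one concrete number
  3. Name the main risk
  4. Avoid jargon and filler

  Report:
  {report}"""

  response = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder; substitute your model
      messages=[{"role": "user", "content": prompt.format(report="...")}],
  )
  print(response.choices[0].message.content)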

Why rubrics work

Models are pattern-completion engines. If your definition of quality is vague, they default to the average pattern they saw in training. That often means polished but generic language.

Rubrics sharpen the target.

They help with:

  • prioritization
  • tone control
  • output consistency
  • easier review by humans

The best places to use them

Rubrics are especially valuable for:

  • summaries
  • outreach drafts
  • support replies
  • research synthesis
  • content editing
  • data extraction review

Any task where humans routinely say β€œthis is close, but not quite right” is a good candidate.

A practical rubric template

Keep it short. Usually 3 to 6 criteria are enough.

Good rubric categories include:

  • correctness
  • completeness
  • tone
  • concision
  • audience fit
  • use of evidence

Example:

Draft a customer follow-up email. Good output should:

  • open with the next step, not background
  • stay under 120 words
  • sound calm and direct
  • include the deadline
  • avoid blame language

That is much easier for a model to execute than β€œmake this better.”
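
When the same rubric is reused across many requests, it can live in code next to the task. Below is one possible helper, a sketch using only the standard library; every name in it is illustrative rather than taken from any real codebase.

  # A small helper that pairs a task with its rubric so the criteria
  # travel with every prompt. All names are illustrative.
  def rubric_prompt(task, criteria, source_text=""):
      """Build a prompt stating the task and the rubric it is judged by."""
      lines = [task, "", "Good output should:"]
      lines += ["- " + criterion for criterion in criteria]
      if source_text:
          lines += ["", "Input:", source_text]
      return "\n".join(lines)

  FOLLOW_UP_EMAIL = [
      "open with the next step, not background",
      "stay under 120 words",
      "sound calm and direct",
      "include the deadline",
      "avoid blame language",
  ]

  print(rubric_prompt("Draft a customer follow-up email.", FOLLOW_UP_EMAIL))

Because the criteria sit in one numbered list, a revision request can point at them by position, which is exactly what the next section relies on.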

Rubrics also improve iteration

When an output misses the mark, you can revise against the rubric:

  • β€œCriterion 2 is weak. Add a concrete example.”
  • β€œThis passes correctness but fails concision. Rewrite under 80 words.”

This makes multi-turn prompting much more efficient because the model knows exactly what to repair.
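
In code, that efficiency comes from keeping the rubric in the conversation history, so a follow-up turn only has to name the criterion to repair. Here is a sketch of the two-turn loop, again assuming the OpenAI Python SDK and a placeholder model name:

  from openai import OpenAI

  client = OpenAI()
  messages = [{"role": "user", "content": (
      "Draft a customer follow-up email. Good output should:\n"
      "1. open with the next step, not background\n"
      "2. stay under 120 words\n"
      "3. sound calm and direct\n"
      "4. include the deadline\n"
      "5. avoid blame language"
  )}]

  draft = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
  messages.append({"role": "assistant",
                   "content": draft.choices[0].message.content})

  # The repair request names a criterion instead of vague quality words.
  messages.append({"role": "user", "content": (
      "Criterion 2 fails: the draft runs long. Rewrite to stay under "
      "120 words without dropping the deadline."
  )})

  revision = client.chat.completions.create(model="gpt-4o-mini",
                                            messages=messages)
  print(revision.choices[0].message.content)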

The trap to avoid

Do not build 20-point rubrics unless the task truly requires it. Long scorecards dilute the signal, create conflicting tradeoffs, and make outputs feel mechanical.

A rubric should clarify the center of gravity, not simulate a corporate compliance checklist.

Bottom line

Prompting gets better when quality is explicit.

If you find yourself repeatedly editing AI output for the same reasons, stop rewriting and start defining a rubric. In practice, that is one of the fastest ways to turn a frustrating prompt into a reusable workflow.
