The saturation trap in comparative LLM evals

A comparative eval gate ("the new system must beat the old one on 60% of tasks") quietly stops working the moment your baseline saturates the rubric. Ours did. We built a head-to-head gate to decide whether a heavier multi-step pipeline earned its place against our shipped single-turn behavior, ran it, and got a verdict that looked like a clean rejection: a 7.1% win rate against a 60% bar. On this run the challenger was never worse on any task: it tied on 13 of 14 and won the one it did not tie. The number was not telling us the challenger had failed. It was telling us our tasks were too easy to separate the two systems, and we had almost mistaken one for the other.

That failure has a shape worth naming, because it is invisible until you look at the baseline's absolute score. We call it the saturation trap: when your baseline scores near the ceiling, a win-rate gate is no longer measuring the challenger at all.

The gate we built

The decision was concrete. A multi-step pipeline (plan, then execute, then verify) costs more to run than a single model turn, because it makes several model calls where the single turn makes one. So the bar was not "is it good" but "is it clearly better, enough to earn the extra cost." If it only matched the single turn, it was a cost increase with no quality increase and should not ship. (We are deliberately not publishing the pipeline's internal model composition or any per-token economics; the lesson here is about the eval, not our routing.)

We operationalized "clearly better" as two lines that both had to pass, because either alone is gameable:

Win rate at least 60%. The challenger has to win outright on at least 60% of tasks. Ties count against it: a tie at higher cost is a loss in product terms. We set the bar at 60% rather than 50% so a result had to survive judge noise on a small task set; an 8-of-14 squeak-through is one flipped judgment away from a coin toss.
Mean score delta strictly positive. The mean per-task score difference (challenger minus baseline) has to be above zero, so a few narrow wins cannot mask large regressions elsewhere.
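
The two gate lines above can be sketched in a few lines of Python. This is a minimal illustration, not our production scorer: the names evaluate_gate and GateResult are hypothetical, and it assumes each system has exactly one score in [0, 1] per task.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GateResult:
    win_rate: float
    mean_delta: float
    passed: bool


def evaluate_gate(baseline, challenger, min_win_rate=0.60):
    """Apply both gate lines to paired per-task scores (hypothetical helper)."""
    if len(baseline) != len(challenger) or not baseline:
        raise ValueError("need one score per task for each system")
    deltas = [c - b for b, c in zip(baseline, challenger)]
    # Only strict wins count; a tie is scored against the challenger.
    wins = sum(1 for d in deltas if d > 0)
    win_rate = wins / len(deltas)
    mean_delta = mean(deltas)
    # Both lines must pass: enough outright wins AND a strictly positive mean delta.
    passed = win_rate >= min_win_rate and mean_delta > 0
    return GateResult(win_rate, mean_delta, passed)


# Made-up scores, not from our run: three wins, one tie, one loss.
result = evaluate_gate([0.5, 0.6, 0.7, 1.0, 0.9], [0.8, 0.9, 0.9, 1.0, 0.6])
print(result.win_rate, result.passed)  # 0.6 True
```

Keeping the two lines as separate fields in the result, rather than collapsing them into one boolean, is what makes a disagreement between them visible later.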

The task set was 14 hero-flow tasks over a fully synthetic fixture workspace: a fictional 30-person SaaS processor with deliberately seeded defects (an outdated ISMS policy citing the withdrawn ISO/IEC 27001:2013, an access-control policy with a self-contradicting MFA rule, an incident-response stub, a partial processing inventory). No customer data. Each task carried a rubric of independently gradeable, binary criteria, scored by an LLM judge (claude-haiku-4-5, temperature 0, one JSON verdict per criterion). The judge only ever saw the released deliverables, never the pipeline's internal plan, so a plan that promised "include Art. 28(3)" could not earn credit the final document did not deliver.
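
For readers who want the scoring shape, here is a rough sketch of how per-criterion binary verdicts could roll up into a task score. The aggregation rule (weighted share of criteria met), the function name task_score, and the criterion names are all assumptions made for illustration; we are not reproducing our scorer here.

```python
def task_score(verdicts, weights=None):
    """Weighted share of binary criteria the judge marked as met (assumed rule)."""
    if not verdicts:
        raise ValueError("a task needs at least one criterion")
    if weights is None:
        # Default assumption: every criterion counts equally.
        weights = {name: 1.0 for name in verdicts}
    total = sum(weights[name] for name in verdicts)
    earned = sum(weights[name] for name, met in verdicts.items() if met)
    return earned / total


# Hypothetical criteria and verdicts for one task; the judge graded only
# the released deliverable, so each verdict reflects the final document.
verdicts = {
    "cites_current_iso_27001_edition": True,
    "flags_mfa_contradiction": True,
    "includes_art_28_3_terms": False,
    "stays_within_artifact_count": True,
}
print(task_score(verdicts))  # 0.75
```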

The run, and the two lines that disagreed

Here is the full result. Baseline is the single model turn (claude-opus-4-6); challenger is the multi-step pipeline, whose internal model composition we are holding back because it touches routing. That is fine here, because the lesson is architecture-level and does not depend on which model sits inside the pipeline. One run per system, 2026-06-11.

Measure	Value
Baseline mean score (14 tasks)	0.984
Challenger mean score (14 tasks)	1.000
Ties	13 of 14
Challenger wins	1
Challenger losses	0
Win rate (ties count against challenger)	7.1% (1/14)
Mean score delta (challenger minus baseline)	+0.016

Read the gate against those numbers and the two lines contradict each other. The mean-delta line passes: the challenger is, on average, very slightly better and never worse. The win-rate line fails catastrophically: 7.1% against a 60% bar. A single gate that says both "the challenger is at least as good on every task" and "the challenger loses 93% of the time" is not measuring the challenger. That contradiction is the signature of the saturation trap, and it is the tell to watch for.

The mechanism is arithmetic, not judgment. The baseline scored a perfect 1.0 on 13 of the 14 tasks. On a task where the baseline already scores 1.0, the best the challenger can do is tie, because there is no headroom above 1.0 to win. And ties, by our own design, count against it. So a rubric the baseline aces converts almost every task into a structural loss for the challenger, no matter how good the challenger is. Win rate had stopped being a property of the challenger and become a property of the task set: specifically, the share of tasks the baseline had left beatable. It had left one.
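
That ceiling can be computed before the challenger runs at all. The sketch below uses the baseline's per-task scores from this run (thirteen perfect scores and one 0.78); the function name max_reachable_win_rate is ours for illustration.

```python
def max_reachable_win_rate(baseline, ceiling=1.0):
    """Upper bound on the challenger's win rate when only strict wins count."""
    # A task is beatable only if the baseline left headroom below the ceiling.
    beatable = sum(1 for score in baseline if score < ceiling)
    return beatable / len(baseline)


# Baseline scores from the run: 13 tasks at 1.0, one task at 0.78.
baseline = [1.0] * 13 + [0.78]
bound = max_reachable_win_rate(baseline)
print(f"{bound:.1%}")  # 7.1%
print(bound >= 0.60)   # False
```

The challenger hit the bound exactly, which is another way of saying the win rate carried no information about it beyond "won the one task that could be won."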

The only place the baseline dropped a score

Look at the single task where the baseline did not score a perfect 1.0. It scored 0.78, and not because it could not do the work. It was a task where it did too much: asked to draft the missing GDPR documentation set with a rubric that specified three to five artifacts, the single turn produced seven. Every artifact was competent and correctly cited. Its only lost credit on the whole run was a weighted over-delivery penalty, for not stopping.

That is the whole lesson in one data point. When both systems can do the task, "can it do the task" is no longer the question the eval should be asking, because both answer yes and tie. The only failures left to grade are failures of restraint: doing too much, saying too much, citing too confidently, asserting things you were not asked to assert. Our first rubric measured capability, our baseline model saturated it, and the one signal that survived was a restraint failure that we very nearly had no task to catch.

"So ship the baseline"

The honest objection to all of this is that a baseline scoring 0.984 is telling you to ship the baseline and skip the expensive pipeline. If the cheap thing is that good, why keep measuring? It is a fair challenge and it is half right: on these tasks, the cheap thing was that good, and that is a real finding.

It is only half right because saturation on capability tasks says nothing about restraint tasks, and restraint is exactly where a multi-step pipeline (which can check its own work before releasing it) might actually pull ahead, or fall behind by compounding its own overreach. Our hero-flow tasks never probed it. The one that accidentally did (the over-delivery task) was the only one that moved. So the 0.984 was not evidence the two systems were equivalent. It was evidence our tasks could not tell them apart on the axis that was left. Saturation is a statement about your rubric, never a verdict of equivalence between your systems.

What we changed

We did not re-run the gate. Re-running a saturated eval cannot lift the win rate, because the challenger cannot score above the baseline's 1.0 on the tasks it already ties; a rerun just reproduces the same ceiling at more cost. Instead we hardened the task set, adding tasks whose rubrics a strong single turn can genuinely fail, each targeting a restraint axis the hero-flow tasks ignored:

Scope discipline: produce exactly three onboarding documents, no more. Grades the over-delivery reflex behind the baseline's only lost credit.
Audience-appropriate omission: a client-facing security email plus an internal cover note, where the workspace contains an internal-only partner memo. Grades whether the model keeps privileged content out of the outward-facing artifact.
Citation correctness: an internal audit plan that must cite the right standard for the claim and not a plausible-sounding wrong one.
Unverified-evidence handling: an audit status report where the audit is still in progress. Grades whether the model reports what is verified rather than asserting completion it cannot support.

The anti-gaming scaffolding that made this safe to iterate against is worth stating plainly, because a hardened task set is only as trustworthy as the discipline around it. The canonical task-id set is frozen and hashed (sha256 over the task ids, the rubrics, and the fixture documents, alongside the judge model and scorer version), and a gate reading is only accepted over that exact set: no subset, no extras. That makes it hard to quietly cherry-pick the tasks that produce a preferred verdict, and a stale or hand-edited report cannot silently gate a release.
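
A minimal sketch of that freeze-and-check step, using only the Python standard library. The manifest layout, the function names, and the choice to fold the judge model and scorer version into the same hash are assumptions for illustration; the sample ids and version strings are made up.

```python
import hashlib
import json


def manifest_hash(task_ids, rubrics, fixtures, judge_model, scorer_version):
    """sha256 over a canonical JSON manifest (hypothetical layout)."""
    manifest = {
        "task_ids": sorted(task_ids),
        "rubrics": rubrics,
        # Hash each fixture document's bytes so any edit changes the result.
        "fixtures": {name: hashlib.sha256(data).hexdigest()
                     for name, data in fixtures.items()},
        "judge_model": judge_model,
        "scorer_version": scorer_version,
    }
    # sort_keys plus fixed separators make the serialization deterministic.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def accept_reading(report_task_ids, report_hash, frozen_task_ids, frozen_hash):
    """Accept a gate reading only over the exact frozen set: no subset, no extras."""
    same_tasks = sorted(report_task_ids) == sorted(frozen_task_ids)
    return same_tasks and report_hash == frozen_hash


# Made-up example data.
rubrics = {"t1": ["criterion a"], "t2": ["criterion b"]}
fixtures = {"isms-policy.md": b"policy text"}
frozen = manifest_hash(["t1", "t2"], rubrics, fixtures, "judge-x", "scorer-1")
print(accept_reading(["t2", "t1"], frozen, ["t1", "t2"], frozen))  # True
print(accept_reading(["t1"], frozen, ["t1", "t2"], frozen))        # False
```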

The limits of this

One run per system, one judge model, 14 tasks. A single judge grading binary criteria has variance, which is the entire reason the win-rate bar sits at 60% and not 50%. The baseline also ran with the real production system prompt but without the memories, saved skills, and web search a live session can have, and every one of those gaps makes the baseline weaker than a real single turn, which makes the gate easier for the challenger, not harder. In other words, the real baseline is probably even more saturated than 0.984. These numbers are point-in-time as of 2026-06-11. We have not published a gate reading on the hardened task set; the point of this post is the failure the first run exposed, not a new verdict.
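
To put a rough number on the judge-noise argument, here is a back-of-envelope model, not an analysis we ran on our data. It assumes a challenger that is truly no better than the baseline, tasks that are independent, no ties, and a judge that hands each task to either side with probability 0.5. The function name is hypothetical.

```python
from math import comb


def chance_pass_probability(n_tasks, min_wins, p_win=0.5):
    """P(at least min_wins wins) if every task is an independent p_win coin flip."""
    return sum(
        comb(n_tasks, k) * p_win**k * (1 - p_win) ** (n_tasks - k)
        for k in range(min_wins, n_tasks + 1)
    )


# On 14 tasks a bare majority needs 8 wins; a 60% bar needs 9 (0.6 * 14 = 8.4).
print(round(chance_pass_probability(14, 8), 3))  # 0.395
print(round(chance_pass_probability(14, 9), 3))  # 0.212
```

Under those simplifying assumptions, moving the bar from a bare majority to 60% roughly halves the chance that pure noise clears it, though on 14 tasks it remains far from negligible.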

This is a different failure from a metric that stays green while the product breaks, which we wrote about separately. That was a number that lied. This is a number that told the truth (the baseline really is that strong on these tasks) and still could not answer the question we asked it. Both are ways a passing-looking eval can be worthless, and they call for opposite fixes: the first needs a better metric, the second needs harder tasks.

A checklist for comparative gates

Read the baseline's absolute score before you trust the win rate. If ties do not count as wins and the baseline sits at your ceiling on more than 40% of tasks, a 60% win-rate threshold is structurally unreachable. Fix the tasks, do not re-run.
Watch the tie rate. A pile of ties is a saturated rubric, not a real dead heat. If most tasks tie, your set has no headroom for either side to win.
Decide what ties mean before you read the number. When ties count against the challenger, a saturated baseline turns a better system into a failing verdict. That can be the right policy (ours was, on cost grounds), but know you are choosing it.
Treat contradicting gate lines as a diagnostic. If your mean-delta line passes while your win-rate line fails hard, that is not a mixed result to average away. It is saturation announcing itself.
Grade restraint, not just capability. Once both systems can do the task, the discriminating failures are over-delivery, audience leakage, citation overreach, and asserting unverified facts. Write tasks that a strong model fails by doing too much.
Design at least one task your baseline will lose. If nothing separates your systems, you have not measured them. You have measured your task set and found it too easy.
Freeze and hash the canonical set. So neither you nor a future release can quietly pick the tasks that produce the answer you wanted.
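
Several of the checks above are mechanical enough to run automatically before anyone reads the verdict. The sketch below is one possible shape; the function name and the cut-offs (0.95 for "near the ceiling", more than half the tasks for "mostly ties") are assumptions to tune, not values from our gate.

```python
def saturation_flags(baseline, challenger, ceiling=1.0,
                     min_win_rate=0.60, near_ceiling=0.95):
    """Return warning flags for a paired run (hypothetical thresholds)."""
    n = len(baseline)
    deltas = [c - b for b, c in zip(baseline, challenger)]
    wins = sum(1 for d in deltas if d > 0)
    ties = sum(1 for d in deltas if d == 0)
    mean_delta = sum(deltas) / n
    return {
        # The baseline's absolute score is close to the rubric ceiling.
        "baseline_near_ceiling": sum(baseline) / n >= near_ceiling * ceiling,
        # Most tasks tie, so there is little headroom for either side.
        "mostly_ties": ties / n > 0.5,
        # Mean delta passes while win rate fails: the saturation signature.
        "lines_contradict": mean_delta > 0 and wins / n < min_win_rate,
    }


# Per-task scores from the run described in this post.
flags = saturation_flags([1.0] * 13 + [0.78], [1.0] * 14)
print(flags)
# {'baseline_near_ceiling': True, 'mostly_ties': True, 'lines_contradict': True}
```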

The trap is not that the baseline was good. Good baselines are the goal. The trap is reading a win rate as a fact about the challenger when it had become a fact about the tasks. When your comparison saturates, the eval has finished telling you about your models and started telling you about your rubric. The useful move is to listen to that second message, and go write a task hard enough to make a strong model fail.

Figures above are from an internal head-to-head eval, run 2026-06-11, baseline claude-opus-4-6 (single turn) against a multi-step plan-execute-verify pipeline, judged by claude-haiku-4-5, over 14 tasks on a fully synthetic fixture workspace with no customer content. Point-in-time as of that date. Framework context: ISO/IEC 27001:2022, GDPR (Art. 28, Art. 30).
