Self-healing metrics: when a green eval lies

The most dangerous number in an evaluation is the one that stays green while the product breaks. We have one. On the exact run where our model's per-document output fell to roughly a third of its standalone depth, a parity metric scored a near-perfect 1.0. The metric was not noisy or miscalibrated. It was measuring the wrong thing, in a way the model could satisfy without doing the work. We call that a self-healing metric, and finding ours was the most useful thing our eval suite did that week.

What we were testing

A common task in our product is multi-document analysis: a user drops several policy documents in at once and asks the assistant to map each one against a framework's clauses, for example the Annex A controls of ISO/IEC 27001:2022. A customer told us that analyzing several documents in one request produced thinner results than analyzing each document on its own. Before we changed anything, we built an eval to reproduce and measure that gap (internal ticket ISM-549, baseline run 2026-06-04).

The setup was deliberately faithful to production. Model: GLM-4.7, served via OpenRouter, the same request path we ship. Fixtures: seven synthetic, anonymized policies (no customer content), shaped like the real scenario that triggered the complaint. The question the eval had to answer was simple: does each document get the same analytical depth when seven are analyzed together as when each is analyzed alone?

The first metric, and why it healed itself

The obvious way to score "did each document get analyzed" is to count the clause-mapping rows. Standalone analysis produces a table with a row per clause, so grouped analysis should too. We tried exactly that first.

Row parity came back at roughly 1.0. Given explicit numbered clauses and a "map each clause" instruction, GLM-4.7 dutifully keeps one table row per clause even when you hand it seven documents at once. By that metric, nothing was wrong. The eval was ready to pass.

The product was visibly worse. The rows were present; the analysis around them was gone. On its own, each clause row carried an explanation, the strengths, the gaps, and a short summary. Grouped, the model kept the skeleton and stripped the muscle: the same table shape, far less of the per-clause reasoning. The metric had healed itself. The output satisfied "one row per clause" while dropping much of the per-clause analysis, and a row count cannot tell a full analysis apart from a hollow one.
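To make the blind spot concrete, here is a minimal sketch of a row-count metric. It assumes the outputs are pipe-delimited tables; `count_rows`, `row_parity`, and the two sample strings are hypothetical illustrations written for this post, not our eval harness and not real model output.

```python
def count_rows(output: str) -> int:
    """Count data rows in a pipe-delimited table (assumed output format)."""
    table_lines = [ln for ln in output.splitlines() if ln.lstrip().startswith("|")]
    # The first two table lines are the header and the |---| separator.
    return max(len(table_lines) - 2, 0)


def row_parity(grouped: str, standalone: str) -> float:
    """Grouped row count divided by standalone row count."""
    expected = count_rows(standalone)
    return count_rows(grouped) / expected if expected else 0.0


# Made-up outputs: same table, but only the standalone one keeps the reasoning.
STANDALONE = """| Clause | Status |
|---|---|
| Clause 1 | Partial |
| Clause 2 | Met |

Clause 1: the policy names an owner but sets no review cadence, so the
control is only partly evidenced. Gap: no documented review interval.
Clause 2: roles are assigned and approved. Strength: named approver.
"""

GROUPED = """| Clause | Status |
|---|---|
| Clause 1 | Partial |
| Clause 2 | Met |
"""

print("row parity:", row_parity(GROUPED, STANDALONE))
print("character ratio:", round(len(GROUPED) / len(STANDALONE), 2))
```

Both strings contain the same two data rows, so the row count sees them as identical even though one has lost all of its per-clause reasoning.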

What a better metric showed

So we stopped scoring structure and started scoring substance. The primary metric became content-volume parity: grouped output characters divided by the sum of the standalone output characters, with a pass gate at 0.6. On the 2026-06-04 baseline (GLM-4.7, two samples):

Measure	Sample 1	Sample 2
Content-volume parity	0.34	0.35
Standalone characters	48,263	47,651
Grouped characters	16,201	16,609
Clause-mapping row parity	1.00	0.98
LLM judge: documents rated not diluted	0 of 7	0 of 7
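The parity figures can be reproduced from the character counts in the table. This is a minimal sketch, not our harness: the function name `content_volume_parity` is hypothetical, and it reads the "Standalone characters" row as the total summed over the seven standalone runs, which is how the metric is defined above.

```python
GATE = 0.6  # pass gate stated in the text


def content_volume_parity(grouped_chars: int, standalone_chars: int) -> float:
    """Grouped output characters divided by summed standalone output characters."""
    if standalone_chars <= 0:
        raise ValueError("standalone_chars must be positive")
    return grouped_chars / standalone_chars


# (summed standalone characters, grouped characters) from the baseline table
samples = {
    "sample_1": (48_263, 16_201),
    "sample_2": (47_651, 16_609),
}

for name, (standalone, grouped) in samples.items():
    parity = content_volume_parity(grouped, standalone)
    print(f"{name}: parity={parity:.2f} pass={parity >= GATE}")
# sample_1: parity=0.34 pass=False
# sample_2: parity=0.35 pass=False
```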

On the same 2026-06-04 baseline, a token-level depth proxy told the same story: about 2,747 output tokens for a single document, against about 595 tokens per document when the seven were grouped, a per-document ratio of 0.22.

Three details matter here. Content-volume parity came in at 0.34 and 0.35, well under the 0.6 gate. An independent LLM judge, asked only to rate each document's analysis as diluted or not, rated 0 of 7 documents undiluted in both samples. Row parity, the metric we nearly shipped behind, averaged 0.99 across the two samples (0.98 to 1.00). All three substance signals pointed the same way: content-volume parity at roughly a third, the token proxy lower still at 0.22 per document, and the judge rating every grouped document diluted. The one instrument that disagreed was the structural row count.

Why it happens, and the limits of what we know

Our reading of the data is that GLM-4.7 self-budgets its effort across the documents in a single decode: asked to do seven analyses in one response, it spreads a roughly fixed budget thin rather than producing seven full ones. We did not instrument the model's internals, so that is a description of the behavior we measured, not a proven mechanism, and it is specific to this model and request shape. Two scope facts are load-bearing. The regression appears at production scale: it shows up at seven documents, not at two, so a smoke test on a small input would have passed clean. And it did not reproduce when we ran the same fixtures against a frontier model on a direct API. The regression appeared only on the exact model and request path we evaluate and ship, which is the whole argument for evaluating the path you actually run instead of a convenient stand-in.
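One way to check for scale dependence is to compute parity at several group sizes rather than at one. The sketch below is an assumption about how that could be wired up, not our harness: `parity_by_group_size` is a hypothetical helper, and `toy_one` and `toy_group` are toy stand-ins with a hard-coded output cap, there only to make the example runnable. They do not model GLM-4.7 and their numbers are not measurements.

```python
from typing import Callable, Sequence


def parity_by_group_size(
    docs: Sequence[str],
    analyze_one: Callable[[str], str],
    analyze_group: Callable[[Sequence[str]], str],
    sizes: Sequence[int],
) -> dict[int, float]:
    """Content-volume parity for the first n documents, for each n in sizes."""
    standalone_len = [len(analyze_one(d)) for d in docs]
    result = {}
    for n in sizes:
        grouped_len = len(analyze_group(docs[:n]))
        result[n] = grouped_len / sum(standalone_len[:n])
    return result


def toy_one(doc: str) -> str:
    # Toy stand-in: every standalone analysis is 100 characters long.
    return "x" * 100


def toy_group(group: Sequence[str]) -> str:
    # Toy stand-in: grouped output is capped at 250 characters in total.
    return "x" * min(100 * len(group), 250)


docs = [f"policy-{i}" for i in range(1, 8)]
print(parity_by_group_size(docs, toy_one, toy_group, sizes=[2, 7]))
```

With a capped toy like this, the two-document group stays under the cap and the seven-document group does not, which is the shape of failure a small smoke test would miss.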

Why this is worse in compliance than elsewhere

In most products, "shallower output" is a quality nit. In our compliance workflows it is not cosmetic: a control-mapping table with every row present and no analysis looks complete in a screenshot while giving an auditor none of the per-clause reasoning they would need to rely on it. That is what makes a self-healing metric genuinely hazardous here. It was not just imprecise. It was green on precisely the failure mode that matters most in our domain, a confident-looking deliverable with the substance quietly removed.

The fix that follows from this is to give each document its own full-depth pass instead of folding all seven into one decode, and the eval flips to a pass once each document gets one. But the fix is the easy part. We could not have built it while our headline metric insisted nothing was broken. The metric change had to come first, because you cannot repair what you cannot see.
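In outline, a per-document pass might look like the sketch below. It is an illustration of the idea, not our implementation: `analyze_each` and `combine` are hypothetical names, and `analyze_one` stands for whatever call sends a single document through the production request path.

```python
from typing import Callable, Sequence


def analyze_each(docs: Sequence[str], analyze_one: Callable[[str], str]) -> list[str]:
    """Run one separate analysis call per document instead of one grouped call."""
    return [analyze_one(doc) for doc in docs]


def combine(analyses: Sequence[str]) -> str:
    """Join the per-document analyses into one response, labeled in input order."""
    return "\n\n".join(
        f"Document {i}\n{analysis}" for i, analysis in enumerate(analyses, start=1)
    )


# Usage with a placeholder analyzer; a real one would call the model.
report = combine(analyze_each(["policy a", "policy b"], str.upper))
print(report)
```

A production version would also need concurrency, retries, and cost limits, which are left out here.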

How to catch a self-healing metric

A self-healing metric is any metric a model can satisfy without doing the underlying work. Structural metrics are the most prone to it: row counts, item counts, "valid JSON", "all required fields present", section headers. They are cheap to compute and cheap to game, often by the model rather than by you. Before you trust one, run these checks:

Score substance, not just shape. If a metric only checks structure (rows, fields, headers), pair it with one that checks content (characters, tokens, or a judge rating thoroughness). Shape is necessary, never sufficient.
Add an independent instrument. A cheap LLM judge that rates quality and disagrees with your headline metric is the signal you want. Agreement is corroboration; a split means your structural metric is probably healing itself.
Test at production scale. A regression that only appears at seven documents will pass at two. Size your fixtures like the real input, not like a unit test.
Test the model and path you ship. Our dilution did not reproduce when we ran the same fixtures against a frontier model on a direct API. A proxy that is easier to run is a proxy that can miss the bug that only your production path has.
Ask what the cheapest passing output looks like. If a model could satisfy your metric while doing almost none of the real work, assume it eventually will. Design the metric so the cheapest way to pass is to actually do the task.
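The first two checks can be folded into a single verdict that treats "shape passes, substance fails" as its own outcome instead of averaging it away. This is a sketch under stated assumptions: `EvalSignals` and `verdict` are hypothetical names, the 0.95 shape gate is an illustrative value, and requiring every document to be rated undiluted is our assumption here. Only the 0.6 content gate comes from the eval described above.

```python
from dataclasses import dataclass


@dataclass
class EvalSignals:
    row_parity: float      # structural signal
    content_parity: float  # substance signal (content-volume parity)
    undiluted_docs: int    # documents the judge rated not diluted
    total_docs: int


def verdict(s: EvalSignals, shape_gate: float = 0.95, content_gate: float = 0.6) -> str:
    """Combine structural and substance signals; flag a split explicitly."""
    shape_ok = s.row_parity >= shape_gate
    substance_ok = s.content_parity >= content_gate and s.undiluted_docs == s.total_docs
    if shape_ok and substance_ok:
        return "pass"
    if shape_ok:
        # Structure is intact but substance is missing: the self-healing pattern.
        return "suspect: shape passes, substance fails"
    return "fail"


# Sample 1 from the baseline: row parity 1.00, content parity 0.34, judge 0 of 7.
print(verdict(EvalSignals(row_parity=1.00, content_parity=0.34,
                          undiluted_docs=0, total_docs=7)))
# suspect: shape passes, substance fails
```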

The eval did its job in the end, but only after we stopped trusting the flattering number. When a metric is green, the more useful question is not "are we passing" but "could the model pass this without doing the work." If the answer is yes, the metric is healing itself, and it will keep doing so right up until a customer is the one who notices.

Figures above are from our internal multi-document depth-parity baseline, run 2026-06-04 on GLM-4.7 via OpenRouter, and are point-in-time as of that date. Framework context: ISO/IEC 27001:2022, Annex A.
