LLM Evals: Build a Production Regression Suite

A team we worked with had 14 prompt files, 6 model versions tested, and a Google Doc titled “eval results — Q1.” Every prompt change required two days of “spot checking” in a notebook before going to production. Twice in six weeks, a prompt update degraded quality on a specific customer segment and was caught only by a support ticket.

This is the modal state of LLM evaluation in 2026: a notebook nobody runs, a spreadsheet nobody trusts, and production changes gated on vibes. It is also the single largest reason teams cannot ship AI improvements faster than once a month — and the single largest reason impressive AI pilots become shelfware once a real release cadence is required.

LLM evals are how you reverse this. Done right, an evals regression suite is the same thing for an LLM app that a unit test suite is for a backend: a gate that runs on every pull request, blocks regressions automatically, and turns “ship faster” from a wish into a property of the system. This guide walks through what to build, what to skip, and how to evolve it without turning evals into a second product that needs its own roadmap.

This is part of the broader question of why your AI experiments are failing. The honest answer is usually that there is no measurement layer between “the prompt looks good in dev” and “the customer is unhappy in prod.”

What an LLM Evals Regression Suite Actually Is

An LLM evals regression suite has four parts:

A golden dataset — a versioned set of representative inputs with labels, rubrics, or reference outputs.
A set of scorers — code that grades a model output. Some deterministic, some LLM-as-judge, some human.
A runner — the orchestration that runs the dataset through the system under test and produces a scored report.
A CI/CD gate — the rule that blocks a pull request from merging when scores regress past a threshold.

That is the whole pattern. Everything else — drift dashboards, A/B test reports, online evals — is built on this foundation. Teams that try to build the dashboards first end up with a metrics product and no quality system. Teams that build the four-part foundation first end up with a quality system that incidentally produces good dashboards.

The Golden Dataset Is the Whole Game

The dataset is where 80% of the value lives and 80% of the work happens. Most failed evals programs failed because the dataset was thin, biased, or unreviewed.

A production-quality golden dataset has these properties:

Representativeness. The distribution of inputs in the dataset mirrors the distribution of real production traffic. If 30% of your real queries are billing-related, 30% of your dataset is billing-related. You can only get this right with LLM tracing in production — sampling real traces is the only way to capture the true distribution.
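
One way to keep the representativeness property honest is to compare category shares mechanically. The sketch below assumes every dataset row and every sampled production trace already carries a category label; the function names, the 5-point tolerance, and the example categories are illustrative, not part of any particular tool.

```python
from collections import Counter

def category_shares(categories):
    """Return each category's share of the total as a fraction in [0, 1]."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def representativeness_gaps(dataset_categories, production_categories, tolerance=0.05):
    """Map each category to (dataset share - production share) when the gap exceeds `tolerance`."""
    ds = category_shares(dataset_categories)
    prod = category_shares(production_categories)
    gaps = {}
    for cat in sorted(set(ds) | set(prod)):
        diff = ds.get(cat, 0.0) - prod.get(cat, 0.0)
        if abs(diff) > tolerance:
            gaps[cat] = round(diff, 3)
    return gaps

# Hypothetical labels: 30% of sampled production traces are billing-related,
# but only 10% of the dataset is.
production = ["billing"] * 30 + ["shipping"] * 50 + ["account"] * 20
dataset = ["billing"] * 5 + ["shipping"] * 30 + ["account"] * 15
print(representativeness_gaps(dataset, production))
```

A negative value means the category is under-represented in the dataset relative to production traffic, which is the signal to go sample more traces of that kind.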

Difficulty curve. Not all examples are equal. A good dataset includes easy cases (the system must never regress on these), medium cases (where most movement happens), and hard cases (where you measure ceiling improvement). Without the easy tier, you cannot block silly regressions. Without the hard tier, you cannot measure progress.

Labels appropriate to the task. Reference outputs for tasks with verifiable answers (SQL generation, classification). Rubrics for open-ended tasks (summarization, customer responses). Pairwise preferences for tasks where “good” is comparative. Mixing label types in one dataset is fine; mixing label types under one scorer is not.

Version control. The dataset is a file in your repo (or a referenced artifact). Changes are reviewed. Additions are appended, not silently replaced. When a stakeholder asks “did this regression exist last quarter?” you can check out the old dataset and answer.

A documented contribution process. When a customer hits a novel failure mode, that failure becomes a row in the dataset. This is the single most valuable feedback loop in the entire system. If adding a row takes a half-day of process, no one does it, and the dataset goes stale.
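
The properties above can be made concrete with a small row schema. This is a minimal sketch, assuming the dataset is stored as JSONL with one example per line; the field names (`tier`, `label_type`, `reference`, `rubric`) and the example rows are hypothetical, and pairwise-preference rows are left out to keep it short.

```python
import json
from dataclasses import dataclass
from typing import Optional

TIERS = {"easy", "medium", "hard"}
# Which field each label type must populate. Pairwise rows would need
# their own fields; they are omitted from this sketch.
REQUIRED_LABEL_FIELD = {"reference": "reference", "rubric": "rubric"}

@dataclass(frozen=True)
class Example:
    id: str
    input: str
    tier: str
    label_type: str
    reference: Optional[str] = None
    rubric: Optional[str] = None

def parse_example(line: str) -> Example:
    """Parse one JSONL row and reject rows whose label does not match its label_type."""
    ex = Example(**json.loads(line))
    if ex.tier not in TIERS:
        raise ValueError(f"{ex.id}: unknown tier {ex.tier!r}")
    field = REQUIRED_LABEL_FIELD.get(ex.label_type)
    if field is None:
        raise ValueError(f"{ex.id}: unknown label_type {ex.label_type!r}")
    if getattr(ex, field) is None:
        raise ValueError(f"{ex.id}: label_type {ex.label_type!r} requires {field!r}")
    return ex

def group_by_label_type(examples):
    """Bucket examples so each scorer only ever sees one label type."""
    groups = {}
    for ex in examples:
        groups.setdefault(ex.label_type, []).append(ex)
    return groups

rows = [
    '{"id": "sql-001", "input": "Count active users", "tier": "easy", '
    '"label_type": "reference", "reference": "SELECT COUNT(*) FROM users WHERE active"}',
    '{"id": "sum-001", "input": "Summarize this ticket", "tier": "hard", '
    '"label_type": "rubric", "rubric": "Mentions the refund amount and the deadline"}',
]
examples = [parse_example(r) for r in rows]
groups = group_by_label_type(examples)
print({label: [e.id for e in exs] for label, exs in groups.items()})
```

Validating at parse time keeps the contribution process cheap: a malformed row fails in review rather than silently skewing a score later.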

Start with 50 examples, not 5,000

Most teams stall on dataset construction because they aim for a number that lets them avoid hard judgment calls. A reviewed, balanced 50-example dataset is more useful than a 5,000-example dataset scraped from logs and never opened. Ship with 50. Add 10 per week. Within a quarter you have close to 200 examples (50 plus about 13 weeks of 10) graded by people who understand the product, which is worth far more than 5,000 graded by no one.

Scorers: The Three Kinds You Actually Need

A common mistake is treating all scorers as if they were the same thing. They are not. Each has a job.

Deterministic scorers. Exact match, regex, JSON schema validation, SQL execution equivalence, presence/absence of required fields. These are fast, free, and unambiguous. They are also the only scorers you should trust without a second look. Every eval suite should have as many deterministic checks as the task allows — even open-ended tasks usually have some deterministic structure (length bounds, required citations, format compliance, banned phrases).
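
A few of the deterministic checks named above fit in a handful of lines. This is a sketch using only the standard library; the function names are illustrative, and the bracketed-number citation format is an assumption about what a "required citation" looks like in your outputs.

```python
import json
import re

def exact_match(output: str, reference: str) -> bool:
    """True when output equals the reference, ignoring surrounding whitespace."""
    return output.strip() == reference.strip()

def has_required_fields(output: str, required: set) -> bool:
    """True when output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())

def within_length(output: str, min_chars: int, max_chars: int) -> bool:
    """True when the output length falls inside the allowed bounds."""
    return min_chars <= len(output) <= max_chars

def no_banned_phrases(output: str, banned) -> bool:
    """True when none of the banned phrases appear (case-insensitive)."""
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in banned)

def has_citation(output: str) -> bool:
    """True when at least one citation like [1] appears (assumed format)."""
    return re.search(r"\[\d+\]", output) is not None

answer = "Your refund was issued on March 3 [1]."
print(within_length(answer, 10, 200), has_citation(answer),
      no_banned_phrases(answer, ["as an AI"]))
```

Each function returns a plain boolean, so a runner can average them per metric without any scorer-specific handling.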

Model-based scorers (LLM-as-judge). A model grades the output against a rubric. These are powerful, cheap, and fast — and they have systematic biases that will silently corrupt your decisions unless you measure and mitigate. We cover this in depth in LLM-as-judge: when it works and when it breaks. The short version: use them, calibrate them against a small human-labeled subset every release, and never use them as the only scorer for a high-stakes decision.
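
Calibration against a human-labeled subset can be as simple as measuring agreement. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between judge and human labels; the pass/fail labels are made-up illustration data, and which kappa value counts as "calibrated enough" is a decision this sketch does not make for you.

```python
def agreement_rate(judge, human):
    """Fraction of examples where the judge and the human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)

def cohens_kappa(judge, human):
    """Agreement between judge and human, corrected for agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined, treat as perfect.
        return 1.0
    return (observed - expected) / (1 - expected)

# Illustrative labels for a 10-example calibration subset.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

print(f"agreement={agreement_rate(judge, human):.2f}  kappa={cohens_kappa(judge, human):.2f}")
```

Tracking kappa per release, rather than raw agreement alone, makes it harder for a judge that says "pass" to everything to look well calibrated on an easy subset.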

Human scorers. Subject-matter experts grading a small sample. Slow, expensive, and the only source of ground truth. The right cadence is small and frequent: 30–50 examples per release, reviewed by the same one or two SMEs each time so their grading distribution is stable.

The production pattern: deterministic scorers run on every PR (fast, free, never wrong). LLM-as-judge scorers run on every PR (cheap, fast, calibrated). Human scorers run on a sample per release and on every “this should be a hill we die on” workflow. None of these alone is enough. Together they form a defensible quality signal.

CI/CD: Where Evals Become a Quality Gate

This is the step that separates “a team that does evals” from “a team that has evals.” The gate.

The contract is simple: every pull request that touches a prompt, a model version, a retrieval configuration, a tool schema, or any code in the agent loop triggers an evals run. The run executes the full golden dataset against the candidate change. The result is compared against the score on main. If the regression exceeds a per-metric threshold, the PR cannot merge until the author either fixes the regression or marks specific examples as accepted changes (with reviewer approval).
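
The contract above can be sketched as a single comparison function. This is an illustration under stated assumptions: scores are stored per metric and per example ID, thresholds are absolute drops in mean score, and "accepted changes" are modeled by excluding the approved example IDs from the comparison. The names and data are hypothetical.

```python
from statistics import mean

def failing_metrics(main, candidate, max_drop, accepted_ids=frozenset()):
    """Return {metric: drop} for every metric whose mean score fell by more than allowed.

    main / candidate: {metric: {example_id: score}}
    max_drop:         {metric: allowed absolute drop in mean score}
    accepted_ids:     example IDs a reviewer approved as intentional changes
    """
    failures = {}
    for metric, allowed in max_drop.items():
        ids = [i for i in main[metric] if i not in accepted_ids]
        drop = mean(main[metric][i] for i in ids) - mean(candidate[metric][i] for i in ids)
        if drop > allowed:
            failures[metric] = round(drop, 4)
    return failures

main_scores = {"correctness": {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.0}}
pr_scores = {"correctness": {"a": 1.0, "b": 0.0, "c": 1.0, "d": 0.0}}
thresholds = {"correctness": 0.02}

print(failing_metrics(main_scores, pr_scores, thresholds))                       # blocks the merge
print(failing_metrics(main_scores, pr_scores, thresholds, accepted_ids={"b"}))   # passes after approval
```

In CI, a non-empty result would translate to a non-zero exit code so the required status check fails.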

A few practical notes:

Run the full suite, not a sample. At dataset sizes of a few hundred, the full suite costs cents and runs in minutes. Sampling introduces variance that masks small regressions. Run everything.

Separate the cost gate. Quality regression is one signal. Cost regression is another. A prompt change that improves quality 1% and costs 40% more should still trigger a review, not a silent pass. Track tokens and dollars per example as first-class metrics.
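
A separate cost check might look like the sketch below. The 10% review threshold and the dollar figures are assumptions chosen to mirror the example in the text (quality up about 1%, cost up 40%); the function names are illustrative.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change from baseline to candidate (0.40 means 40% higher)."""
    return (candidate - baseline) / baseline

def review_flags(main, candidate, max_cost_increase=0.10):
    """Evaluate quality and cost independently so a quality win cannot hide a cost jump.

    main / candidate: {"quality": mean score, "usd_per_example": mean cost}
    """
    quality_delta = relative_change(main["quality"], candidate["quality"])
    cost_delta = relative_change(main["usd_per_example"], candidate["usd_per_example"])
    return {
        "quality_delta": quality_delta,
        "cost_delta": cost_delta,
        "needs_cost_review": cost_delta > max_cost_increase,
    }

# Illustrative numbers only.
flags = review_flags(
    main={"quality": 0.84, "usd_per_example": 0.0020},
    candidate={"quality": 0.85, "usd_per_example": 0.0028},
)
print(flags["needs_cost_review"], f"{flags['cost_delta']:.0%}")
```

Keeping the cost flag independent of the quality delta is the point: the PR still surfaces for review even though the quality number moved in the right direction.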

Tier the thresholds. A 0.5% drop on the “easy” subset is unacceptable. A 2% drop on the “hard” subset may be fine if mean quality went up. One global threshold makes everything either too loose or too tight.
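
Tiered thresholds can be expressed as a small lookup. In this sketch the easy and hard limits come from the figures above, read as absolute drops in mean score (an interpretation, since the text does not say whether the percentages are absolute or relative); the medium limit and the scores are assumptions.

```python
from statistics import mean

# Allowed absolute drop in mean score per tier. The medium value is assumed.
TIER_MAX_DROP = {"easy": 0.005, "medium": 0.01, "hard": 0.02}

def tier_means(rows):
    """rows: iterable of (tier, score) pairs -> {tier: mean score}."""
    by_tier = {}
    for tier, score in rows:
        by_tier.setdefault(tier, []).append(score)
    return {tier: mean(scores) for tier, scores in by_tier.items()}

def tier_failures(main_rows, candidate_rows, max_drop=TIER_MAX_DROP):
    """Return {tier: drop} for tiers whose mean fell by more than that tier allows."""
    main, cand = tier_means(main_rows), tier_means(candidate_rows)
    failures = {}
    for tier, baseline in main.items():
        drop = baseline - cand[tier]
        if drop > max_drop[tier]:
            failures[tier] = round(drop, 4)
    return failures

main_rows = [("easy", 1.0)] * 4 + [("hard", 0.5), ("hard", 0.5)]
pr_rows = [("easy", 1.0)] * 3 + [("easy", 0.0)] + [("hard", 0.49), ("hard", 0.5)]
print(tier_failures(main_rows, pr_rows))
```

Here the small dip on the hard tier stays within its limit, while a single regressed easy example is enough to fail the check.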

Show reviewers the diffs. The PR comment should not say “score went from 0.84 to 0.81.” It should link the specific examples that changed and show old output vs new output side by side. The diff is where reviewers make smart calls. The number alone forces dumb ones.
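
Producing those diffs does not need special tooling. The sketch below picks the examples whose scores moved the most and renders old versus new output with the standard library's `difflib`; the example IDs, scores, and outputs are invented for illustration.

```python
import difflib

def top_changed(main_scores, candidate_scores, top_n=10):
    """Return (example_id, score_delta) pairs with the largest absolute change first."""
    deltas = {i: candidate_scores[i] - main_scores[i]
              for i in main_scores if i in candidate_scores}
    changed = [(i, d) for i, d in deltas.items() if d != 0]
    return sorted(changed, key=lambda item: abs(item[1]), reverse=True)[:top_n]

def render_diff(example_id, old_output, new_output):
    """Unified diff of the output on main versus the output on the candidate."""
    lines = difflib.unified_diff(
        old_output.splitlines(),
        new_output.splitlines(),
        fromfile=f"{example_id} (main)",
        tofile=f"{example_id} (candidate)",
        lineterm="",
    )
    return "\n".join(lines)

main_scores = {"a": 1.0, "b": 1.0, "c": 0.5}
pr_scores = {"a": 1.0, "b": 0.0, "c": 0.75}
old = "Refund issued.\nIt will arrive in 5 days."
new = "Refund issued.\nPlease contact support."

print(top_changed(main_scores, pr_scores))
print(render_diff("b", old, new))
```

The rendered text can be pasted into the PR comment inside a code block, next to the aggregate numbers.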

Fail closed in CI, fail open in prod. A flaky eval in CI should block the merge until someone investigates. A flaky online eval running against production traffic should never page someone at 2am. These are different code paths with different SLAs.

The Eval Maturity Model

Teams ask “how good does our evals system need to be?” The honest answer is: as good as the next decision you have to make. A useful internal maturity model:

Level 1 — Vibes. No dataset, no scorers, judgment calls in notebooks. This is where most teams start and where most LLM apps are at first deploy. It works for a demo. It does not work for a release cadence.

Level 2 — Golden dataset, manual run. 50–200 reviewed examples. Scorers exist (mostly deterministic). The team runs evals before a release — sometimes. Catches the worst regressions. Does not catch the subtle ones because it is not automatic.

Level 3 — CI/CD-gated regression suite. The four-part system described above. Every PR runs the suite. Regressions block merges. Cost and quality both tracked. This is the level at which a team can ship multiple prompt or model changes per week without fear. It is the minimum bar for production AI.

Level 4 — Online evals + drift detection. Same suite extended to grade a sample of live traffic continuously. Detects model upgrades the vendor did not announce, slow context drift, segment-specific quality drops. Requires Level 3 to even be coherent — you cannot detect drift if you do not know what stable looks like.

Level 5 — Eval-driven development. Engineers write new eval examples before writing the prompt or tool that should pass them, the way TDD works in backend code. Reserved for systems where the team has internalized the discipline. Most orgs never reach this. The ones that do ship AI changes weekly with confidence.

The right level is one above where you are. If you are at Level 1 and your CFO wants Level 5, the answer is “we have to walk through Level 3 first or none of this compounds.” The framing in “Before you scale AI: production-ready” is the right one for that conversation.

Evals in CI/CD: A Worked Setup

What this looks like in practice:

1. Repo layout. /evals/datasets/ holds versioned JSONL files. /evals/scorers/ holds scorer functions. /evals/configs/ defines which scorers run on which dataset. The dataset is reviewed in pull requests like any other artifact.
2. Local run. make eval runs the full suite locally. Same command CI uses. No “it worked on my machine” eval reports.
3. PR trigger. A GitHub Actions workflow runs on any PR touching /prompts/, /agents/, /retrieval/, /evals/, or model config files. Output is posted as a PR comment.
4. The comment. Shows aggregate scores, per-tier scores (easy/medium/hard), cost delta, and a table of the top 10 examples that changed, with a link to a hosted diff view of old output vs new output.
5. The gate. A required status check. PR cannot merge if aggregate quality drops below threshold, cost rises above threshold, or any “easy” example regresses. Override requires a code-owner approval.
6. Failure logging. Production traces that surface novel failure modes (via your LLM tracing and prompt rollback signals) are queued for SME review and, once labeled, added to the dataset.

Step 6 is the loop that makes the whole thing compound. Production discovers new failure modes. Those failures become golden dataset entries. The updated dataset catches the next regression in CI before it ships. That is the flywheel.
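
The last hop of that loop, turning a labeled failure into a dataset row, can be a very small function. This sketch assumes a JSONL dataset file with an `id` field per row; the file name, the row contents, and the `source_trace` field are hypothetical, and the demo writes to a temporary directory rather than a real repo.

```python
import json
import tempfile
from pathlib import Path

def append_example(dataset_path, example: dict) -> bool:
    """Append a labeled example to a JSONL dataset; skip it if the id already exists."""
    path = Path(dataset_path)
    existing_ids = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    existing_ids.add(json.loads(line)["id"])
    if example["id"] in existing_ids:
        return False
    # Append-only: existing rows are never rewritten, so history stays reviewable.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return True

# Hypothetical failure surfaced from a production trace and labeled by an SME.
row = {
    "id": "billing-0147",
    "input": "Why was I charged twice this month?",
    "tier": "medium",
    "label_type": "rubric",
    "rubric": "Acknowledges the duplicate charge and states the refund timeline",
    "source_trace": "hypothetical-trace-id",
}

with tempfile.TemporaryDirectory() as tmp:
    dataset_file = Path(tmp) / "support.jsonl"
    first = append_example(dataset_file, row)
    second = append_example(dataset_file, row)  # duplicate id, ignored
    line_count = len(dataset_file.read_text(encoding="utf-8").splitlines())

print(first, second, line_count)
```

Because the file lives in the repo, the append still goes through a pull request, which keeps the review step from the dataset properties intact.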

Stop Shipping Prompts on Vibes

If your team is gating LLM releases on notebook spot-checks and Slack screenshots, your release cadence is the bottleneck — not the model. metacto builds production-grade LLM evaluation systems for engineering teams: golden datasets, scorers, CI/CD integration, and the data pipelines that keep the suite from going stale.

What Breaks in Production (Evals Edition)

The failure modes specific to evals at scale:

Score drift from the judge model. Your LLM-as-judge runs on a vendor model. The vendor silently updates it. Your scores shift 3%. Was it your prompt or the judge? Pinning the judge model by exact snapshot version (not by latest alias) and tracking judge-model identity in eval results is mandatory.
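
Pinning can be enforced in code rather than by convention. The sketch below assumes that pinned snapshot identifiers end in a date stamp and that aliases do not; that naming scheme and the model names are assumptions for illustration, so adjust the pattern to whatever your vendor actually uses.

```python
import re
from dataclasses import dataclass

# Assumption: a pinned snapshot id ends in a date such as 2026-01-15 or 20260115.
SNAPSHOT_SUFFIX = re.compile(r"\d{4}-?\d{2}-?\d{2}$")

def is_pinned(model_id: str) -> bool:
    """True when the model id looks like an exact snapshot rather than a moving alias."""
    return bool(SNAPSHOT_SUFFIX.search(model_id))

@dataclass(frozen=True)
class JudgeResult:
    example_id: str
    score: float
    judge_model: str  # stored with every result so score shifts can be attributed

def record_judgement(example_id: str, score: float, judge_model: str) -> JudgeResult:
    """Refuse to record a score produced by an unpinned judge."""
    if not is_pinned(judge_model):
        raise ValueError(f"judge model {judge_model!r} is not pinned to a snapshot")
    return JudgeResult(example_id, score, judge_model)

result = record_judgement("sum-001", 0.8, "judge-model-2026-01-15")  # hypothetical id
print(result.judge_model, is_pinned("judge-model-latest"))
```

With the judge identity stored on every result, a 3% shift between two runs can be checked against a change in `judge_model` before anyone blames the prompt.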

Dataset contamination. Someone copies production failures into the dataset, tunes a prompt against those same examples, and then takes the improved score as evidence that the prompt is better. It is not better; it is overfit to the dataset. Hold-out splits, periodic dataset rotation, and a small never-seen test set guard against this.
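
A hold-out split is easiest to keep honest when it is derived from the example ID rather than stored by hand. This is a minimal sketch, assuming stable string IDs; the 20% hold-out fraction and the ID format are illustrative choices.

```python
import hashlib

def split_for(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Assign an example to 'dev' or 'holdout' deterministically from its id."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "dev"

ids = [f"example-{n:04d}" for n in range(1000)]
splits = [split_for(i) for i in ids]
holdout_share = splits.count("holdout") / len(splits)
print(f"holdout share: {holdout_share:.2f}")
```

Because the assignment depends only on the ID, adding new rows never reshuffles existing ones, and nobody can quietly move an inconvenient example out of the hold-out set.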

Threshold creep. A small regression slips through. The team adjusts the threshold rather than fix the regression. Six months later thresholds are meaningless. Thresholds should require a code-owner approval to change, like CI required-check rules.

The dataset goes stale. Production behavior shifts. The dataset still reflects last quarter. The suite passes but customers are unhappy. The only fix is the closed loop in step 6 above — production failures must mechanically feed the dataset, or the dataset slowly stops being a representative sample of reality.

Cost surprise. A 200-example suite running on every PR sounds cheap. At 30 PRs a day with three judge scorers each running the same vendor model, the bill is real. Track eval spend as a first-class line item and route eval calls through your normal model routing layer so prompt caching applies.
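
The arithmetic behind that bill is worth writing down. The sketch below uses the figures from the paragraph above and assumes one generation call per example plus one call per judge scorer per example; it deliberately leaves out prices, which depend on your vendor and token counts.

```python
def daily_eval_calls(examples: int, prs_per_day: int, judge_scorers: int) -> dict:
    """Count model calls per day for a suite that runs in full on every PR."""
    generation_calls = examples * prs_per_day          # system under test
    judge_calls = generation_calls * judge_scorers     # each output graded by every judge
    return {
        "generation": generation_calls,
        "judge": judge_calls,
        "total": generation_calls + judge_calls,
    }

calls = daily_eval_calls(examples=200, prs_per_day=30, judge_scorers=3)
print(calls)  # {'generation': 6000, 'judge': 18000, 'total': 24000}
```

Multiplying the judge count by your own per-call cost gives the line item to track; the judge calls dominate, which is why caching and routing matter most there.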

How Evals Connect to the Rest of the System

Evals do not stand alone. They are one layer of the production-AI stack, tightly coupled to others:

Upstream: LLM tracing provides the production data your dataset is sampled from.
Adjacent: LLM-as-judge is the scorer technology powering most of the open-ended evals.
Downstream: Prompt versioning and prompt rollback are what evals enable — fast iteration without fear of breaking production.
Around it: AI agent observability is the larger system property that evals contribute to.

Skip any of these and the others get less useful. Tracing without evals is data without decisions. Evals without rollback is detection without recovery. Rollback without versioning is recovery without targets. This is why the prompt is not the product — the product is the whole loop.

At metacto, this loop is what every Operational AI engagement installs. Not because evals are interesting in themselves, but because every other improvement compounds on them. A team with evals can ship five prompt changes a week. A team without them ships one a month and prays.

Conclusion

A production-grade LLM evals regression suite is not a research artifact. It is a release engineering system: a golden dataset that reflects your actual users, scorers that combine deterministic, model-based, and human grading, CI/CD that blocks regressions automatically, and a closed loop that turns production failures into the dataset’s next row.

Most teams know this. Most teams have not built it. The cost of waiting is a release cadence that asymptotes at one prompt change a month — which is the cadence at which AI experiments become shelfware. The cost of building it is a few engineer-weeks of focused work and a commitment to not bypass the gate.

This is one layer of the system underneath the chat box. The gap between an impressive demo and production AI is built out of exactly these layers.

LLM Evals Regression Suite: FAQ

What is an LLM evals regression suite?

An LLM evals regression suite is a versioned golden dataset, a set of scorers (deterministic, LLM-as-judge, and human), a runner that grades model outputs, and a CI/CD gate that blocks pull requests that regress quality or cost past a threshold. It is the LLM equivalent of a unit test suite, and it is the mechanism that lets a team ship prompt and model changes weekly instead of monthly.

How big should a golden dataset be?

Start with 50 reviewed, representative examples — not 5,000 scraped from logs. A small, balanced, SME-graded dataset is more useful than a large unreviewed one. Grow it through a closed loop: production failures, surfaced via LLM tracing, become new dataset rows. Most production systems stabilize at 200–500 examples per major workflow, with separate datasets per workflow rather than one giant pile.

When should I use LLM-as-judge versus deterministic scorers?

Use deterministic scorers (exact match, regex, JSON schema, SQL equivalence) wherever the task admits them — they are fast, free, and unambiguous. Use LLM-as-judge for open-ended tasks where deterministic grading is impossible, but calibrate the judge against a small human-labeled subset every release and never let it be the only scorer on a high-stakes decision. Bias and drift in judge models are real and measurable.

How do I integrate LLM evals into CI/CD?

Trigger an evals run on every PR that touches prompts, retrieval, tool schemas, model config, or agent code. Run the full golden dataset (not a sample). Post a PR comment with aggregate and tiered scores, a cost delta, and side-by-side diffs of the top changed examples. Make the CI check required, gate merges on per-metric thresholds, and require a code-owner approval to override the gate.

What is the LLM eval maturity model?

Level 1 is vibes — no dataset, judgment calls. Level 2 is a golden dataset run manually before releases. Level 3 is a CI/CD-gated regression suite — the minimum bar for production AI. Level 4 adds online evals on live traffic and drift detection. Level 5 is eval-driven development, where engineers write eval examples before the code that should pass them. Pick one level above where you are; skipping levels does not compound.

How often should the golden dataset be updated?

Continuously. The closed loop is: LLM tracing in production surfaces novel failure modes, an SME reviews and labels them, the labeled examples are added to the dataset, the next PR's eval run catches related regressions before they ship. A dataset that has not been updated in a quarter is no longer a representative sample of production reality, and the suite quietly stops being a useful gate.
