LLM Engineering

Building an LLM evaluation harness: shipping with confidence

Modulus May 15, 2026 6 min read

The thing that distinguishes production LLM systems from demos is the answer to one question: how do you know it works? "It seemed fine when I tested it" is not an answer. "We have an evaluation harness that runs on every deployment, and accuracy on our task distribution is above our acceptance threshold" is an answer. The gap between those two states is where most LLM projects fail post-launch.

This guide covers the components of a production-grade LLM evaluation harness, the order in which to build them, and the specific failure modes each component is designed to catch.

TL;DR
  • Start with your task-specific eval set, not with a generic benchmark. 100–300 real examples from your production distribution are enough to start.
  • Build deterministic tests first (format validation, forbidden content, required elements) — they are fast, cheap, and catch a surprising number of regressions.
  • Model-graded evaluation (using a second LLM to score outputs) is effective for subjective quality but requires its own validation.
  • Run your eval in CI on every prompt change — treat prompt changes the same way you treat code changes.
  • The eval set is a living artifact. Add examples from every production failure.

Why evaluation is the most skipped step in LLM development

Evaluation is skipped because it feels like overhead. The model produces outputs that look reasonable. The demo works. Stakeholders are happy. Adding a formal eval harness feels like slowing down to add testing infrastructure to working code. That is exactly the reasoning that ends with a production LLM system producing confidently wrong outputs at scale six weeks after launch, with no automated mechanism to detect the regression.

LLMs introduce a novel failure mode relative to traditional software: they fail probabilistically, inconsistently, and in ways that are invisible without structured measurement. A model that works correctly 94% of the time fails 6% of the time — and without an eval harness, you will not know whether that 6% is stable or drifting upward. A prompt change that fixes one class of failure often introduces another. Detecting this without automated evaluation requires testing every prompt change by hand against hundreds of examples, which nobody does in practice.

Component 1: The task-specific eval set

The foundation of the harness is an eval set composed of real examples from your production task distribution. Generic benchmarks measure general capability. Your eval set measures whether the model does your specific task correctly.

Building the initial eval set: pull 100–300 examples from your existing data or from early user sessions. For each example, record the input and the correct output (or a set of acceptable outputs if there are multiple valid answers). If you are building before launch, construct synthetic examples that represent the realistic distribution of inputs you expect. Weight the eval set toward difficult cases — easy cases are not where failures happen.

The eval set should be stored in version control alongside your prompt templates. Every time the system produces a failure in production, add the failure case to the eval set. Over time, the eval set becomes a comprehensive regression suite that embeds institutional knowledge of every failure mode the system has ever exhibited.
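One way to keep the eval set reviewable in version control is a JSONL file with one example per line. The sketch below is a minimal illustration under that assumption. The field names (`id`, `input`, `acceptable_outputs`, `tags`), the `EvalCase` class, and the sample records are hypothetical, not a prescribed schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_id: str
    input: str
    acceptable_outputs: list[str]  # one or more valid answers
    tags: list[str] = field(default_factory=list)  # e.g. "hard", "prod-failure"


def load_eval_set(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per line; blank lines are skipped."""
    cases = []
    seen_ids = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        case = EvalCase(
            case_id=record["id"],
            input=record["input"],
            acceptable_outputs=record["acceptable_outputs"],
            tags=record.get("tags", []),
        )
        # Reject duplicates and empty references so a bad commit fails fast.
        if case.case_id in seen_ids:
            raise ValueError(f"duplicate case id {case.case_id!r} on line {line_no}")
        if not case.acceptable_outputs:
            raise ValueError(f"case {case.case_id!r} has no acceptable outputs")
        seen_ids.add(case.case_id)
        cases.append(case)
    return cases


# Hypothetical sample data, inlined here so the sketch is self-contained.
SAMPLE = """\
{"id": "refund-001", "input": "Can I return an opened item?", "acceptable_outputs": ["yes_within_30_days"], "tags": ["hard"]}
{"id": "refund-002", "input": "Where is my order?", "acceptable_outputs": ["order_status"]}
"""

cases = load_eval_set(SAMPLE)
```

A line-per-example format keeps diffs small: adding a production failure to the eval set shows up as a single added line in review.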

Component 2: Deterministic tests

Before you build any sophisticated evaluation, implement deterministic tests — rule-based checks that do not require a judge model and produce unambiguous pass/fail results. These are the fastest, cheapest, and most reliable evaluation signals you have.

Categories of deterministic tests:

  • Format validation: If the model is supposed to return JSON, does it return valid JSON? If it is supposed to use a specific schema, does the output conform to that schema? If it is supposed to answer in a specific language, is the output in that language?
  • Required element checks: Does the output contain required elements (specific fields, citations, disclaimers, structured sections)?
  • Forbidden content checks: Does the output contain things it should never contain (PII, competitor mentions, off-topic content, toxic language)?
  • Length constraints: Is the output within the acceptable length range for the task?
  • Keyword presence/absence: Does the output include or exclude specific required or forbidden phrases?

Deterministic tests run in milliseconds and catch a significant fraction of prompt regressions. Implement them first, before any model-graded evaluation.
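The categories above can be sketched as a single function that returns a list of failure codes. This is an illustration, not a complete rule set: the function name `check_output`, the failure code strings, and the simple email pattern standing in for a PII check are all assumptions, and a real PII detector would need to be considerably more thorough.

```python
import json
import re

# Deliberately simple pattern; a stand-in for a real PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def check_output(
    raw: str,
    required_keys: set[str],
    forbidden_phrases: list[str],
    max_chars: int,
) -> list[str]:
    """Return failure codes for one model output; an empty list means pass."""
    failures = []

    # Length constraint.
    if len(raw) > max_chars:
        failures.append("too_long")

    # Format validation: the output must be a JSON object.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        failures.append("invalid_json")
        return failures
    if not isinstance(parsed, dict):
        failures.append("not_an_object")
        return failures

    # Required element check.
    missing = required_keys - set(parsed)
    if missing:
        failures.append("missing_keys:" + ",".join(sorted(missing)))

    # Forbidden content checks (case-insensitive phrase match, then email).
    lowered = raw.lower()
    for phrase in forbidden_phrases:
        if phrase.lower() in lowered:
            failures.append(f"forbidden_phrase:{phrase}")
    if EMAIL_RE.search(raw):
        failures.append("possible_email_pii")

    return failures
```

Returning codes rather than a bare boolean makes it possible to count failures by type across the eval set, which shows which rule a prompt change broke.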

Component 3: Reference-based evaluation

For tasks where there is a single correct answer or a small set of correct answers, compare model outputs against a reference using string matching, embedding similarity, or structured diff. Reference-based evaluation is objective and cheap.

Appropriate tasks for reference-based evaluation: information extraction (did the model correctly extract the required fields?), classification (did the model assign the correct category?), translation quality relative to a reference translation, code generation where the output can be executed and tested programmatically.

For extraction and classification tasks, track precision, recall, and F1 across categories. For generation tasks, track ROUGE or embedding similarity scores, but calibrate these against human judgment initially: a high ROUGE score does not guarantee high quality, and a low ROUGE score does not guarantee low quality.
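For a single-label classification task, per-category precision, recall, and F1 can be computed directly from gold and predicted labels. The sketch below uses only the standard library; the function name `per_class_prf` and the example labels are illustrative assumptions.

```python
from collections import Counter


def per_class_prf(gold: list[str], predicted: list[str]) -> dict[str, dict[str, float]]:
    """Per-label precision, recall, and F1 for single-label classification."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    report = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical labels for a support-ticket classifier.
gold = ["billing", "billing", "shipping", "other"]
predicted = ["billing", "shipping", "shipping", "other"]
report = per_class_prf(gold, predicted)
```

Reporting per category rather than a single aggregate makes it visible when a prompt change improves one category at the expense of another.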

Component 4: Model-graded evaluation

For tasks where the correct output is subjective or varies significantly across acceptable answers, use a second LLM as a judge. Model-graded evaluation (also called LLM-as-judge) is effective for quality dimensions that are difficult to express as rules: coherence, helpfulness, relevance, tone appropriateness, factual accuracy relative to provided context.

Critical considerations for model-graded evaluation:

  • The judge prompt must be explicit about the evaluation criteria. Vague rubrics produce unreliable scores.
  • Validate the judge model's scores against human labels on a sample of your eval set before trusting it as a signal.
  • Use a different model family for judging than for generation. A judge from the same family as the generator tends to favor outputs in its own style (self-preference bias).
  • Ask the judge for a score and a justification. Justifications make it easier to audit failures.
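  • Track agreement between the judge model and human reviewers on an ongoing sample, using an inter-rater statistic such as Cohen's kappa. If agreement drops, the judge no longer matches human judgment on your current inputs and needs re-validation.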
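The considerations above can be sketched in a few lines: an explicit rubric, a verdict plus justification, and an agreement statistic against human labels. Everything named here is an assumption for illustration. `call_judge` is a hypothetical callable that wraps whichever judge model you use, `JUDGE_TEMPLATE` is one possible rubric, and Cohen's kappa is one common choice of agreement statistic, not the only one.

```python
import json
from typing import Callable

# Hypothetical rubric: explicit criterion, fixed output format.
JUDGE_TEMPLATE = """You are grading an answer against the provided context.
Criterion: the answer is PASS only if every factual claim is supported by the context.
Context:
{context}
Answer:
{answer}
Reply with JSON only: {{"verdict": "PASS" or "FAIL", "justification": "<one sentence>"}}"""


def grade(call_judge: Callable[[str], str], context: str, answer: str) -> dict:
    """Ask the judge for a verdict and justification; never trust the format blindly."""
    reply = call_judge(JUDGE_TEMPLATE.format(context=context, answer=answer))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return {"verdict": "ERROR", "justification": "judge reply was not valid JSON"}
    if not isinstance(parsed, dict) or parsed.get("verdict") not in ("PASS", "FAIL"):
        return {"verdict": "ERROR", "justification": "judge reply had no valid verdict"}
    return {
        "verdict": parsed["verdict"],
        "justification": str(parsed.get("justification", "")),
    }


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Treating an unparseable judge reply as `ERROR` rather than as a fail keeps judge malfunctions separate from genuine quality failures in the metrics.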

Component 5: CI/CD integration

An eval harness that you run manually before major deployments is better than nothing. An eval harness that runs automatically on every prompt change, every model update, and every dependency change is a production-grade system. The goal is to treat prompt changes the same way software engineers treat code changes: nothing ships without passing the eval suite.

CI/CD integration for LLM evaluation:

  • Run deterministic tests and reference-based eval on every pull request. These are fast enough to run as blocking checks.
  • For high-cost eval sets, run model-graded evaluation on a nightly schedule or on manual trigger rather than on every pull request.
  • Set acceptance thresholds for each metric. A deployment that drops any metric below threshold is blocked pending review.
  • Track metric trends over time, not just pass/fail. A gradual decline is often more dangerous than a sudden drop because it is easier to miss.
  • Alert on any regression above a defined magnitude, even if the overall score remains above threshold.
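The threshold and regression rules above can be expressed as a small gate function that a CI job calls after the eval run. This is a sketch under assumed names: `gate`, the metric names, and the threshold values are hypothetical, and the baseline is assumed to be the metrics from the last accepted run.

```python
def gate(
    current: dict[str, float],
    baseline: dict[str, float],
    thresholds: dict[str, float],
    max_drop: float,
) -> tuple[list[str], list[str]]:
    """Compare eval metrics to thresholds and to the last accepted baseline.

    Returns (blocking, warnings). Blocking entries should fail the build;
    warnings flag regressions that are still above threshold.
    """
    blocking, warnings = [], []
    for name, minimum in thresholds.items():
        value = current.get(name)
        if value is None:
            # A metric that silently disappears is treated as a failure.
            blocking.append(f"{name}: metric missing from eval results")
            continue
        if value < minimum:
            blocking.append(f"{name}: {value:.3f} is below threshold {minimum:.3f}")
        previous = baseline.get(name)
        if previous is not None and previous - value > max_drop:
            warnings.append(f"{name}: dropped {previous - value:.3f} since baseline")
    return blocking, warnings


# Hypothetical metric names and values.
thresholds = {"format_pass_rate": 0.99, "f1": 0.85}
baseline = {"format_pass_rate": 0.995, "f1": 0.91}
current = {"format_pass_rate": 0.995, "f1": 0.86}

blocking, warnings = gate(current, baseline, thresholds, max_drop=0.03)
```

In a pipeline, the calling script would print both lists and exit with a non-zero status when `blocking` is non-empty, which is what turns the eval into a blocking check.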

Component 6: Production monitoring

The eval harness is a pre-deployment gate. Production monitoring is the ongoing signal after deployment. Both are required. Production monitoring components for LLM systems:

  • Log every input, retrieved context (for RAG), and output with a session ID that enables end-to-end tracing of any individual failure.
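  • Run a lightweight quality classifier on a sample of production outputs. A 1–5% sample is often enough at moderate traffic, but the reliability of the estimate depends on the absolute number of sampled outputs, not the percentage, so set the sample rate according to your volume.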
  • Track output distribution metrics: length distributions, category distributions, refusal rates. Drift in any of these is a signal.
  • Implement user feedback mechanisms (thumbs up/down, correction flows) and route negative feedback back into the eval set.
  • Alert on anomalous output patterns — sudden spikes in refusals, unusual length distributions, high rates of format failures.
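Sampling for quality classification can be sketched as below. Hashing the session ID makes the sampling decision deterministic, so the same session is always in or out of the sample, and a Wilson score interval shows how much uncertainty a given sample size leaves in the measured pass rate. The function names are assumptions, and the Wilson interval is one standard choice of confidence interval for a proportion rather than a requirement.

```python
import hashlib
import math


def in_sample(session_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of sessions by hashing the ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# The same observed pass rate is far less certain with fewer samples.
low_100, high_100 = wilson_interval(90, 100)
low_10k, high_10k = wilson_interval(9000, 10000)
```

The interval width is driven by the number of sampled outputs, which is why a fixed percentage can be too small for a low-traffic system and larger than needed for a high-traffic one.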

Eval harness build checklist

  • Task-specific eval set with 100+ real examples, version controlled alongside prompts.
  • Format validation tests covering all expected output schemas.
  • Required element and forbidden content tests.
  • Reference-based evaluation with precision/recall tracking for extraction/classification tasks.
  • Model-graded evaluation with validated judge prompt and human calibration sample.
  • Acceptance thresholds defined for each evaluation metric.
  • CI/CD pipeline that runs deterministic tests and reference-based eval on every change.
  • Production logging infrastructure capturing inputs, contexts, and outputs.
  • Production monitoring with sampled quality classification.
  • Process for adding production failures to the eval set within 24 hours of discovery.

A proper evaluation harness is the difference between a system you hope works and a system you can prove works. It is non-negotiable for LLM development done to a production standard. For the security-specific tests that should live in your eval suite, see our guide on defending against prompt injection. For model selection evaluation methodology, see our piece on picking between Claude, GPT, Llama, and Mistral.

