The thing that distinguishes production LLM systems from demos is the answer to one question: how do you know it works? "It seemed fine when I tested it" is not an answer. "We have an evaluation harness that runs on every deployment, and our accuracy on our task distribution is above our acceptance threshold" is an answer. The gap between those two states is where most LLM projects fail post-launch.
This guide covers the components of a production-grade LLM evaluation harness, the order in which to build them, and the specific failure modes each component is designed to catch.
Evaluation is skipped because it feels like overhead. The model produces outputs that look reasonable. The demo works. Stakeholders are happy. Adding a formal eval harness feels like slowing down to add testing infrastructure to working code. That is exactly the reasoning that leads to a production LLM system producing confidently wrong outputs at scale six weeks after launch, with no automated mechanism to detect the regression.
LLMs introduce a novel failure mode relative to traditional software: they fail probabilistically, inconsistently, and in ways that are invisible without structured measurement. A model that works correctly 94% of the time fails 6% of the time — and without an eval harness, you will not know whether that 6% is stable or drifting upward. A prompt change that fixes one class of failure often introduces another. Detecting this without automated evaluation requires testing every prompt change by hand against hundreds of examples, which nobody does in practice.
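To make the point about one fix introducing another failure concrete, here is a minimal Python sketch that compares per-category pass rates between two eval runs. The function names, the category labels, the sample data, and the 2-point tolerance are all illustrative assumptions, not part of any particular tool.

```python
from collections import defaultdict

def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {category: (old_rate, new_rate)} where the rate dropped by more than tolerance."""
    base = pass_rate_by_category(baseline)
    cand = pass_rate_by_category(candidate)
    return {
        c: (base[c], cand[c])
        for c in base
        if c in cand and base[c] - cand[c] > tolerance
    }

# Hypothetical runs: the new prompt fixes "names" but breaks "dates".
baseline = ([("dates", True)] * 9 + [("dates", False)]
            + [("names", True)] * 7 + [("names", False)] * 3)
candidate = ([("dates", True)] * 7 + [("dates", False)] * 3
             + [("names", True)] * 10)

regressions = find_regressions(baseline, candidate)
print(regressions)
```

In this made-up data the overall pass rate rises from 80% to 85%, yet the "dates" category drops from 90% to 70%. An aggregate number alone would hide that regression, which is why the comparison is done per category.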
The foundation of the harness is an eval set composed of real examples from your production task distribution. Generic benchmarks measure general capability. Your eval set measures whether the model does your specific task correctly.
Building the initial eval set: pull 100–300 examples from your existing data or from early user sessions. For each example, record the input and the correct output (or a set of acceptable outputs if there are multiple valid answers). If you are building before launch, construct synthetic examples that represent the realistic distribution of inputs you expect. Weight the eval set toward difficult cases — easy cases are not where failures happen.
The eval set should be stored in version control alongside your prompt templates. Every time the system produces a failure in production, add the failure case to the eval set. Over time, the eval set becomes a comprehensive regression suite that embeds institutional knowledge of every failure mode the system has ever exhibited.
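One way to keep such an eval set in version control is a JSONL file with one case per line. The sketch below assumes a hypothetical record layout (the field names `id`, `input`, `acceptable_outputs`, and `tags` are choices made for illustration) and shows a small loader that rejects duplicate ids and cases without a reference answer.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    acceptable_outputs: tuple  # one or more valid answers
    tags: tuple = ()           # e.g. "hard", "prod-failure" (hypothetical labels)

def load_eval_set(lines):
    """Parse JSONL lines into EvalCase objects, validating as we go."""
    cases, seen_ids = [], set()
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        case = EvalCase(
            case_id=record["id"],
            input_text=record["input"],
            acceptable_outputs=tuple(record["acceptable_outputs"]),
            tags=tuple(record.get("tags", [])),
        )
        if case.case_id in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case.case_id!r}")
        if not case.acceptable_outputs:
            raise ValueError(f"line {line_no}: no acceptable outputs")
        seen_ids.add(case.case_id)
        cases.append(case)
    return cases

# Inline sample standing in for a file such as evals/cases.jsonl (hypothetical path).
SAMPLE_LINES = [
    '{"id": "c1", "input": "Where is my refund?", "acceptable_outputs": ["refund"], "tags": ["hard"]}',
    "",
    '{"id": "c2", "input": "Update my card", "acceptable_outputs": ["billing", "account"]}',
]
cases = load_eval_set(SAMPLE_LINES)
print(len(cases), cases[0].tags)
```

Because the loader accepts any iterable of lines, an open file handle can be passed in directly. Adding a production failure to the suite then amounts to appending one line to the file and committing it.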
Before you build any sophisticated evaluation, implement deterministic tests — rule-based checks that do not require a judge model and produce unambiguous pass/fail results. These are the fastest, cheapest, and most reliable evaluation signals you have.
Categories of deterministic tests:
Deterministic tests run in milliseconds and catch a significant fraction of prompt regressions. Implement them first, before any model-graded evaluation.
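As a rough illustration of what rule-based checks can look like, the Python sketch below runs four of them against a single model output. The specific checks (valid JSON, required keys, a length cap, forbidden phrases), the key names, and the limits are assumptions chosen for the example, not a definitive list.

```python
import json

def check_valid_json(output):
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

def check_required_keys(output, required):
    """Pass if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except ValueError:
        return False
    return isinstance(data, dict) and set(required).issubset(data)

def check_max_length(output, max_chars):
    """Pass if the output stays within the character budget."""
    return len(output) <= max_chars

def check_no_forbidden_phrases(output, phrases):
    """Pass if none of the phrases appear (case-insensitive)."""
    lowered = output.lower()
    return not any(p.lower() in lowered for p in phrases)

def run_checks(output):
    """Run every check and return {check name: passed}."""
    return {
        "valid_json": check_valid_json(output),
        "required_keys": check_required_keys(output, {"category", "confidence"}),
        "max_length": check_max_length(output, 500),
        "no_forbidden": check_no_forbidden_phrases(output, ["as an ai language model"]),
    }

good = '{"category": "refund", "confidence": 0.91}'
bad = "As an AI language model, I think the category is refund."
print(run_checks(good))
print(run_checks(bad))
```

Each check returns a plain boolean, so results can be aggregated per check across the whole eval set and a regression shows up as a drop in one named column.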
For tasks where there is a single correct answer or a small set of correct answers, compare model outputs against a reference using string matching, embedding similarity, or structured diff. Reference-based evaluation is objective and cheap.
Appropriate tasks for reference-based evaluation: information extraction (did the model correctly extract the required fields?), classification (did the model assign the correct category?), translation (how close is the output to a reference translation?), and code generation where the output can be executed and tested programmatically.
For extraction and classification tasks, track precision, recall, and F1 across categories. For generation tasks, track ROUGE or embedding similarity scores, but calibrate these against human judgment initially: a high ROUGE score does not guarantee high quality, and a low ROUGE score does not guarantee low quality.
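For a classification task, the per-category metrics mentioned above can be computed in a few lines of plain Python. This is a minimal sketch; the function name and the label values are made up for illustration, and a library such as scikit-learn would do the same job in practice.

```python
def per_category_prf(gold, predicted):
    """Return {label: {"precision", "recall", "f1"}} for parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # Define a metric as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Hypothetical reference labels and model predictions.
gold = ["billing", "billing", "refund", "refund", "other"]
predicted = ["billing", "refund", "refund", "refund", "other"]

scores = per_category_prf(gold, predicted)
for label, s in scores.items():
    print(label, {k: round(v, 3) for k, v in s.items()})
```

In this sample, "billing" has perfect precision but only 0.5 recall, because one billing case was labelled as a refund. That is the kind of category-level weakness a single accuracy figure would blur.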
For tasks where the correct output is subjective or varies significantly across acceptable answers, use a second LLM as a judge. Model-graded evaluation (also called LLM-as-judge) is effective for quality dimensions that are difficult to express as rules: coherence, helpfulness, relevance, tone appropriateness, factual accuracy relative to provided context.
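A model-graded check can be sketched without committing to any particular provider by passing the judge in as a plain function. In the Python example below, the prompt template, the 1 to 5 scale, the `SCORE:` reply convention, and the `fake_judge` stub are all assumptions for illustration; a real harness would replace the stub with a call to whichever model API it uses.

```python
import re

JUDGE_TEMPLATE = (
    "You are grading an answer on one dimension: {dimension}.\n"
    "Use only the provided context to judge factual claims.\n\n"
    "Context:\n{context}\n\n"
    "Answer to grade:\n{answer}\n\n"
    "Explain your reasoning briefly, then finish with a line of the form\n"
    "SCORE: <integer from 1 to 5>"
)

def build_judge_prompt(context, answer, dimension):
    """Fill the rubric template for a single grading call."""
    return JUDGE_TEMPLATE.format(dimension=dimension, context=context, answer=answer)

def parse_score(reply):
    """Extract the 1-5 score, or None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def grade(context, answer, dimension, call_judge):
    """call_judge is any function mapping a prompt string to a reply string."""
    reply = call_judge(build_judge_prompt(context, answer, dimension))
    return parse_score(reply)

def fake_judge(prompt):
    """Stand-in for a real model call, so the sketch runs offline."""
    return "REASONING: The answer matches the context.\nSCORE: 4"

score = grade(
    context="Orders ship within 3 business days.",
    answer="Your order should ship in about three business days.",
    dimension="factual accuracy relative to provided context",
    call_judge=fake_judge,
)
print(score)
```

Returning `None` for an unparseable reply, rather than guessing a score, keeps judge formatting failures visible as their own category in the results.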
Critical considerations for model-graded evaluation:
An eval harness that you run manually before major deployments is better than nothing. An eval harness that runs automatically on every prompt change, every model update, and every dependency change is a production-grade system. The goal is to treat prompt changes the same way software engineers treat code changes: nothing ships without passing the eval suite.
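The "nothing ships without passing the eval suite" rule can be enforced with a small gate script whose exit code the CI system acts on. The following Python sketch assumes the eval results are already available as a list of booleans; the 90% threshold and the function names are placeholders, not recommendations.

```python
import sys

def gate(results, threshold):
    """Return (passed, pass_rate). An empty suite fails rather than passing vacuously."""
    if not results:
        return False, 0.0
    rate = sum(results) / len(results)
    return rate >= threshold, rate

def main(results, threshold=0.9):
    """Print a one-line summary and return a process exit code (0 = ship, 1 = block)."""
    passed, rate = gate(results, threshold)
    status = "PASS" if passed else "FAIL"
    print(f"{status}: pass rate {rate:.1%} (threshold {threshold:.0%})")
    return 0 if passed else 1

# Hypothetical results from one eval run: 19 of 20 cases passed.
exit_code = main([True] * 19 + [False])

# In a CI job the script would end with sys.exit(exit_code),
# so that a non-zero code fails the pipeline step.
```

Treating an empty result list as a failure is a deliberate choice in this sketch: a misconfigured job that loads zero cases should block the deployment, not wave it through.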
CI/CD integration for LLM evaluation:
The eval harness is a pre-deployment gate. Production monitoring is the ongoing signal after deployment. Both are required. Production monitoring components for LLM systems:
A proper evaluation harness is the difference between a system you hope works and a system you can prove works. It is non-negotiable in LLM development services done to production standard. For the security-specific tests that should live in your eval suite, see our guide on defending against prompt injection. For model selection evaluation methodology, see our piece on picking between Claude, GPT, Llama, and Mistral.