How to Build an Eval Harness Before Your RAG Goes Live

Deploying a RAG system without an evaluation harness is deploying blind. RAGAS gives you the four metrics that matter. Here is how to build the test suite before the first user query hits production.

A legal operations team deploys a RAG system for contract review. The system works well at launch. Six months later, the team updates the document corpus. A senior associate notices the system is citing provisions that no longer apply. Nobody knows when the regression happened. The eval harness was never built. Every change since launch was a guess.

This is what deploying without evals looks like in practice: not a dramatic failure, but a slow undetected drift that surfaces at the worst possible moment.

The evaluation harness is not the last thing you build. It is what makes everything else stable enough to trust.

Deploying Without Evals Is Not Faster

Teams skip evaluation harnesses because they look like overhead. Writing test questions, curating reference answers, instrumenting metrics before the system has a single user: it is visible effort spent on something that feels hypothetical when the system is not yet in production.

The actual accounting is different. A RAG system without an eval harness has no way to determine whether a change improved or degraded quality. Every prompt adjustment is a leap of faith. Every chunking strategy update is untested. Every model provider update, every corpus addition, every retrieval parameter change: each one is either an improvement or a regression, and without the harness there is no instrument to tell which.

Without evals, a system in production accumulates undetected regressions until a visible failure forces investigation. By that point, the cause is buried under weeks or months of changes, each individually plausible, none individually tracked. The investigation costs more than the harness would have. The remediation disrupts users who had begun to rely on the system.

The evaluation harness is not a test suite written after the system is stable. It is the instrument used to make the system stable in the first place. Building it after the fact is like calibrating a scale after weighing all the samples.

The Four RAGAS Metrics That Matter

[Image: a RAG evaluation dashboard tying faithfulness, answer relevancy, context precision, and context recall to the system layer each metric diagnoses.]

RAGAS (Retrieval Augmented Generation Assessment) is an open-source evaluation framework that measures RAG quality across four dimensions. Each metric points to a different layer of the system, which is what makes the framework useful for diagnosis rather than just scoring.

Faithfulness measures whether the generated answer contains only claims supported by the retrieved context. A faithful answer does not introduce information the retrieved documents do not contain. Low Faithfulness means the generation layer is hallucinating: producing plausible-sounding claims that have no grounding in the retrieved evidence. In a legal, medical, or financial application, this is not a quality problem. It is a liability.

Answer Relevancy measures whether the generated answer actually addresses the question asked. A system can be perfectly faithful while still failing to answer the question: it accurately summarizes the retrieved documents without addressing what the user needed. Low Answer Relevancy points to a mismatch between the query understanding and the generation framing.

Context Precision measures what proportion of the retrieved chunks were actually relevant to answering the question. Retrieval systems often fetch documents that are semantically adjacent to the query but do not contain the evidence needed. These irrelevant chunks dilute the context and can actively mislead the generation layer. Low Context Precision points to retrieval ranking: the system is retrieving too much noise alongside the signal.

Context Recall measures whether the retrieval surfaced all the chunks needed to answer the question completely. A system can retrieve relevant documents and still miss the critical piece of evidence. Low Context Recall points to ingestion or chunking: the evidence exists in the corpus but is either not ingested or fragmented in a way that makes it unretrievable for the specific query.

The diagnostic value of the four-metric framework is that each metric implicates a different component. You do not troubleshoot a retrieval problem with generation fixes, and vice versa. The metrics separate the signal.
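
To make that concrete, the sketch below scores a single test record on all four metrics with the open-source ragas library. It assumes the classic `evaluate` API, a Hugging Face `datasets` Dataset with question, answer, contexts, and ground_truth columns, and an LLM judge configured behind the scenes; column names and imports shift between ragas versions, so treat it as illustrative rather than canonical.

```python
# Minimal sketch: score one test record on the four RAGAS metrics.
# Assumes the classic ragas `evaluate` API and a datasets.Dataset with
# question / answer / contexts / ground_truth columns. ragas uses an LLM
# judge under the hood, so API credentials must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One evaluation record: the user question, the system's answer, the chunks
# the retriever returned, and the expert-written reference answer.
records = {
    "question": ["What is the notice period for terminating the MSA?"],
    "answer": ["The MSA requires 60 days' written notice to terminate for convenience."],
    "contexts": [[
        "Section 9.2: Either party may terminate this Agreement for convenience "
        "upon sixty (60) days' prior written notice."
    ]],
    "ground_truth": ["Sixty days' written notice, per Section 9.2 of the MSA."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.97, ...}
```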

Building the Question Set

The eval harness requires a test set of questions with reference answers. This is where most teams underinvest, and where most of the value actually lives.

The test set is not a representative sample of possible queries. It is a curated set of questions that covers the failure modes the system must not exhibit. The categories to cover:

Factual recall: a specific piece of information from a known document. The test verifies that the retrieval finds the right document and the generation extracts the right fact.

Multi-document synthesis: the correct answer requires combining information from two or more sources. This tests whether the retrieval surfaces the full evidence set and whether the generation integrates across sources without introducing contradictions.

Negative space: the question cannot be answered from the corpus. The correct answer is an explicit acknowledgment that the information is not available, not a hallucinated response that fills the gap with plausible-sounding content. This is often the most important test for enterprise systems, where a confident wrong answer is worse than an honest “I do not have that information.”

Adversarial: phrasing that might trigger hallucination. Queries that use terminology present in the corpus but in a different context, or that introduce a false premise to see whether the system corrects it or accepts it.

Domain-specific edge cases: the questions that experienced domain practitioners recognize as the ones where the system is most likely to fail. These come from the same expert interviews used to build the knowledge system.

The question set should be written by domain experts, not by the team that built the system. The team that built the system knows what the system can answer. The experts know what users will actually ask. The gap between those two sets of questions is where the most important failures live.

Minimum viable test set: fifty questions across the major document categories in the corpus. Run them before go-live to establish the reference baseline. The test set is not static: every user query that produces a wrong or incomplete answer is a candidate for inclusion. The harness grows as the system is used, which means it catches new failure modes as they emerge rather than only the ones visible before launch.
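
One way to keep those categories visible in the harness itself is to record them on each test question. The structure below is a sketch; the field names and category labels are assumptions, not a RAGAS requirement.

```python
# Illustrative structure for the curated question set. Recording the category
# on each entry lets a regression be traced back to a failure mode, not just
# an aggregate score.
from dataclasses import dataclass

@dataclass
class EvalQuestion:
    id: str
    category: str                 # factual_recall | multi_doc_synthesis | negative_space | adversarial | edge_case
    question: str
    ground_truth: str             # expert-written reference answer
    source_documents: list[str]   # document IDs the answer should be grounded in

TEST_SET = [
    EvalQuestion(
        id="q-001",
        category="negative_space",
        question="What is the penalty cap for late delivery under the 2019 vendor agreement?",
        ground_truth="The corpus contains no 2019 vendor agreement; the correct answer states the information is not available.",
        source_documents=[],
    ),
    # ... roughly fifty questions spread across the corpus's document categories
]
```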

Instrumentation Before Launch

The evaluation harness has two operating modes: offline evaluation before deployment, and online monitoring in production.

Offline evaluation runs the test set against the system before any change is deployed. The score on each of the four RAGAS metrics establishes the baseline. Every subsequent change, from prompt rewording to corpus update to model version change, must pass the regression test before deployment.
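
A minimal sketch of that offline step follows, assuming your own pipeline call and a scoring helper are passed in as callables; both names, and the baseline file format, are placeholders rather than library functions.

```python
# Sketch of the offline evaluation step: answer the full test set, score it,
# and persist the averaged RAGAS metrics as the baseline that later
# regression runs are compared against. `run_pipeline` and `score_with_ragas`
# are placeholders for your own stack.
import json
from datetime import datetime, timezone

def build_baseline(test_set, run_pipeline, score_with_ragas, path="eval_baseline.json"):
    answers = [run_pipeline(q.question) for q in test_set]
    scores = score_with_ragas(test_set, answers)  # e.g. {"faithfulness": 0.91, ...}
    baseline = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "test_set_size": len(test_set),
        "scores": scores,
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```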

Online monitoring samples production queries in near-real-time and evaluates them against the same metrics. The sample rate depends on volume and sensitivity: a low-volume internal tool might evaluate every query; a high-volume customer-facing system might sample at five percent and flag statistical deviations.
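
A small sketch of the sampling decision, assuming each query carries a stable identifier; the five percent rate and the hashing scheme are illustrative, not prescriptive.

```python
# Deterministic sampling: hash the query ID so the same query is always in or
# out of the evaluated slice, which keeps repeated queries comparable.
import hashlib

SAMPLE_RATE = 0.05  # five percent, adjusted to volume and sensitivity

def should_evaluate(query_id: str) -> bool:
    digest = hashlib.sha256(query_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < SAMPLE_RATE * 10_000
```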

For online observability, LangSmith traces each RAG step, recording the retrieved chunks, the generated answer, token usage, and latency per step. For environments with data residency constraints, where sending production queries to a cloud observability service is a compliance concern, Langfuse provides a self-hosted alternative. The choice between them is a data governance decision, not a capability decision: both provide the observability needed for the monitoring layer.

What to log at minimum: the query text, the retrieved chunks with source metadata and document identifiers, the generated answer, confidence signals where available, and user feedback when the interface captures it. The logs are the raw material for the test set. Every time a user flags an answer as wrong, that query and the correct answer go into the test set for the next offline evaluation run.
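
A sketch of what that minimum log record might look like; the field names are assumptions rather than a LangSmith or Langfuse schema.

```python
# Minimal shape of a per-query log record. These records double as candidates
# for the offline test set whenever a user flags an answer as wrong.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RagQueryLog:
    query_text: str
    retrieved_chunks: list[dict]          # each with text, source metadata, document ID
    generated_answer: str
    confidence: Optional[float] = None    # where the pipeline exposes one
    user_feedback: Optional[str] = None   # captured from the interface, if available
    metadata: dict = field(default_factory=dict)  # model version, latency, token usage
```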

Alert thresholds should be defined before launch: what Faithfulness score triggers a human review of recent queries; what Context Recall drop triggers an ingestion audit; what latency increase triggers an infrastructure review. The thresholds are organizational decisions, not engineering defaults. They reflect the risk tolerance of the specific application.
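
As an illustration only, those decisions might be captured in a reviewable config like the one below; the numbers are placeholders, not recommendations.

```python
# Placeholder alert thresholds. The values are organizational decisions to be
# set against the application's risk tolerance, not engineering defaults.
ALERT_THRESHOLDS = {
    "faithfulness_min": 0.85,      # below this on sampled queries: human review
    "context_recall_drop": 0.10,   # drop vs. baseline that triggers an ingestion audit
    "p95_latency_ms_max": 4000,    # above this: infrastructure review
}
```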

The Regression Test Protocol

Every change to a RAG system is a regression risk.

Prompt changes affect generation behavior in ways that are sometimes unpredictable. Chunking strategy updates change what evidence is available for retrieval. Embedding model changes alter the semantic space used for retrieval ranking. Retrieval parameter adjustments change the volume and composition of retrieved context. Document corpus updates add, modify, or supersede content that the rest of the pipeline depends on.

Each of these changes can improve the system on the specific case that motivated the change while degrading it on cases not considered. Without a regression test protocol, the team has no way to detect the degradation until it surfaces as a user complaint or a visible failure.

The regression protocol: implement the change in a staging environment, run the full test set, compare the four RAGAS scores against the baseline, investigate any metric that drops, and deploy only when the aggregate quality is stable or improved. The protocol does not require perfection on every test. It requires that no metric drops below a defined threshold without explicit acceptance.
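
A sketch of the comparison step, assuming the baseline file produced by the offline run above; the per-metric margins are placeholders that each application has to set for itself.

```python
# Regression gate: compare a staging run's RAGAS scores against the saved
# baseline and report every metric that dropped beyond its allowed margin.
REGRESSION_MARGINS = {
    "faithfulness": 0.02,
    "answer_relevancy": 0.03,
    "context_precision": 0.03,
    "context_recall": 0.03,
}

def regression_gate(baseline: dict, candidate: dict) -> list[str]:
    """Return the metrics that regressed past their margin; empty means deployable."""
    failures = []
    for metric, margin in REGRESSION_MARGINS.items():
        if candidate[metric] < baseline[metric] - margin:
            failures.append(f"{metric}: {baseline[metric]:.2f} -> {candidate[metric]:.2f}")
    return failures
```

A non-empty result blocks the deployment unless someone explicitly reviews and accepts the drop, which mirrors the protocol's rule that no metric falls below threshold without explicit acceptance.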

For a legal RAG system, a five-point drop in Faithfulness (0.05 on RAGAS's zero-to-one scale) after a corpus update means the system is producing more citations to provisions that do not apply. That is a compliance risk. The regression test catches it before it reaches the users who rely on the system for actual decisions.

Evals as Organizational Capability

Casetext, a legal AI company acquired by Thomson Reuters for USD 650 million [unverified: figure cited from public press reports; verify current reporting before citing], built its evaluation discipline before productizing its AI features. The eval harness was not a technical artifact. It was the mechanism by which the team built confidence in a system that needed to operate in a professional liability context.

The organizational dimension of evals is what most discussions miss. The team that builds the test set, reviews the RAGAS scores, and investigates each regression learns something about their domain and their users that no amount of log monitoring can replicate. They learn which queries are hard. They learn which document types produce retrieval failures. They learn where the generation layer is brittle and why.

A RAG system with an eval harness improves deliberately. The team has a feedback loop with enough resolution to identify causes and test fixes. A RAG system without one improves accidentally or degrades silently. The team responds to visible failures but has no instrument for detecting the failures that have not yet surfaced in a way the users can articulate.

The starting point is simpler than it looks. Do not wait for the system to be production-ready before building the first fifty questions. Build them from the same expert sessions used for discovery. Run them against the prototype. The gaps the harness reveals at prototype stage will shape the system architecture before a single production query arrives. That is the right time to discover them.

The eval harness is not overhead. It is what makes the system trustworthy enough to rely on.


Terraris.ai designs and deploys production RAG systems with evaluation harnesses built in from day one. Explore how we approach enterprise RAG implementation.