Build the eval harness first
An evaluation harness is not exotic infrastructure for mature LLM applications. It is the first thing to build, before retrieval, before agent loops, before any optimisation. The minimum viable version fits in an afternoon.
Most teams building LLM applications skip the evaluation harness. They iterate on prompts, swap models, add retrieval steps, and feel their way to whether the system is improving. This is the LLM-application equivalent of refactoring without tests — you can do it, but you will not know whether you are making things better or worse, and on the days when the demo wows you, you will not know why.
The minimum viable evaluation harness is not exotic. It is a CSV.
What an eval harness actually is
Strip the framing language away and an eval harness is five things:
- a fixed set of inputs that represent the messy reality of what users send
- a way to run the current pipeline end-to-end on each input
- a way to record the output, ideally diffed against the previous run
- a small set of graded judgments — pass / fail / ambiguous — per input
- some way to roll those judgments into a number you can stare at
That’s the entire definition. It does not need a framework. It does not need a UI. The first version can be a Python script and a CSV, and probably should be.
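Concretely, that first version can look something like the sketch below. This is illustrative rather than prescriptive: the run_pipeline import stands in for whatever your pipeline happens to be, and inputs.txt, the results directory, and the column names are assumptions, not a required layout.

```python
# eval.py -- minimal harness: run the pipeline on a fixed set of inputs,
# write one CSV per run, leave the judgment column blank for hand-labelling.
# Sketch only: run_pipeline() stands in for whatever your pipeline is.
import csv
from datetime import datetime, timezone
from pathlib import Path

from my_app import run_pipeline  # assumption: your pipeline behind one function

# ~20 hand-picked cases, one per line
INPUTS = Path("inputs.txt").read_text(encoding="utf-8").splitlines()

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
out_path = Path("results") / f"{stamp}.csv"
out_path.parent.mkdir(exist_ok=True)

with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "input", "output", "judgment"])
    writer.writeheader()
    for text in INPUTS:
        writer.writerow({
            "timestamp": stamp,
            "input": text,
            "output": run_pipeline(text),
            "judgment": "",  # filled in by hand: pass / fail / ambiguous
        })

print(f"Wrote {len(INPUTS)} rows to {out_path}")
```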
When to build it
Day one. Before retrieval, before agent orchestration, before any model upgrade or prompt iteration. The eval harness is not what you build after the pipeline stabilises — the pipeline does not stabilise without the feedback loop, and the feedback loop is the eval harness.
The case for waiting usually goes: “I’ll build evals once I know what good looks like.” This reverses cause and effect. You will not know what good looks like until you have something that lets you compare today’s output to yesterday’s output on the same inputs.
The minimum viable version
Concretely, an afternoon’s work:
- Twenty inputs. Hand-pick them from real usage if you have it, or write them by hand if you don’t. They should cover the messy realities — ambiguous queries, edge cases, the categories you know are hard. Do not pick easy inputs.
- A script that runs your pipeline on each input. Whatever your pipeline is: a single LLM call, a RAG chain, a multi-step agent. The script writes inputs, outputs, and a timestamp to a CSV.
- A column for your judgment. Pass, fail, or ambiguous, marked manually. Do not skip the manual step. The hand-labelling is where you discover what “good” actually means for your application.
- A pass rate. Pass divided by total. That is your number. Watch it move as you change things.
That’s it. Twenty inputs, one script, one column, one number.
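Turning the labelled CSV into the number is a few more lines. A sketch, continuing the same assumed layout as the script above, that also compares against the previous run so you can see the delta:

```python
# score.py -- roll hand labels into the one number, and compare to the previous run.
# Assumes results/<timestamp>.csv files with a hand-filled "judgment" column.
import csv
from pathlib import Path

def pass_rate(path: Path) -> float:
    with path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    passes = sum(1 for r in rows if r["judgment"].strip().lower() == "pass")
    return passes / len(rows)  # fail and ambiguous both count against you

runs = sorted(Path("results").glob("*.csv"))
current = pass_rate(runs[-1])
print(f"pass rate: {current:.2f}")
if len(runs) > 1:
    previous = pass_rate(runs[-2])
    print(f"previous:  {previous:.2f}  ({current - previous:+.2f})")
```

Two runs and a delta are already enough to catch the regressions described in the next section.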
What changes once you have it
Three things, in roughly this order:
You stop reverting good changes. The first time you run the eval after a “great” prompt change, you will catch a regression you would have shipped. This pays for the harness in a single iteration.
You can make confident claims. “Adding the reranker improved hit rate from 0.62 to 0.78 on the test set” is a sentence you cannot say without an eval harness. It is also the foundation of every honest case study, every credible roadmap, and every productive conversation with a stakeholder.
Iteration speed goes up, not down. The intuition says maintaining the harness will slow you down; the opposite happens. Prompt changes go from half a day of vibes-based testing to a twenty-minute run of python eval.py. The harness is a forcing function for fast, honest iteration.
Graduating beyond the spreadsheet
The afternoon version is not the end state. It is the seed. As the application matures, the harness grows:
- The set expands from twenty cases to a hundred or more.
- An LLM-as-judge step replaces some fraction of the manual labelling, sample-checked against your own judgments (a sketch follows this list).
- The cases get categorised so you can track regressions per category.
- The harness gets wired into CI, so a prompt change that drops the pass rate fails the build.
- A small dashboard replaces the CSV.
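The LLM-as-judge step does not need to be elaborate either. A minimal sketch, assuming the OpenAI Python SDK and the same CSV layout as above; the model name, rubric, and grading scale are illustrative choices, not a recommendation.

```python
# judge.py -- sketch of an LLM-as-judge pass over an unlabelled results CSV.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import csv
from pathlib import Path

from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading an assistant's answer. Reply with exactly one word: "
    "pass, fail, or ambiguous.\n\nQuestion:\n{input}\n\nAnswer:\n{output}"
)

def judge(row: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": RUBRIC.format(**row)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict if verdict in {"pass", "fail", "ambiguous"} else "ambiguous"

path = sorted(Path("results").glob("*.csv"))[-1]
with path.open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    row["judgment"] = judge(row)

with path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

Keep hand-labelling a random sample of the judged rows; if the judge disagrees with you more often than you can tolerate, tighten the rubric before trusting its numbers.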
But none of that matters until the afternoon version exists. The exotic version is what an afternoon turns into eventually, not what you start with.
Required reading
Hamel Husain’s field guide to rapidly improving AI products is the canonical write-up. If you read one piece on this topic, read that.
The shorter version
Build the harness first. Twenty inputs. One script. One column. One number. Everything else — retrieval depth, agent orchestration, model routing — is downstream of being able to measure whether you are making things better.