Eval Driven Development

Jan 21 2026

When most people get an idea for a new LLM-powered, non-deterministic piece of software, they start by building it. That’s how we’ve always built software, so it makes sense, but I think it gets things backwards: you should start with evaluations before anything else.

Starting with evaluations sounds unobjectionable, but in practice it’s surprisingly difficult, and I rarely see engineers do it.

I’ve had a version of this conversation with a lot of people and I wanted to write it down for posterity.

Made-Up Example

Let’s say you’re building a code review bot. You excitedly set up your GitHub webhooks and your LLM connections, spend some time massaging prompts, vibe-test it against a few PRs, and announce it with great fanfare to your team. Then the criticisms start rolling in:

“These comments add a lot of noise.”

“This comment is wrong.”

“I hate this.”

What happened?
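
One plausible reading, in line with the thesis above, is that nothing measured "noise" or "wrong" before launch. Here is a minimal sketch in Python of what an eval for this made-up bot could look like. It assumes the bot can be reduced to a function that takes a diff and returns a set of issue labels; the names EvalCase, score, and placeholder_bot, the labels, and the three cases are all hypothetical, and placeholder_bot is a keyword matcher standing in for the real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One labeled pull request (all data here is made up for illustration)."""
    name: str
    diff: str
    expected_issues: frozenset[str]  # issue labels a good reviewer should raise


def score(bot: Callable[[str], set[str]], cases: list[EvalCase]) -> dict[str, float]:
    """Run the bot on every case and return precision and recall over issue labels."""
    true_pos = false_pos = false_neg = 0
    for case in cases:
        raised = bot(case.diff)
        true_pos += len(raised & case.expected_issues)
        false_pos += len(raised - case.expected_issues)  # noisy or wrong comments
        false_neg += len(case.expected_issues - raised)  # real issues the bot missed
    # With nothing raised (or nothing expected), treat the metric as perfect.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 1.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 1.0
    return {"precision": precision, "recall": recall}


def placeholder_bot(diff: str) -> set[str]:
    """Hypothetical stand-in for the LLM call: a keyword matcher, not a real reviewer."""
    issues = set()
    if "eval(" in diff:
        issues.add("unsafe-eval")
    if "print(" in diff:
        issues.add("leftover-debug-print")
    issues.add("style-nit")  # always nitpicks, to mimic a noisy bot
    return issues


# Assumed labeled cases; a real set would come from actual PRs and reviewer judgments.
CASES = [
    EvalCase("unsafe eval", "+ result = eval(user_input)", frozenset({"unsafe-eval"})),
    EvalCase("debug print", "+ print(payload)", frozenset({"leftover-debug-print"})),
    EvalCase("clean change", "+ total = price * quantity", frozenset()),
]

if __name__ == "__main__":
    print(score(placeholder_bot, CASES))
```

On this toy data the placeholder bot finds every labeled issue but also nitpicks every PR, so recall is 1.0 and precision is 0.4. That is the "noise" complaint expressed as a number you could have seen before announcing anything.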

References

Demystifying Evals for AI Agents - Anthropic