Eval Driven Development
When most people get an idea for a new LLM-powered, non-deterministic piece of software, they start by building it. That’s how we’ve always built software, so it makes sense, but I think this gets it backwards: you should start with evaluations before anything else.
Starting with evaluations sounds unobjectionable. But in practice it’s surprisingly difficult, and I rarely see engineers doing it.
I’ve had a version of this conversation with a lot of people and I wanted to write it down for posterity.
Made Up Example
Let’s say you’re building a code review bot. You excitedly set up your GitHub webhooks and your LLM connections, spend some time massaging prompts, vibe test it against a few PRs, and announce it with great fanfare to your team. Then the criticisms start rolling in:
“These comments add a lot of noise.”
“This comment is wrong.”
“I hate this.”
What happened?
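One way to read these complaints is as measurements nobody took before launch. As a minimal sketch of what an eval written first might look like, the snippet below treats “noise” as precision and missed issues as recall against a couple of hand-labeled PRs. Every name here (`EvalCase`, `score`, `noisy_stub_reviewer`, the sample diffs) is hypothetical, and the reviewer is a stub standing in for the real LLM call, not anything from an actual bot.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """A hand-labeled PR: the diff plus the diff lines a good reviewer should flag."""
    name: str
    diff: str
    expected_lines: frozenset[int]


def score(review: Callable[[str], set[int]], cases: list[EvalCase]) -> dict[str, float]:
    """Run the reviewer over every case and report precision and recall."""
    true_pos = false_pos = false_neg = 0
    for case in cases:
        flagged = review(case.diff)
        true_pos += len(flagged & case.expected_lines)
        false_pos += len(flagged - case.expected_lines)   # noise
        false_neg += len(case.expected_lines - flagged)   # missed issues
    flagged_total = true_pos + false_pos
    expected_total = true_pos + false_neg
    return {
        "precision": true_pos / flagged_total if flagged_total else 1.0,
        "recall": true_pos / expected_total if expected_total else 1.0,
    }


def noisy_stub_reviewer(diff: str) -> set[int]:
    """Hypothetical stand-in for the bot: comments on every added line (1-indexed)."""
    return {i for i, line in enumerate(diff.splitlines(), start=1) if line.startswith("+")}


# Hypothetical labeled cases; a real set would come from PRs your team has reviewed.
CASES = [
    EvalCase(
        name="sql-injection",
        diff="+query = 'SELECT * FROM t WHERE id=' + user_id\n+log(query)",
        expected_lines=frozenset({1}),
    ),
    EvalCase(
        name="clean-refactor",
        diff="+def total(xs):\n+    return sum(xs)",
        expected_lines=frozenset(),
    ),
]

if __name__ == "__main__":
    print(score(noisy_stub_reviewer, CASES))
```

With the stub, this reports a precision of 0.25 and a recall of 1.0: it catches the one real issue, but three of its four comments are noise. That is the “these comments add a lot of noise” complaint, expressed as a number you could have seen before announcing anything.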
References
- Demystifying Evals for AI Agents - Anthropic