How to Evaluate an Agent (Before It Embarrasses You)
Vibes are not a test suite. A practical approach to evals that catches regressions without a research budget.
The moment an agent does real work, "it seemed fine" stops being good enough. You need evals, but you do not need a research lab.
Start with a golden set
Collect 10 to 20 real inputs and the output you'd be happy with. That is your golden set. It is the single highest-leverage hour you will spend.
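One way to store a golden set is as plain JSON Lines, one case per line, so it diffs cleanly in version control. This is a minimal sketch, not a prescribed format: the `GoldenCase` fields, the case IDs, and the sample refund inputs are all made up for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GoldenCase:
    # Hypothetical schema: adjust the fields to whatever your agent takes and returns.
    case_id: str
    input_text: str
    expected_output: str


def dump_golden_set(cases):
    """Serialize cases to JSON Lines text, one case per line."""
    return "\n".join(json.dumps(asdict(case)) for case in cases)


def load_golden_set(text):
    """Parse JSON Lines text back into GoldenCase objects, skipping blank lines."""
    return [
        GoldenCase(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


# Illustrative cases only; real ones should come from real traffic.
golden_set = [
    GoldenCase(
        case_id="refund-001",
        input_text="I was charged twice for my order.",
        expected_output="Apologize, state the refund policy, offer one next step.",
    ),
    GoldenCase(
        case_id="refund-002",
        input_text="Can I return an item after 45 days?",
        expected_output="State the return window politely and suggest an alternative.",
    ),
]

serialized = dump_golden_set(golden_set)
restored = load_golden_set(serialized)
```

Keeping the set as text means adding a case is a one-line change, which makes it more likely you will actually add the case the next time something breaks.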
Three things worth measuring
- Task success: did it accomplish the goal? Often a human yes/no, and that's fine to start.
- Tool correctness: did it call the right tools with sane arguments?
- Cost and latency: tokens and seconds per run. These creep up silently.
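The three measurements above can be recorded per run and rolled up into a small summary. The sketch below assumes you already capture these values somewhere; `RunRecord`, its fields, and the tool names are hypothetical. For brevity it checks only which tools were called, in order, and leaves argument checks out.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Hypothetical record of one agent run against one golden case.
    case_id: str
    success: bool          # human yes/no judgment
    tools_called: list     # tool names the agent actually called, in order
    expected_tools: list   # tool names you expected for this case
    tokens: int            # total tokens consumed by the run
    seconds: float         # wall-clock time for the run


def tool_correct(record):
    """True when the agent called exactly the expected tools, in order."""
    return record.tools_called == record.expected_tools


def summarize(records):
    """Aggregate success, tool accuracy, and cost/latency over a set of runs."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    return {
        "success_rate": sum(r.success for r in records) / n,
        "tool_accuracy": sum(tool_correct(r) for r in records) / n,
        "avg_tokens": sum(r.tokens for r in records) / n,
        "avg_seconds": sum(r.seconds for r in records) / n,
    }


# Illustrative numbers, not measurements.
records = [
    RunRecord("refund-001", True, ["lookup_order"], ["lookup_order"], 1000, 2.0),
    RunRecord("refund-002", False, ["send_email"], ["lookup_order"], 3000, 4.0),
]
summary = summarize(records)
```

Logging the averages alongside each change is what makes the silent creep in tokens and seconds visible.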
Grade with a rubric, not a gut
Write a one-paragraph rubric per task ("a good refund reply is polite, states the policy, and offers one next step"). You can grade by hand at first, then have a second model grade against the same rubric to scale.
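Grading against a rubric with a second model might look roughly like this. The sketch assumes the grader is any function that takes a prompt string and returns text; `judge`, the prompt wording, and the PASS/FAIL convention are illustrative choices, not a specific vendor's API. A stub stands in for the model call so the example runs on its own.

```python
from typing import Callable

RUBRIC = "A good refund reply is polite, states the policy, and offers one next step."


def build_grading_prompt(rubric: str, task_input: str, output: str) -> str:
    """Assemble the text sent to the grader; the wording here is an assumption."""
    return (
        "Grade the reply against the rubric. "
        "Answer PASS or FAIL on the first line, then give one sentence of reasoning.\n\n"
        f"Rubric: {rubric}\n\n"
        f"Input: {task_input}\n\n"
        f"Reply: {output}\n"
    )


def grade(rubric: str, task_input: str, output: str,
          judge: Callable[[str], str]) -> bool:
    """Return True only when the grader's first line starts with PASS."""
    verdict = judge(build_grading_prompt(rubric, task_input, output)).strip()
    if not verdict:
        return False  # treat an empty answer as a failure, not a pass
    return verdict.splitlines()[0].strip().upper().startswith("PASS")


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call; replace with your own client.
    return "PASS\nThe reply is polite, cites the policy, and offers a next step."


passed = grade(
    RUBRIC,
    "I was charged twice for my order.",
    "Sorry about that. Our policy refunds duplicate charges within 5 days. "
    "Would you like me to start the refund now?",
    stub_judge,
)
```

Because the same rubric text feeds both the human and the model grader, you can spot-check the model's verdicts by hand before trusting it at scale.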
Run evals on every change
Before you tweak a prompt or swap a model, run the golden set. After, run it again. A change that fixes one case and breaks three is common, and invisible without evals.
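The before-and-after comparison can be as simple as diffing two dictionaries of pass/fail results. A minimal sketch, assuming each run produces a mapping from case ID to a boolean; the function name and the sample results are hypothetical.

```python
def compare_runs(before, after):
    """Compare two {case_id: passed} mappings over the cases present in both."""
    shared = [case_id for case_id in before if case_id in after]
    return {
        # Failed before the change, passes after it.
        "fixed": sorted(c for c in shared if not before[c] and after[c]),
        # Passed before the change, fails after it.
        "broken": sorted(c for c in shared if before[c] and not after[c]),
    }


# Illustrative results: the change fixes one case and breaks three.
before = {"refund-001": False, "refund-002": True, "refund-003": True, "refund-004": True}
after = {"refund-001": True, "refund-002": False, "refund-003": False, "refund-004": False}

report = compare_runs(before, after)
# report["fixed"] == ["refund-001"]
# report["broken"] == ["refund-002", "refund-003", "refund-004"]
```

Looking at the list of broken case IDs, rather than a single aggregate score, is what surfaces the "fixed one, broke three" pattern.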
The trap to avoid
Don't optimize your evals into a maze. Twenty real cases you actually look at beat a thousand synthetic ones you never read.