Measuring If Your Agent Actually Works
Move from "it seems fine" to evidence — with a test set you can build in an hour.
Measuring If Your Agent Actually Works
The moment an agent does real work, "it seems fine" stops being enough. You need a way to know — without a research budget.
Build a golden set
Collect 15-20 real inputs and the output you would be happy with. That list is the single most valuable hour you will spend on quality.
Measure three things
- Did it succeed? Often a human yes/no to start.
- Did it use tools correctly? Right tool, sane arguments.
- What did it cost? Tokens and seconds — these creep up silently.
Run it on every change
- Before you tweak a prompt or swap a model, run the golden set.
- After, run it again.
- Keep changes that help the set without breaking the rest.
The trap
Twenty real cases you actually read beat a thousand synthetic ones you never look at.