Payments are in test mode. Use card 4242 4242 4242 4242 with any future expiry & CVC.
Knowledge hub
Evaluation·7 min read

Measuring If Your Agent Actually Works

Move from "it seems fine" to evidence — with a test set you can build in an hour.

𝕏inf@

Measuring If Your Agent Actually Works

The moment an agent does real work, "it seems fine" stops being enough. You need a way to know — without a research budget.

Build a golden set

Collect 15-20 real inputs and the output you would be happy with. That list is the single most valuable hour you will spend on quality.

Measure three things

  • Did it succeed? Often a human yes/no to start.
  • Did it use tools correctly? Right tool, sane arguments.
  • What did it cost? Tokens and seconds — these creep up silently.

Run it on every change

  1. Before you tweak a prompt or swap a model, run the golden set.
  2. After, run it again.
  3. Keep changes that help the set without breaking the rest.

The trap

Twenty real cases you actually read beat a thousand synthetic ones you never look at.

Found this useful? Share it.

𝕏inf@