Evaluation · 8 min read

How to Evaluate an Agent (Before It Embarrasses You)

Vibes are not a test suite. A practical approach to evals that catches regressions without a research budget.

The moment an agent does real work, "it seemed fine" stops being good enough. You need evals, but you do not need a research lab.

Start with a golden set

Collect 10 to 20 real inputs and the output you'd be happy with. That is your golden set. It is the single highest-leverage hour you will spend.
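A golden set can live in something as plain as a list in a Python file. The sketch below is one possible shape, not a prescribed format: the `GoldenCase` fields, the example cases, and the `validate_golden_set` helper are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One real input paired with the output you'd be happy with (hypothetical schema)."""
    case_id: str
    user_input: str
    expected_output: str


# Illustrative cases only; in practice these come from real traffic.
GOLDEN_SET = [
    GoldenCase(
        case_id="refund-01",
        user_input="I was charged twice for my order.",
        expected_output="Apologize, confirm the duplicate charge, offer a refund.",
    ),
    GoldenCase(
        case_id="refund-02",
        user_input="Can I return an item I bought last week?",
        expected_output="State the return policy and give one next step.",
    ),
]


def validate_golden_set(cases):
    """Reject duplicate ids and empty fields; return the number of cases."""
    seen = set()
    for case in cases:
        if case.case_id in seen:
            raise ValueError(f"duplicate case_id: {case.case_id}")
        if not case.user_input.strip() or not case.expected_output.strip():
            raise ValueError(f"empty field in case: {case.case_id}")
        seen.add(case.case_id)
    return len(cases)
```

Keeping the set in version control next to the prompt means a change to either one shows up in the same diff.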

Three things worth measuring

  • Task success: did it accomplish the goal? Often a human yes/no, and that's fine to start.
  • Tool correctness: did it call the right tools with sane arguments?
  • Cost and latency: tokens and seconds per run. These creep up silently.
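To make those three measurements concrete, here is a minimal sketch of a per-run record and a summary over a batch of runs. The `RunRecord` fields and the `summarize` function are assumptions for illustration. Tool correctness is reduced to "same tool names in the same order", which ignores arguments, so treat it as a starting point rather than a full check.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """What one eval run produced (hypothetical schema)."""
    case_id: str
    success: bool              # task success, e.g. a human yes/no
    expected_tools: list[str]  # tool names you expected, in order
    called_tools: list[str]    # tool names the agent actually called
    tokens: int                # cost proxy
    seconds: float             # latency


def summarize(runs):
    """Aggregate the three measurements over a non-empty list of runs."""
    return {
        "task_success_rate": mean(1.0 if r.success else 0.0 for r in runs),
        # Simplification: compares tool names only, not their arguments.
        "tool_match_rate": mean(
            1.0 if r.called_tools == r.expected_tools else 0.0 for r in runs
        ),
        "mean_tokens": mean(r.tokens for r in runs),
        "mean_seconds": mean(r.seconds for r in runs),
    }
```

Logging the summary after each eval run gives you a baseline, which is what makes the silent creep in tokens and seconds visible.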

Grade with a rubric, not a gut

Write a one-paragraph rubric per task ("a good refund reply is polite, states the policy, and offers one next step"). You can grade by hand at first, then have a second model grade against the same rubric to scale.
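One way to keep hand grading and model grading interchangeable is to split the rubric into criteria and pass the judge in as a function. The sketch below assumes that design. `REFUND_RUBRIC` restates the example rubric above, `grade` is a hypothetical helper, and `keyword_judge` is a toy stand-in for a human or a second model, not a real grader.

```python
from typing import Callable

# The refund rubric from the text, split into separately checkable criteria.
REFUND_RUBRIC = [
    "is polite",
    "states the policy",
    "offers one next step",
]

# A judge takes (criterion, reply) and returns pass/fail.
# It could be a person at a prompt or a call to a second model.
Judge = Callable[[str, str], bool]


def grade(reply: str, rubric: list[str], judge: Judge) -> dict:
    """Score a reply against every criterion; it passes only if all pass."""
    results = {criterion: judge(criterion, reply) for criterion in rubric}
    return {"passed": all(results.values()), "criteria": results}


def keyword_judge(criterion: str, reply: str) -> bool:
    """Toy judge for demonstration only: crude keyword matching."""
    keywords = {
        "is polite": ["sorry", "thank"],
        "states the policy": ["policy"],
        "offers one next step": ["reply with", "contact"],
    }
    text = reply.lower()
    return any(word in text for word in keywords[criterion])
```

Because `grade` only depends on the `Judge` signature, swapping the toy judge for a model-backed one does not change the rubric or the results format.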

Run evals on every change

Before you tweak a prompt or swap a model, run the golden set. After, run it again. A change that fixes one case and breaks three is common, and invisible without evals.
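The before/after comparison can be a few lines of code. This sketch assumes each run is stored as a mapping from case id to pass/fail; `compare_runs` and the case ids are hypothetical.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Diff two golden-set runs, keyed by case id.

    A case missing from `after` is treated as failing, so a dropped
    case shows up as a regression rather than disappearing quietly.
    """
    fixed = sorted(c for c in before if not before[c] and after.get(c, False))
    broken = sorted(c for c in before if before[c] and not after.get(c, False))
    return {"fixed": fixed, "broken": broken}


# Example: a prompt tweak that fixes one case and breaks three.
before = {"c1": False, "c2": True, "c3": True, "c4": True, "c5": True}
after = {"c1": True, "c2": False, "c3": False, "c4": False, "c5": True}
report = compare_runs(before, after)
print(report)
```

A non-empty `broken` list is a reasonable signal to stop and read those cases before shipping the change.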

The trap to avoid

Don't optimize your evals into a maze. Twenty real cases you actually look at beat a thousand synthetic ones you never read.
