Agent Evaluation Checklist

A free, printable checklist for testing agents before they ship.

About this product

A one-page checklist that walks you through building a golden set, writing a grading rubric, and catching regressions before they reach production. Use it to move from "it seemed fine" to evidence.

What's inside — free

Print this. Run it before any agent does real work. The goal is to move from "it seemed fine" to evidence.

Build a golden set

  • [ ] Collected 15-20 real inputs (not synthetic).
  • [ ] Wrote the output you'd be happy with for each.
  • [ ] Stored them somewhere you'll actually re-run.
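
One way to make a golden set easy to re-run is to keep it as a JSON Lines file next to the agent code. The sketch below is only an illustration: the `GoldenCase` fields, the function names, and the file layout are assumptions, not something the checklist prescribes.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class GoldenCase:
    """One golden-set entry (hypothetical field names)."""
    case_id: str   # stable id so results can be compared across runs
    input: str     # a real input, copied from actual usage
    expected: str  # the output you would be happy with


def save_golden_set(cases: list[GoldenCase], path: Path) -> None:
    """Write one JSON object per line so diffs stay readable in version control."""
    with path.open("w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(asdict(case), ensure_ascii=False) + "\n")


def load_golden_set(path: Path) -> list[GoldenCase]:
    """Read the cases back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [GoldenCase(**json.loads(line)) for line in f if line.strip()]
```

Keeping the file in the same repository as the prompts means every change to the agent can be checked against the same inputs.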

Write a rubric

  • [ ] One paragraph per task describing a "good" answer.
  • [ ] Defined what counts as a hard fail.
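
A rubric can live in code as plain data so it is versioned with the agent. The following is a minimal sketch under stated assumptions: the `Rubric` shape and the substring-based hard-fail check are hypothetical simplifications, and many real hard fails will need a human or a more careful check.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Rubric for one task (hypothetical structure)."""
    task: str
    good_answer: str  # the one-paragraph description of a "good" answer
    # Assumption: hard fails are expressed as phrases that must never appear.
    hard_fail_phrases: list[str] = field(default_factory=list)


def is_hard_fail(rubric: Rubric, output: str) -> bool:
    """Return True if the output contains any forbidden phrase (case-insensitive)."""
    lowered = output.lower()
    return any(phrase.lower() in lowered for phrase in rubric.hard_fail_phrases)
```

For example, a support-agent rubric might list "your password is" as a hard-fail phrase, so any output that leaks a credential fails regardless of how good the rest of the answer is.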

Measure three things

  • [ ] Task success — did it accomplish the goal? (human yes/no is fine to start)
  • [ ] Tool correctness — right tool, sane arguments?
  • [ ] Cost & latency — tokens and seconds per run, tracked over time.
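
The three measurements above can be recorded per run and rolled up into a small summary. This is a sketch only; `RunRecord`, its field names, and `summarize` are assumed names for illustration, and the yes/no values are expected to come from a human grader at first.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Result of running the agent on one golden case (hypothetical fields)."""
    case_id: str
    task_success: bool  # did it accomplish the goal? (human yes/no)
    tool_correct: bool  # right tool, sane arguments?
    tokens: int         # total tokens used in the run
    seconds: float      # wall-clock time for the run


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into rates and averages."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    return {
        "task_success_rate": sum(r.task_success for r in records) / n,
        "tool_correct_rate": sum(r.tool_correct for r in records) / n,
        "mean_tokens": sum(r.tokens for r in records) / n,
        "mean_seconds": sum(r.seconds for r in records) / n,
    }
```

Saving one summary per run, with a date and the prompt or model version, is enough to track cost and latency over time.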

Run on every change

  • [ ] Baseline the current agent on the golden set.
  • [ ] Re-run before AND after every prompt or model change.
  • [ ] Reject changes that fix one case but break others.
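
The "fix one case but break others" rule can be checked mechanically by comparing per-case pass/fail results before and after a change. A minimal sketch, assuming results are stored as a mapping from case id to pass/fail (the function names are hypothetical):

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return ids of cases that passed on the baseline but do not pass on the candidate.

    A case missing from the candidate results is treated as a failure.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def should_accept(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Accept a change only if no previously passing case now fails."""
    return not find_regressions(baseline, candidate)
```

With this rule, a change that fixes case "b" but breaks case "c" is rejected even though the overall pass count is unchanged.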

Before go-live

  • [ ] Guardrails tested (it refuses what it must).
  • [ ] A human approval gate on anything irreversible.
  • [ ] A kill switch you can hit without a deploy.
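
The approval gate and kill switch can be sketched as a thin wrapper around every action the agent takes. Everything here is an assumption for illustration: the flag-file mechanism, the `run_action` name, and its parameters. A feature-flag service or a database row would serve the same purpose, as long as it can be flipped without a deploy.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def agent_enabled(flag_path: Path) -> bool:
    """The agent runs only while the kill-switch file does not exist."""
    return not flag_path.exists()


def run_action(
    action: Callable[[], T],
    *,
    irreversible: bool,
    flag_path: Path,
    approved_by: Optional[str] = None,
) -> T:
    """Run an action behind the kill switch and, if irreversible, a human approval gate."""
    if not agent_enabled(flag_path):
        raise RuntimeError("kill switch is on; agent actions are disabled")
    if irreversible and approved_by is None:
        raise PermissionError("irreversible action requires human approval")
    return action()
```

Creating the flag file (for example with `touch` on the host) stops all actions on the next call, with no redeploy needed.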