Evaluation · 8 min read

How to Evaluate an Agent (Before It Embarrasses You)

Vibes are not a test suite. A practical approach to evals that catches regressions without a research budget.

The moment an agent does real work, "it seemed fine" stops being good enough. You need evals — but you do not need a research lab.

Start with a golden set

Collect 10–20 real inputs and, for each one, write down the output you'd be happy with. That is your golden set. Building it is the single highest-leverage hour you will spend.
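
One way to store a golden set is as a small list of records that round-trips through JSON Lines, so it can live in version control next to your prompts. This is a minimal sketch: the names GoldenCase, dump_jsonl, and load_jsonl, the field names, and the two sample cases are all illustrative assumptions, not an established format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str   # hypothetical stable identifier, used to track a case across runs
    input: str     # a real input your agent has seen
    expected: str  # the output you'd be happy with


def dump_jsonl(cases: List[GoldenCase]) -> str:
    """Serialize cases as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(c)) for c in cases)


def load_jsonl(text: str) -> List[GoldenCase]:
    """Parse JSON Lines back into cases, skipping blank lines."""
    return [GoldenCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Illustrative cases only; replace them with inputs from your own traffic.
GOLDEN_SET = [
    GoldenCase(
        case_id="refund-001",
        input="I was charged twice for my order.",
        expected="Apologize, state the refund policy, and offer one next step.",
    ),
    GoldenCase(
        case_id="refund-002",
        input="Can I return an item after 45 days?",
        expected="State the return window politely and suggest contacting support.",
    ),
]
```

Keeping an identifier on each case is a design choice that pays off later: it lets you say which specific cases a change fixed or broke, rather than only reporting an overall pass rate.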

Three things worth measuring

Task success — did it accomplish the goal? Often a human yes/no, and that's fine to start.
Tool correctness — did it call the right tools with sane arguments?
Cost and latency — tokens and seconds per run. These creep up silently.
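
The three measurements above can be recorded per run and averaged over the golden set. The sketch below assumes you already capture, for each run, a human yes/no verdict, the tool calls the agent made, a token count, and elapsed seconds. ToolCall, RunRecord, tools_correct, and summarize are hypothetical names, and the "same order, expected arguments must match" rule is just one possible definition of tool correctness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class RunRecord:
    case_id: str
    success: bool               # task success: a human yes/no is fine to start
    tool_calls: List[ToolCall]  # what the agent actually called
    tokens: int                 # cost proxy
    seconds: float              # wall-clock latency


def tools_correct(actual: List[ToolCall], expected: List[ToolCall]) -> bool:
    """Same tools in the same order, and every expected argument matches.

    Extra arguments on the actual call are tolerated (an assumption of this sketch).
    """
    if len(actual) != len(expected):
        return False
    for got, want in zip(actual, expected):
        if got.name != want.name:
            return False
        for key, value in want.args.items():
            if key not in got.args or got.args[key] != value:
                return False
    return True


def summarize(records: List[RunRecord], expected_tools: Dict[str, List[ToolCall]]) -> dict:
    """Aggregate the three measurements over one pass of the golden set."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    tool_ok = sum(tools_correct(r.tool_calls, expected_tools[r.case_id]) for r in records)
    return {
        "success_rate": sum(r.success for r in records) / n,
        "tool_correct_rate": tool_ok / n,
        "avg_tokens": sum(r.tokens for r in records) / n,
        "avg_seconds": sum(r.seconds for r in records) / n,
    }
```

Logging the summary for every run gives you a baseline, which is what makes silent creep in tokens or seconds visible.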

Grade with a rubric, not a gut

Write a one-paragraph rubric per task ("a good refund reply is polite, states the policy, and offers one next step"). You can grade by hand at first, then have a second model grade against the same rubric to scale.
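
Model-based grading against a rubric can be sketched as a prompt builder plus a strict verdict parser. Everything here is an assumption for illustration: the prompt wording, the PASS/FAIL convention, and the judge parameter, which stands in for whatever function sends a prompt to your second model and returns its text. No particular model API is implied.

```python
from typing import Callable

RUBRIC = "A good refund reply is polite, states the policy, and offers one next step."


def build_grading_prompt(rubric: str, task_input: str, output: str) -> str:
    """Put the rubric, the input, and the agent's reply into one grading prompt."""
    return (
        "Grade the reply against the rubric.\n"
        f"Rubric: {rubric}\n"
        f"Input: {task_input}\n"
        f"Reply: {output}\n"
        "Answer PASS or FAIL on the first line, then one sentence of reasoning."
    )


def parse_verdict(raw: str) -> bool:
    """Read PASS/FAIL from the first line; refuse to guess on anything else."""
    lines = raw.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first.startswith("PASS"):
        return True
    if first.startswith("FAIL"):
        return False
    raise ValueError(f"unparseable verdict: {raw!r}")


def grade(rubric: str, task_input: str, output: str, judge: Callable[[str], str]) -> bool:
    """judge is a hypothetical callable: prompt text in, grader model's text out."""
    return parse_verdict(judge(build_grading_prompt(rubric, task_input, output)))
```

Raising on an unparseable verdict, rather than defaulting to pass or fail, keeps grader mistakes from quietly skewing the results. It is also worth grading a handful of cases by hand now and then to confirm the second model still agrees with you.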

Run evals on every change

Before you tweak a prompt or swap a model, run the golden set. After, run it again. A change that fixes one case and breaks three is common — and invisible without evals.
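
The before-and-after comparison can be as simple as diffing two pass/fail maps keyed by case. This is a minimal sketch with hypothetical names (compare_runs, is_regression). It assumes each run yields a dictionary from case id to pass/fail, and it treats a case missing from the second run as failing.

```python
from typing import Dict, List


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """List the cases a change fixed and the cases it broke.

    A case absent from `after` counts as failing (an assumption of this sketch).
    """
    fixed = sorted(c for c, ok in before.items() if not ok and after.get(c, False))
    broken = sorted(c for c, ok in before.items() if ok and not after.get(c, False))
    return {"fixed": fixed, "broken": broken}


def is_regression(diff: Dict[str, List[str]]) -> bool:
    """Flag the change if it broke any case that used to pass."""
    return len(diff["broken"]) > 0


# Illustrative data: one case fixed, three broken.
before = {"a": False, "b": True, "c": True, "d": True, "e": True}
after = {"a": True, "b": False, "c": False, "d": False, "e": True}
diff = compare_runs(before, after)
```

Reporting the broken cases by name, not just the change in pass rate, tells you exactly which outputs to read before deciding whether to ship.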

The trap to avoid

Don't optimize your evals into a maze. Twenty real cases you actually look at beat a thousand synthetic ones you never read.
