Guardrails: What Your Agent Must Never Do
Capability without limits is a liability. A practical layering of guardrails that doesn't neuter the agent.
Guardrails: What Your Agent Must Never Do
An agent you can't trust unsupervised isn't saving you time โ it's adding a review job. Guardrails are how you earn the right to look away.
Three layers
- Prompt-level โ explicit "never" rules in the system prompt. Cheap, first line, not sufficient alone.
- Tool-level โ the strongest layer. If the agent can't call
delete_customer, no prompt injection can make it. Scope tools tightly. - Approval gates โ for irreversible or outward-facing actions (sending email, spending money, deleting data), require a human confirm.
Default to reversible
Design tools so mistakes are recoverable. "Draft" instead of "send." "Archive" instead of "delete." Most of safety is making the worst case boring.
Watch the inputs, not just the outputs
Prompt injection hides instructions inside the data your agent reads โ a web page, an email, a PDF. Treat retrieved content as untrusted. Never let fetched text silently change what tools the agent is allowed to use.
A good guardrail is invisible when things go right and decisive when they go wrong.