Safety·6 min read

Guardrails: What Your Agent Must Never Do

Capability without limits is a liability. A practical layering of guardrails that doesn't neuter the agent.

Guardrails: What Your Agent Must Never Do

An agent you can't trust unsupervised isn't saving you time — it's adding a review job. Guardrails are how you earn the right to look away.

Three layers

Prompt-level — explicit "never" rules in the system prompt. Cheap, first line, not sufficient alone.
Tool-level — the strongest layer. If the agent can't call delete_customer, no prompt injection can make it. Scope tools tightly.
Approval gates — for irreversible or outward-facing actions (sending email, spending money, deleting data), require a human confirm.

Default to reversible

Design tools so mistakes are recoverable. "Draft" instead of "send." "Archive" instead of "delete." Most of safety is making the worst case boring.

Watch the inputs, not just the outputs

Prompt injection hides instructions inside the data your agent reads — a web page, an email, a PDF. Treat retrieved content as untrusted. Never let fetched text silently change what tools the agent is allowed to use.

A good guardrail is invisible when things go right and decisive when they go wrong.

Found this useful? Share it.

𝕏 in f @

Guardrails: What Your Agent Must Never Do

Guardrails: What Your Agent Must Never Do

Three layers

Default to reversible

Watch the inputs, not just the outputs

More on Safety

AI Agents for Your Small Business: Where to Start

Connecting Tools With MCP: A Walkthrough

AI Agents in Plain English

Put this knowledge into practice

Campaign Strategist

Creator Studio Agent

Data Analyst Agent