Building Guardrails Without Breaking Agent Capabilities
November 15, 2025
There's a tension at the heart of agent design: the more capable your agent, the more dangerous it can be. The natural response is to add guardrails—constraints that prevent bad outcomes. But too many guardrails leave you with an agent that's safe but useless.
I've spent the last year learning how to build guardrails that constrain the bad without breaking the good. Here's what I've figured out.
The Guardrail Paradox
Strong agents need autonomy. Autonomy creates risk. Guardrails reduce risk. But guardrails also reduce autonomy, which reduces capability.
The naive approach—adding constraints until nothing can go wrong—also ensures nothing can go right. You end up with agents that can only operate in narrow, predictable contexts. That's not what agents are for.
Types of Guardrails
Not all guardrails are equal. I categorize them by how they constrain behavior:
Hard Constraints
Binary rules that can't be violated:
- Action blocks: "Never delete production data"
- Resource limits: "Maximum 100 API calls per task"
- Scope boundaries: "Only operate on these systems"
These are necessary but dangerous. They're rigid and can prevent legitimate actions in edge cases.
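As a minimal sketch (the `HardConstraint` and `is_allowed` names are my own, not from any particular framework), hard constraints can be modeled as predicates checked before every action:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action record; field names are illustrative.
@dataclass
class Action:
    kind: str
    target: str
    api_calls_used: int = 0

@dataclass
class HardConstraint:
    name: str
    violates: Callable[[Action], bool]  # True -> action is blocked

def is_allowed(action: Action, constraints: list[HardConstraint]) -> list[str]:
    """Return names of violated constraints; an empty list means allowed."""
    return [c.name for c in constraints if c.violates(action)]

constraints = [
    HardConstraint("no-prod-deletes",
                   lambda a: a.kind == "delete" and a.target.startswith("prod/")),
    HardConstraint("api-call-budget", lambda a: a.api_calls_used > 100),
]

print(is_allowed(Action("delete", "prod/users"), constraints))  # ['no-prod-deletes']
print(is_allowed(Action("read", "dev/users"), constraints))     # []
```

Returning the list of violated constraints, rather than a bare boolean, makes the block debuggable: you can log exactly which rule fired and why.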
Soft Constraints
Preferences that guide but don't prevent:
- Risk preferences: "Prefer conservative approaches when uncertain"
- Efficiency weights: "Value speed, but not at the cost of correctness"
- Style guidelines: "Prefer simple solutions over clever ones"
These preserve flexibility while shaping behavior. They're subtler but harder to enforce consistently.
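One way to make soft constraints concrete is to score candidate plans with preference weights instead of blocking anything outright. The feature names and weights below are illustrative assumptions, not a standard scheme:

```python
# Soft constraints as weighted preferences over candidate plans.
# Feature names and weights are illustrative, not a standard taxonomy.
PREFERENCES = {
    "risk": -2.0,        # penalize risky plans heavily
    "speed": 1.0,        # value speed...
    "correctness": 3.0,  # ...but correctness dominates
    "complexity": -0.5,  # prefer simple solutions over clever ones
}

def score(plan: dict[str, float]) -> float:
    """Higher is better; each feature is a 0-1 estimate for the plan."""
    return sum(PREFERENCES[k] * plan.get(k, 0.0) for k in PREFERENCES)

plans = [
    {"risk": 0.8, "speed": 0.9, "correctness": 0.6, "complexity": 0.7},  # fast but risky
    {"risk": 0.2, "speed": 0.5, "correctness": 0.9, "complexity": 0.2},  # conservative
]
best = max(plans, key=score)
print(best)  # the conservative, correct, simple plan wins
```

Because nothing is forbidden, the agent keeps its flexibility; the weights just tilt it toward the behavior you want, which is exactly why consistent enforcement is harder than with hard constraints.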
Approval Gates
Points where human approval is required:
- Risk-based: Require approval for high-stakes actions
- Uncertainty-based: Require approval when confidence is low
- Scope-based: Require approval for actions outside normal scope
Key Insight:
The best guardrail strategies mix all three types: hard constraints for safety-critical boundaries, soft constraints for optimization, and approval gates for judgment calls.

Designing Effective Guardrails
Principle 1: Guardrails Should Be Specific
Vague guardrails create confusion:
- Bad: "Be careful with production systems"
- Good: "Require human approval before modifying production databases"
Specificity makes guardrails enforceable and debuggable.
Principle 2: Default to Safe, Allow Override
Rather than blocking actions entirely, require explicit acknowledgment of risk:
- Default behavior is conservative
- Agents can escalate with justification
- Overrides are logged and reviewable
This preserves capability while adding friction to risky actions.
Principle 3: Context-Aware Constraints
Guardrails should adapt to context:
- Stricter in production, looser in development
- Tighter for high-stakes decisions, relaxed for experimentation
- More constraints for novel situations, fewer for routine
Static guardrails are either too tight or too loose. Dynamic guardrails match constraint level to actual risk.
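One way to make the constraint level dynamic is to score the context and bucket the result. The factors, weights, and cutoffs below are assumptions for illustration:

```python
# Map context to a constraint level; factors and cutoffs are assumptions.
def constraint_level(env: str, stakes: str, novelty: float) -> str:
    """Return 'strict', 'standard', or 'relaxed' for the current context."""
    score = 0
    score += {"production": 2, "staging": 1, "development": 0}[env]
    score += {"high": 2, "medium": 1, "low": 0}[stakes]
    score += 1 if novelty > 0.5 else 0  # novel situations get extra caution
    if score >= 4:
        return "strict"
    if score >= 2:
        return "standard"
    return "relaxed"

print(constraint_level("production", "high", 0.8))   # 'strict'
print(constraint_level("development", "low", 0.1))   # 'relaxed'
```

The point is not this particular scoring function but that the constraint level is computed from context at decision time, rather than fixed when the agent is deployed.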
Common Guardrail Mistakes
Mistake 1: Guardrails as a Substitute for Design
Adding guardrails to compensate for poor agent design. If your agent requires extensive guardrails to be safe, the problem is the agent design, not the lack of constraints.
Mistake 2: Over-Constraining
Guardrails that prevent the agent from being useful. I've seen agents with so many constraints they could only handle trivial cases—at which point, why use an agent at all?
Mistake 3: Inconsistent Application
Guardrails that apply in some contexts but not others, creating unpredictable behavior. The agent learns to work around inconsistencies rather than internalizing safe patterns.
Warning Sign:
If you find yourself constantly adding new guardrails to block specific bad behaviors, you're playing whack-a-mole. Step back and redesign the underlying system.

Testing Guardrails
Guardrails need testing as much as agent logic does:
- Positive tests: Verify guardrails block the behaviors they're meant to block
- Negative tests: Verify guardrails don't block legitimate actions
- Adversarial tests: Try to find workarounds
- Edge case tests: What happens in unusual situations?
I maintain a test suite specifically for guardrail behavior. It's caught as many bugs as my agent logic tests.
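A toy version of such a suite, covering all four categories against a deliberately simple guardrail (the `blocks` rule is illustrative, not a real policy):

```python
import unittest

def blocks(action: str) -> bool:
    """Toy guardrail: block deletes against production paths.
    Illustrative only; a real guardrail would inspect structured actions."""
    return action.startswith("delete prod/")

class GuardrailTests(unittest.TestCase):
    def test_positive_blocks_intended(self):          # positive test
        self.assertTrue(blocks("delete prod/users"))

    def test_negative_allows_legitimate(self):        # negative test
        self.assertFalse(blocks("delete dev/users"))

    def test_adversarial_workaround(self):            # adversarial test
        # An agent might rename instead of delete -- this toy rule misses it,
        # which is exactly the kind of gap adversarial tests surface.
        self.assertFalse(blocks("rename prod/users prod/_users"))

    def test_edge_case_empty_action(self):            # edge case test
        self.assertFalse(blocks(""))

unittest.main(argv=["guardrail-tests"], exit=False)
```

Note that the adversarial test here passes while still documenting a real gap; in practice, finding such a workaround means widening the guardrail, then keeping the test as a regression check.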
The Goal: Aligned Autonomy
The end goal isn't constrained agents—it's aligned agents. Agents that make good choices not because they're forced to, but because they understand what "good" means.
Guardrails are training wheels. They're necessary while you're learning, but the goal is to need them less over time as your agents become more reliably aligned with intended behavior.
Evolution Strategy
How I evolve guardrails over time:
- Start conservative: More guardrails than you think you need
- Monitor behavior: Track when guardrails activate and why
- Identify friction: Find guardrails that block legitimate actions
- Relax carefully: Remove constraints one at a time with monitoring
- Reinforce failures: Add new guardrails for problems that emerge
This creates a virtuous cycle: as agents prove reliable, guardrails relax, enabling more capability, which creates more opportunities to prove reliability.
Measuring Guardrail Effectiveness
Track metrics that tell you if guardrails are working:
- Block rate: How often do guardrails prevent actions?
- False positive rate: How often do guardrails block legitimate actions?
- Override rate: How often are guardrails overridden with justification?
- Incident rate: How often do bad outcomes occur despite guardrails?
Good guardrails have low false positive rates and incident rates. High block rates or override rates suggest your constraints are misaligned with actual needs.
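These four metrics fall out of simple counts over logged guardrail events. The event categories below are my own labels with made-up numbers, and I'm assuming overrides are a subset of blocked actions:

```python
# Compute guardrail health metrics from logged event counts.
# Category names and numbers are illustrative assumptions.
events = {
    "actions": 1000,          # total actions attempted
    "blocked": 40,            # guardrail fired
    "false_positives": 6,     # blocked, later judged legitimate
    "overrides": 10,          # proceeded anyway, with justification
    "incidents": 1,           # bad outcome despite guardrails
}

block_rate = events["blocked"] / events["actions"]
false_positive_rate = events["false_positives"] / events["blocked"]
override_rate = events["overrides"] / events["blocked"]
incident_rate = events["incidents"] / events["actions"]

print(f"block={block_rate:.1%} fp={false_positive_rate:.1%} "
      f"override={override_rate:.1%} incident={incident_rate:.2%}")
```

Watching these over time is what makes the evolution strategy above actionable: a rising false positive rate flags a guardrail to relax, and a rising incident rate flags a gap to close.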
How are you balancing safety and capability in your agent systems? I'm always interested in new guardrail patterns. Find me at matt@emmons.club.
© 2026 Matt Emmons