Building Tool Harnesses That Don't Break

May 10, 2024

I learned this lesson the hard way. My agent had been running smoothly for weeks, handling database queries and API calls without issues. Then one Tuesday morning, it started failing every single task. No code changes, no deployment—just sudden, catastrophic failure.

Turns out an external API had changed its error response format slightly. My tool harness was parsing responses in a way that assumed a specific structure, and when that structure shifted, the whole thing fell apart. The agent kept retrying, each time hitting the same parsing error, burning through tokens and getting nowhere.

This is the fragility problem with AI agents. They're only as reliable as the tools they use, and most tool integrations are surprisingly brittle.

What Is a Tool Harness, Anyway?

I think of a tool harness as the layer between your agent and the outside world. It's responsible for:

  • Translating agent intent into actual API calls or function invocations
  • Handling responses and errors in a way the agent can understand
  • Validating inputs before they cause problems
  • Providing clear feedback when things go wrong

Most implementations I see treat this as an afterthought. The agent can call tools, so we're done, right? Not quite.

The Fragility Points

In my experience, tool harnesses break in predictable ways:

  • Parsing assumptions: Expecting specific JSON structures, field names, or data types
  • Timeout handling: Not distinguishing between slow responses and complete failures
  • Error propagation: Returning raw error messages that confuse the agent
  • State management: Assuming tools are stateless when they're not
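The first of these, parsing assumptions, is the one that bit me. A minimal sketch of the difference between a brittle parse and a defensive one (the field names here are hypothetical, not from any particular API):

```python
import json

def parse_error_brittle(raw: str) -> str:
    # Brittle: assumes "error" is always a dict with a "message" field.
    payload = json.loads(raw)
    return payload["error"]["message"]

def parse_error_defensive(raw: str) -> str:
    # Defensive: tolerates the error being a dict, a string, or absent,
    # and tolerates a body that isn't JSON at all.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "unparseable response"
    error = payload.get("error")
    if isinstance(error, dict):
        return str(error.get("message", "unknown error"))
    if isinstance(error, str):
        return error
    return "unknown error"
```

The brittle version is exactly what broke that Tuesday morning: the upstream API changed `error` from an object to a plain string, and the `["message"]` lookup started throwing.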

Defensive Patterns That Actually Work

After that Tuesday morning disaster, I rebuilt my tool harness with three principles in mind:

1. Response Normalization

Never pass raw API responses directly to your agent. Always normalize them into a consistent format your agent can rely on. This means:

    {
      "success": true/false,
      "data": {...},                       // Always same structure
      "error": "human-readable message",
      "retry_recommended": true/false,
      "context": {"key": "value"}          // Relevant metadata
    }

The key is consistency. Whether the API returns a 200, 404, or 500, your agent sees the same structure. It doesn't need to understand HTTP status codes—it needs to know if it should try again, try something different, or give up.
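A normalizing wrapper along these lines might look like the following sketch. It uses Python's standard-library `urllib`; the envelope fields match the structure above, and the retry heuristic (retry on 5xx and network errors, not on 4xx) is my assumption, not a universal rule:

```python
import json
import urllib.request
import urllib.error

def call_tool(url: str, timeout: float = 10.0) -> dict:
    """Wrap an HTTP call so the agent always sees the same envelope."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.loads(resp.read())
        return {"success": True, "data": data, "error": None,
                "retry_recommended": False, "context": {"status": resp.status}}
    except urllib.error.HTTPError as e:
        # 5xx is usually transient; 4xx means the request itself is wrong.
        return {"success": False, "data": None,
                "error": f"server returned {e.code}",
                "retry_recommended": e.code >= 500,
                "context": {"status": e.code}}
    except (urllib.error.URLError, json.JSONDecodeError, TimeoutError) as e:
        # Network failures and unparseable bodies both get the same envelope.
        return {"success": False, "data": None,
                "error": f"call failed: {e}",
                "retry_recommended": True, "context": {}}
```

Note that `HTTPError` must be caught before `URLError` (it's a subclass), and that a malformed body is treated exactly like a network failure: the agent never sees a raw exception.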

2. Graceful Degradation

Tools should fail in ways that preserve agent options. If an API call fails, the harness should provide fallback data or suggest alternatives rather than just returning an error.

For example, if a user lookup API fails, my harness returns:

  • Cached data if available (with staleness noted)
  • Partial data from other sources
  • Clear explanation of what's missing
  • Suggested alternative approaches
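A degradation path like this can be sketched with a simple in-memory cache (a real harness would likely use Redis or similar; the `/profiles` alternative named in the failure message is hypothetical):

```python
import time

# Hypothetical in-memory cache: user_id -> (fetched_at, user record).
_cache: dict[str, tuple[float, dict]] = {}

def lookup_user(user_id: str, fetch) -> dict:
    """Try the live API; on failure, degrade to cached data with staleness noted."""
    try:
        user = fetch(user_id)
        _cache[user_id] = (time.time(), user)
        return {"success": True, "data": user, "stale": False, "error": None}
    except Exception as e:
        cached = _cache.get(user_id)
        if cached:
            fetched_at, user = cached
            age = time.time() - fetched_at
            return {"success": True, "data": user, "stale": True,
                    "error": f"live lookup failed ({e}); "
                             f"serving cache from {age:.0f}s ago"}
        # No fallback data at all: fail with a suggested alternative.
        return {"success": False, "data": None, "stale": None,
                "error": f"live lookup failed ({e}) and no cache available; "
                         "consider the /profiles batch endpoint instead"}
```

The point is that "success with stale data" is a distinct, labeled outcome: the agent can decide whether staleness matters for the task at hand instead of being forced into a hard failure.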

3. Input Validation with Feedback

Validate inputs before making external calls, and when validation fails, tell the agent exactly what's wrong in terms it can act on.

Instead of "Invalid input," return "The date parameter must be in YYYY-MM-DD format. You provided 'yesterday'." The agent can then fix its approach rather than guessing.
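The date example above might be implemented as a small validator that returns the actionable message rather than raising a generic error (a sketch; the parameter name is taken from the example):

```python
import re

# Shape check only: accepts YYYY-MM-DD (full calendar validation is out of scope).
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def validate_date_param(value: str) -> tuple[bool, str]:
    """Return (ok, message); on failure the message names the fix, not just the fault."""
    if DATE_RE.match(value):
        return True, "ok"
    return False, (f"The date parameter must be in YYYY-MM-DD format. "
                   f"You provided '{value}'.")
```

Echoing the offending value back is the part that matters: the agent can see the gap between what it sent and what was expected, and correct course in one step.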

The Retry Problem

One pattern I see constantly: agents stuck in retry loops. They call a tool, it fails, they retry with the exact same parameters, it fails again, repeat ad infinitum.

Your harness needs to break these cycles. I implement retry budgets—each tool invocation gets a limited number of retries, and when those are exhausted, the harness returns a definitive failure with alternative suggestions.
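A retry budget can be as simple as a bounded loop that, once exhausted, returns a definitive failure with `retry_recommended` forced off (a sketch; the backoff base is illustrative):

```python
import time

def call_with_budget(tool, args: dict, max_retries: int = 3) -> dict:
    """Retry a flaky tool a bounded number of times, then fail definitively."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return {"success": True, "data": tool(**args), "error": None}
        except Exception as e:
            last_error = e
            # Exponential backoff; small base here for illustration.
            time.sleep(0.01 * 2 ** attempt)
    return {"success": False, "data": None,
            "error": f"gave up after {max_retries} attempts: {last_error}",
            "retry_recommended": False,  # budget exhausted: do NOT retry again
            "suggestion": "try a different tool or relax the query"}
```

The explicit `retry_recommended: False` is what breaks the cycle: the agent is told, in the same envelope it always sees, that repeating the call is pointless and it should change strategy.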

Testing Tool Harnesses

Here's something I wish I'd started doing earlier: chaos testing for tool harnesses. I have a test suite that deliberately introduces failures:

  • Random API delays and timeouts
  • Malformed response data
  • Unexpected status codes
  • Rate limiting and quota exceeded errors

If my harness can handle these gracefully, it can probably handle whatever production throws at it.
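A minimal version of this kind of chaos test can be written with a seeded fake API, so every failure mode from the list above gets injected deterministically (everything here is a self-contained sketch, not my actual test suite):

```python
import json
import random

def chaotic_api(seed: int):
    """Simulate an unreliable API: timeouts, malformed bodies, odd status codes."""
    rng = random.Random(seed)
    def call() -> tuple[int, str]:
        roll = rng.random()
        if roll < 0.25:
            raise TimeoutError("simulated timeout")
        if roll < 0.50:
            return 200, "{not valid json"            # malformed body
        if roll < 0.75:
            return 429, '{"error": "rate limited"}'  # quota exceeded
        return 200, json.dumps({"data": {"ok": True}})
    return call

def harness(call) -> dict:
    """The harness under test: must return the normalized envelope no matter what."""
    try:
        status, body = call()
        payload = json.loads(body)
    except (TimeoutError, json.JSONDecodeError) as e:
        return {"success": False, "error": str(e), "retry_recommended": True}
    if status != 200:
        return {"success": False,
                "error": payload.get("error", f"HTTP {status}"),
                "retry_recommended": status in (429, 500, 502, 503)}
    return {"success": True, "data": payload["data"], "error": None}

# Chaos test: every injected failure must still yield the envelope, never an exception.
for seed in range(100):
    result = harness(chaotic_api(seed))
    assert "success" in result and "error" in result
```

Seeding the randomness is the design choice worth calling out: when a chaos run does expose a bug, the failing seed reproduces it exactly.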

The Bigger Picture

Robust tool harnesses are what separate toy agents from production systems. The agent's intelligence doesn't matter if its connection to the world is fragile.

Since rebuilding my harness with these patterns, I've had zero outages caused by external API changes. The agent still fails sometimes—that's inevitable—but the failures are recoverable and debuggable.

Where to Start

If you're building agents with tool access, audit your current harness:

  • Where do you make assumptions about external systems?
  • What happens when those assumptions are violated?
  • Can your agent recover from tool failures, or does it get stuck?

Fix the fragility points before they cause problems. Your future self will thank you.


I'm always interested in hearing about how others are building robust agent systems. What patterns have worked for you? Reach out at matt@emmons.club.

© 2026 Matt Emmons