The Harness Is the Product
March 27, 2026
At the turn of the year, Jensen Huang proudly proclaimed 2025 the "year of AI agents," a sentiment echoed by Sam Altman and other tech leaders. Everyone throws around the word "agent" these days, but what does it actually mean?
What Is an Agent?
An agent is not just an LLM with tools attached. In my view, a true agent has four essential properties:
- Perception: It can sense its environment. Read files, parse errors, examine test output, consume logs.
- Reasoning: It plans multiple steps toward a goal you define.
- Action: It can modify the world. Edit code, run commands, write files.
- Feedback loop: It observes results and adapts. Retry, self-correct, change strategy.
The key distinction: an LLM responds. An agent pursues. The LLM gives you text. The agent iterates, self-corrects, and keeps going until the job is done.
Think about the evolution of coding tools. Autocomplete tools had no agency. They predicted the next tokens, reacting to immediate context. Pair programming tools had goals but needed supervision at every step. Modern tools like Claude Code and Codex unlock real agency. They iterate independently.
Model Plus Harness
An agent is not just a model. It's a model plus a harness. The model provides reasoning, code generation, pattern recognition, knowledge. The harness provides everything else:
- Tools to read, edit, search, and execute
- Context management and persistent memory
- Curated system prompts
- Execution loop and retry logic
- Success criteria and termination conditions
The model gives you intelligence. The harness gives you agency.
The secret:
The harness matters more than the model.
The Evidence
HAL CORE-Bench evaluates agents on scientific reproducibility tasks. Claude Opus 4.5 scored 42% with the standard CORE-Agent scaffold. The same model scored 78% with the Claude Code scaffold. That's nearly double, just from a better harness. With further manual adjustments, the score climbed to 95%. One benchmark co-creator declared it "solved."
The pattern repeats across benchmarks. On SWE-Bench Pro, a basic scaffold achieves 23%. An optimized scaffold with the same model hits 45%. That 22-point swing dwarfs the gap between any two frontier models. On SWE-Bench Verified, switching scaffolds produces up to 15-point differences.
Even tiny harness details move the needle dramatically. In February 2026, an engineer named Can Bölük tested three edit formats: patch (diff-style), str_replace (find and swap), and "hashline" (reference lines by content hash). Across sixteen models, hashline matched or beat alternatives for most. One model went from 6.7% to 68.3%. A ten-fold improvement from a formatting choice.
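The hashline idea can be sketched in a few lines. The exact format Bölük tested isn't reproduced here, so the hash length and edit schema below are assumptions for illustration:

```python
import hashlib

def line_hash(line: str, n: int = 8) -> str:
    """Short content hash used to address a line (8 hex chars is an assumption)."""
    return hashlib.sha256(line.encode()).hexdigest()[:n]

def apply_hashline_edit(text: str, target_hash: str, replacement: str) -> str:
    """Replace the unique line whose content hash matches target_hash."""
    lines = text.splitlines()
    matches = [i for i, line in enumerate(lines) if line_hash(line) == target_hash]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching line, got {len(matches)}")
    lines[matches[0]] = replacement
    return "\n".join(lines)

# The model names a line by its hash instead of counting line numbers,
# so edits stay valid even when surrounding lines shift.
src = "def add(a, b):\n    return a - b"
fixed = apply_hashline_edit(src, line_hash("    return a - b"), "    return a + b")
```

The appeal is robustness: a content hash still resolves after unrelated insertions, where a line number would silently point at the wrong line.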
Meanwhile, when you control for the scaffold, the models themselves are remarkably close. On SWE-Bench Verified, the top six models span just 7 percentage points. On SWE-Bench Pro, the top six span 4.9 points. Everyone has a good model.
If you're obsessing over which model to use, you might be optimizing the wrong variable.
The Agentic Loop
So what's in a harness? Five components:
- System prompt: Instructions, persona, constraints that shape the model's reasoning
- Tool definitions: What the agent can do—read, search, edit, execute
- Context management: What information is available when, how history gets compressed
- Execution loop: How the agent plans, acts, observes, iterates
- Success criteria: How the agent decides it's finished—the most underrated component
The loop is simple: observe, think, act, verify. Repeat until done. The agent reads context, makes surgical edits, runs tests, checks output. If something fails, it tries again. It doesn't give up after one attempt and ask you what to do. It decides when it has finished.
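The loop above can be sketched in a few lines. Everything here is illustrative: the `model` callable stands in for an LLM call, and the tool names, action format, and step budget are assumptions, not any product's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    model: Callable[[str], str]             # think: observation -> proposed action
    tools: dict[str, Callable[[str], str]]  # act: e.g. "read", "edit", "test"
    is_done: Callable[[str], bool]          # verify: success criterion on last result
    max_steps: int = 10                     # termination guard

    def run(self, goal: str) -> str:
        observation = goal
        for _ in range(self.max_steps):
            action = self.model(observation)      # think
            name, _, arg = action.partition(":")  # crude action parsing (assumed format)
            result = self.tools[name](arg)        # act
            if self.is_done(result):              # verify
                return result
            observation = result                  # observe, then iterate
        raise RuntimeError("step budget exhausted without meeting success criterion")
```

Note that four of the five harness components appear explicitly, and the system prompt would hide inside the `model` callable. The agent, not the user, decides when to stop.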
Principles for Agentic Coding
The harness is half the equation. The other half is how you use it. Here's what I've learned:
Define Success Specifically
The agent must decide when it has finished. Make that decision unambiguous. Vague goals produce vague results.
- Plan first: Have the agent outline its approach before writing code. Review the plan together. A good plan saves iterations.
- Use tests: "Make this pass the tests" is unambiguous. "Make this work" is not. Test-driven development gives the agent a verification loop.
- Prompt specifically: Say what you mean. Every unclear detail compounds. Invest two minutes in a good prompt to save twenty minutes of back-and-forth.
- Let it ask questions: When you're not sure what you want, have the agent ask clarifying questions first. Agents surface edge cases and ambiguities you hadn't considered.
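The "use tests" advice translates directly into a machine-checkable success criterion. A minimal sketch, assuming a `pytest`-based project (substitute your own runner):

```python
import subprocess

def tests_pass(cmd: tuple[str, ...] = ("pytest", "-q")) -> bool:
    """Unambiguous success criterion: the suite either passes or it doesn't.
    The default command is an assumption; use your project's test runner."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0
```

A predicate like this can serve as the harness's termination condition: "iterate until `tests_pass()` returns True" is a goal the agent can verify without asking you.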
Manage Your Context
The context window is the agent's working memory. It fills up. Long conversations accumulate noise, errors, outdated assumptions. The agent loses track.
Even with 1-million-token windows, performance degrades at scale. Anthropic's own benchmarks show Opus 4.6 dropping nearly 14 percentage points over a 750K token span. Just because you can use 1 million tokens doesn't mean you should.
- AGENTS.md or CLAUDE.md for persistent memory loaded into every session. Keep these concise. Let the agent write to them when conventions crystallize.
- Skills for reusable prompts in specific workflows. Things you need sometimes but not every session. Write once, reuse forever.
- Fresh agents for each feature or when stuck. Polluted context breeds confused reasoning. Start clean.
- Lean repos. Dead code and old experiments pollute search results. Give the agent less noise to wade through.
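As a concrete illustration, a persistent-memory file might look like the following. The contents are invented; the point is brevity and specificity:

```markdown
# AGENTS.md

## Conventions
- Python 3.12, type hints everywhere, `ruff` for lint and format
- Tests live in `tests/`; run with `pytest -q`

## Gotchas
- The data loader caches aggressively; clear the cache after schema changes
```

A few dozen lines like these get loaded every session, so each line should earn its context tokens.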
Write Things Down
Agents lose context between sessions. If you made progress, persist it. Ask the agent to write structured summaries: experiment logs, architecture decisions, status documents. If you don't write it down, the next session starts from zero.
In my research repo, I maintain EXPERIMENTS.md, PAPER_STATUS.md, and auxiliary files for detailed analysis. The compound effect over months is enormous.
Force Tool Use Over Guesses
If a tool can check it, search it, compile it, test it, or validate it, the agent should use it. Don't let the agent hallucinate file contents. Make it read the file. Don't let it guess code works. Make it run tests.
Tools ground the agent in reality. They reduce hallucination. The agent inherits your full terminal. Anything you can do from the command line, it can do too.
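A minimal `execute` tool is little more than a subprocess wrapper. A sketch only; a real harness would sandbox the command and truncate long output:

```python
import subprocess

def execute(command: str, timeout: int = 60) -> str:
    """Run a shell command and return exit status plus combined output.
    This is what grounds the agent: instead of guessing whether code
    works, it runs the code and reads the result."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
```

Feeding the exit code and output back as the next observation is exactly the feedback-loop property from the definition of an agent above.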
Iterate
Don't expect perfection on the first pass. You'll be disappointed. But through iteration, you converge to a stable solution. Push back when the agent gets something wrong. It will reconsider and improve. Each iteration narrows the gap between what you asked and what you got.
The agentic loop is your friend. Use it.
Conclusion
If you take one thing from this essay, let it be this: the gap between models is small, but the gap between how people use them is not. The differentiator is no longer the model itself—it's everything around it. The harness. The context engineering. The curated documentation. The clarity of thought in a prompt. These things are still within your control.
For a long time, the bottleneck to progress was knowledge and technical skill. Now, the bottleneck is shifting toward taste, specification, and judgment. This shift may be uncomfortable. I have no idea what the bottleneck will be a year from now. Agentic coding may look completely different. But in the near term, I'm confident that investing in learning how to steer these agents will have positive compound effects.
You're welcome to disagree—I'd likely agree with many points against these tools. The important thing is that you think about them, understand what they do, and know why you are or are not using them.
Thanks for reading. Questions or thoughts? Reach out at matt@emmons.club.
© 2026 Matt Emmons