"The harness is doing the heavy lifting. The model is the easy part to swap." — Martin Fowler, Harness Engineering
Why this term suddenly shows up everywhere
Until mid-2025, almost nobody said "harness." It was coding agent, copilot, assistant.
Then Anthropic published Effective harnesses for long-running agents, OpenAI followed with Harness engineering: leveraging Codex, and Martin Fowler turned it into a discipline of its own.
In December 2025, Meta acquired the startup Manus for around two billion dollars, per industry reports. What Meta wanted wasn't the model. It was the harness.
The term comes from horse-drawn work. A harness turns horsepower into useful work. An agent harness turns raw model intelligence into useful software work.
Without this layer, a language model is just an answer machine. With it, it becomes a digital teammate that grinds through tasks for hours.
What an agent harness actually is
Our preferred analogy from daily use at IJONIS: the model is the engine. The harness is everything else in the car. Chassis, suspension, gearbox, aerodynamics, driver's seat, steering. You can drop a Formula 1 engine into a bad car — it still won't win. But with a Formula 1 car you can swap the engine and the car stays fast. The car makes the difference, not the block under the hood.
For more technically minded readers, there's a second analogy: the model is the Central Processing Unit (CPU). The context window is Random Access Memory (RAM). The harness is the operating system. The agent is the application.
Swap only the model and you've swapped the CPU or the engine. The car, the program, the workflow on top stays exactly as good or as bad as it was.
A harness isn't a library like LangChain. Libraries are building blocks you use to build a harness.
The harness itself is the finished system: Codex Command Line Interface (CLI), Claude Code, Cursor, Pi. A product you install and run. Compared to raw Application Programming Interface (API) calls, a harness ships with the following pieces.
Seven components make a harness:
- The loop. Model says something, calls a tool, observes the result, picks the next move. This pattern is called ReAct.
- Tool calls. Reading files, running shell commands, web search, calling Model Context Protocol (MCP) servers.
- Context management. System prompts, history compaction, retrieval. This is where the agent either remembers what mattered 50 steps ago or doesn't.
- Memory and state. Scratchpads, todo lists, persistent notes like the CLAUDE.md pattern.
- Permissions. Sandbox, allowlists, approval gates. What can the agent do without asking?
- Observability. Logging, tracing, evals. Without it you fly blind.
- Control flow. Retries, budget caps, stop conditions. So the agent doesn't burn tokens in a loop forever.
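The seven components above can be sketched in a few dozen lines. This is a minimal illustration, not a production harness: `call_model` is a hypothetical stand-in for any LLM API client, and the single stub tool exists only to show the ReAct shape of the loop.

```python
# Minimal ReAct-style harness loop: the model proposes an action,
# the harness executes the tool and feeds the observation back.
# `call_model` is a hypothetical placeholder for a real LLM API call.

def call_model(history):
    # Fake model: one step of tool use, then a final answer.
    if not any(msg["role"] == "tool" for msg in history):
        return {"action": "tool", "tool": "read_file", "args": {"path": "notes.txt"}}
    return {"action": "final", "answer": "done"}

TOOLS = {
    "read_file": lambda args: f"<contents of {args['path']}>",  # stub tool
}

def run_agent(task, max_steps=10):
    history = [{"role": "user", "content": task}]    # context management
    for step in range(max_steps):                    # budget cap
        decision = call_model(history)               # the loop
        if decision["action"] == "final":            # stop condition
            return decision["answer"]
        observation = TOOLS[decision["tool"]](decision["args"])   # tool call
        history.append({"role": "tool", "content": observation})  # memory/state
        print(f"step {step}: {decision['tool']} -> {observation}")  # observability
    return "stopped: budget exhausted"               # control flow
```

Everything around the `call_model` line is the harness; that one line is the model.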
Why the harness matters more than the model
That the harness matters more than the model sounds counterintuitive at first. Models get the headlines, harnesses get a footnote. The data tells a different story: a good harness with a mid-tier model beats a bad harness running the top model. The harness impact is often larger than the jump from one model generation to the next.
The Terminal-Bench 2.0 leaderboard, maintained by a Stanford and Laude Institute collaboration, pits harness-model pairs against each other. The current top scores show how much the harness shapes the result:
| App | Model | Score |
|---|---|---|
| Codex | GPT-5.5 | 82.0% |
| ForgeCode | GPT-5.4 | 81.8% |
| TongAgents | Gemini 3.1 Pro | 80.2% |
| ForgeCode | Claude Opus 4.6 | 79.8% |
| ForgeCode | Gemini 3.1 Pro | 78.4% |
| Simple Codex | GPT-5.3-Codex | 75.1% |
ForgeCode shows up on the same leaderboard with three different models. The spread between those three scores is 3.4 percentage points.
Switch Codex from GPT-5.5 to GPT-5.3-Codex and the harness from Codex to Simple Codex, and the score drops nearly seven points. The harness explains more variance than the model swap.
The second data source is SWE-bench Verified, the human-validated subset of the Software Engineering Benchmark (SWE-bench). The same models move up and down by ten or more percentage points depending on which harness wraps them. If you only compare models, you're comparing the wrong layer.
The leading commercial harnesses in 2026
| App | Maker | Form | Model-agnostic | Pricing |
|---|---|---|---|---|
| Claude Code | Anthropic | CLI + IDE | Limited | Pro/Max plan or API |
| Codex CLI | OpenAI | CLI + Cloud | Limited | ChatGPT plan or API |
| Cursor | Anysphere | IDE + Cloud Agents | Yes | from $20/mo + usage |
| Sourcegraph Amp | Sourcegraph | IDE | Yes | Subscription |
| Devin | Cognition | Fully autonomous cloud | Limited | from $500/mo |
| Zed Agent | Zed Industries | Editor-native | Yes | Editor free, agent paid |
| Gemini CLI | Google | CLI | No | Free tier available |
Claude Code is the harness with the strongest coherence over long tasks. The CLAUDE.md pattern, subagents, hooks, skills, and plugins make it the most extensible platform. We use it at IJONIS to build our products and our agentic workflows.
Codex CLI shines on short, well-scoped tasks running in cloud containers. Fire and forget. It's the current number one on Terminal-Bench 2.0.
Cursor is the best Integrated Development Environment (IDE)-first harness. Version 3 added cloud VMs, parallel Agent Tabs, and Design Mode. If you live in the editor, Cursor is the right answer. We covered the head-to-head with Claude Code here.
Sourcegraph Amp combines code-review agents with composable sub-agents like Oracle, Librarian, and Painter. Strong if you have a large codebase and an existing code-search stack.
Devin takes the opposite approach: full autonomy without an editor. Pricey, but interesting for teams that want to delegate tickets directly to an agent.
Zed Agent is the native agent in Zed, the fast Rust-based editor. Worth it if you live in Zed already.
Gemini CLI is Google's answer. Free tier available, locked to Gemini models.
Open source: Pi, Goose, Aider, and the rest
The open-source side is more alive in 2026 than ever. Four reasons to pick OSS: data privacy, model freedom, extensibility, no vendor lock-in.
Pi by Earendil Inc., built by Mario Zechner, is the radically extensible option. MIT license, 15+ AI providers, mid-session model switching. Pi deliberately ships fewer defaults than Claude Code (no subagents, no plan mode) and instead hands you primitives to build with. If you want a harness your team can shape, this is where you start.
Goose by Block is MCP-native and not limited to coding. It works for general agent tasks, from DevOps to data pipelines.
Aider is the classic. Paul Gauthier defined the terminal pair-programming pattern with it. Auto-commits, repo map, model-agnostic. If you want to start minimal, start here.
Cline and its fork Roo Code cover the widest IDE surface: VSCode, JetBrains, Neovim, Emacs. Bring your own model, no subscription required.
Crush by Charm has the best terminal UI on the market. If you know Bubble Tea, you know what that means. opencode is the open-source clone of Claude Code, model-agnostic and community-driven. Continue runs as an IDE extension with custom assistants.
When open source is the right call
When data privacy is non-negotiable (self-hosted models), when you need to bend the tool to internal conventions, or when per-seat license costs blow the budget. Pi and Aider run with any model, including local ones via Ollama.
How harnesses are benchmarked against each other
Three benchmarks dominate harness evaluation in 2026:
- Terminal-Bench — the most important harness benchmark. Stanford and the Laude Institute maintain 89 tasks in version 2.0, spanning software engineering, security, ML, and data science. Every submission is a harness-model pair, not just a model. That's exactly what makes the benchmark useful.
- SWE-bench Verified (Software Engineering Benchmark) — tests resolution of real GitHub issues. Vendors publish harness-model combinations there too. The spread between different harnesses on the same model regularly exceeds ten percentage points.
- Vendor self-evals — Anthropic, OpenAI, and Cursor publish internal benchmarks on code-edit accuracy, tool-use success, and long-horizon consistency. Useful, but not independent.
Terminal-Bench remains the gold standard because it's external and transparent.
What a harness does well
Advantages
- Long tasks without losing coherence, thanks to memory and context compaction
- Tool use without rebuilding the loop and recovery logic yourself
- Safety guardrails and approval gates built in
- Observability for debugging and production-grade evals
- Faster onboarding than a self-built loop
- Quick model switching without code changes (on agnostic harnesses)
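Context compaction, mentioned twice above, is one of the less obvious advantages. A hypothetical minimal version: keep the last few turns verbatim and collapse everything older into one summary message. A real harness would ask the model to write that summary; the stub here just counts the turns it drops.

```python
def compact_history(history, keep_last=4):
    """Collapse older turns into one summary message, keep recent turns verbatim.
    A real harness summarizes with the model itself; this stub only counts."""
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    summary = {"role": "system",
               "content": f"[summary of {len(old)} earlier turns]"}
    return [summary] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact_history(history)
print(len(compacted))  # 5: one summary message plus the last four turns
```

The trade-off is visible even in this sketch: the agent keeps working past its context limit, but anything the summarizer drops is gone, which is exactly where memory drift starts.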
Where harnesses break
Drawbacks
- Vendor lock-in via proprietary tool formats, hooks, and prompts
- Opaque system prompts shape output but aren't visible to you
- Token costs explode through loops, sub-agents, and context compaction
- Auto tool-use can run destructive commands without approval
- Memory drift: the agent misremembers or carries stale context across sessions
- Harness engineering is an unstable discipline: Manus rebuilt theirs five times
Take permissions seriously
A harness with shell access and no approval gates is a one-click privilege escalation. We covered this in detail: Your AI Agent Has More Permissions Than Your CTO.
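One way to sketch such a gate, assuming a simple command allowlist: anything not on the list requires explicit human confirmation before it runs. The allowlist contents and the prompt wording are illustrative assumptions.

```python
import shlex
import subprocess

ALLOWED = {"ls", "cat", "git", "echo"}  # commands the agent may run unprompted

def run_with_gate(command, approve=input):
    """Execute a shell command only if it is allowlisted or explicitly approved."""
    binary = shlex.split(command)[0]
    if binary not in ALLOWED:
        answer = approve(f"Agent wants to run '{command}'. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return "blocked by approval gate"
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return result.stdout

print(run_with_gate("rm -rf /tmp/scratch", approve=lambda prompt: "n"))
# → blocked by approval gate
```

Note that the default answer is "no": a harness should fail closed, not open.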
Which harness for which team?
There's no winner. There are only good fits between team, use case, and tool.
- Engineering teams with their own IT and a CLI preference do best with Claude Code. CLAUDE.md, hooks, and skills hand you full control.
- IDE-centric teams pick Cursor or Cline. If you can't leave the JetBrains world, take Cline.
- Full autonomy and cloud execution come from Devin or Codex Cloud. Drop in a task, get a pull request out.
- Open source required, or privacy-critical? Pi, Aider, or Goose. Pi for extensibility, Aider for simplicity, Goose for non-coding tasks.
- General agents beyond coding (DevOps, research, data pipelines) belong in Goose or a self-built harness on top of MCP.
Harness engineering becomes its own discipline
Martin Fowler, Addy Osmani, and Adnan Masood all published their own articles on "harness engineering" in 2026. The term isn't a buzzword anymore. It describes the practice of bending a harness to your organization: CLAUDE.md conventions, internal MCP servers, hooks for compliance, skills for repeated tasks.
At IJONIS we build exactly that layer. Writing code became a commodity in 2026. What's scarce is the architecture above it. Configure a harness well and you'll outperform whoever subscribed to the most expensive model.
Frequently asked questions about agent harnesses
What is the difference between an agent harness and a framework like LangChain?
LangChain is a library of building blocks. A harness like Claude Code or Cursor is the finished product you install and run. A harness can use LangChain internally, but it isn't a framework itself.
Which harness is best in 2026?
There's no single winner. Codex CLI currently leads Terminal-Bench 2.0 at 82.0 percent. Claude Code is strongest on long-running tasks. Cursor wins inside the IDE. Pi and Aider win when open source is mandatory.
Should we build our own harness?
Only if you have very specific compliance or data-privacy requirements. For most teams, an existing harness with custom CLAUDE.md conventions, hooks, and MCP servers is the right call.
What does it cost to run a harness?
Subscription harnesses run from $20 to $500 per seat per month. Token costs sit on top and can hit $50 to $200 per person per month under heavy use. Open-source harnesses have zero license cost, only API spend.
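Those figures can be sanity-checked with a back-of-the-envelope calculation. The usage numbers and the per-million-token price below are illustrative assumptions, not vendor quotes.

```python
def monthly_token_cost(requests_per_day, tokens_per_request,
                       price_per_million, workdays=22):
    """Rough monthly API spend for one seat; all inputs are assumptions."""
    tokens = requests_per_day * tokens_per_request * workdays
    return tokens * price_per_million / 1_000_000

# e.g. 20 agent runs a day, ~100k tokens each, at an assumed $3 per million tokens
cost = monthly_token_cost(20, 100_000, 3.0)
print(f"${cost:.0f}/month")  # → $132/month
```

Loops, sub-agents, and compaction multiply `tokens_per_request`, which is why heavy use lands at the upper end of the range.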
What role does MCP play?
MCP (Model Context Protocol) became the de facto standard in 2026 for how harnesses connect to tools and data sources. Nearly every modern harness speaks MCP natively. That makes switching harnesses easier because your own MCP servers stay portable.
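As an illustration of that portability, a project-level MCP configuration might look like the fragment below. The exact file name and schema vary by harness (Claude Code reads `.mcp.json`; others differ), and the `internal-docs` server and its environment variable are hypothetical.

```json
{
  "mcpServers": {
    "internal-docs": {
      "command": "python",
      "args": ["-m", "internal_docs_mcp"],
      "env": { "DOCS_INDEX_PATH": "/srv/docs/index" }
    }
  }
}
```

Because the server itself is just a process speaking MCP, the same entry can move between harnesses with little more than a rename of the config file.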
Which harness we use at IJONIS
The bottom line: don't compare GPT-5.5 to Claude Opus 4.6. Compare Codex CLI to Claude Code to Cursor to Pi. The layer that decides success or failure is called the harness. Models swap in hours. A harness your team masters takes months to build.
At IJONIS in Hamburg we use Claude Code as our main tool, Cursor for IDE-heavy work, and Pi for experiments. If you're unsure which harness fits your stack, talk to us. We know where they break from daily use.


