Foundations of Harness Engineering — harn.app Knowledge Base

Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems

UIUC / Meta / Stanford

The field-defining 2026 survey. Ning, Tieu, Fu and co-authors reframe code from a generated artifact into the operational substrate of agentic AI — the medium through which agents reason, act, observe, and verify. Organizes the literature into three connected layers: harness interface (code for reasoning, acting, and environment modeling), harness mechanisms (planning, memory, tool use, control through the Plan-Execute-Verify loop, and harness optimization), and scaling the harness (multi-agent orchestration over shared code-centric substrates). Closes with a research agenda: harness-level evaluation beyond task success, self-evolving harnesses without regression, transactional shared state, human-in-the-loop as durable harness state, and multimodal code-harness systems. This is the academic spine of everything else on this page.

Key Takeaways

Reliable harnesses share four properties: executable, inspectable, stateful, governed
Plan-Execute-Verify unifies planning, execution, and debugging as one cybernetic loop
Agentic Harness Engineering (AHE) treats the harness itself as an object of optimization, edited by an Evolution Agent under governed mutation
Multi-agent systems converge faster when the shared substrate is executable (tests, repos, traces) rather than implicit conversation history
Open problem: evaluators that capture the intended task, not just executable proxies (oracle adequacy)
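
The Plan-Execute-Verify loop named in the takeaways can be illustrated with a minimal sketch. This is not code from the survey: the function names, the `LoopResult` type, and the toy planner are all assumptions chosen for illustration. The point is only that planning, acting, and checking share one loop and one trace.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    succeeded: bool
    attempts: int
    trace: List[str] = field(default_factory=list)


def plan_execute_verify(
    plan: Callable[[List[str]], str],
    execute: Callable[[str], str],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> LoopResult:
    """Repeat plan -> execute -> verify until the check passes or attempts run out."""
    trace: List[str] = []
    for attempt in range(1, max_attempts + 1):
        step = plan(trace)        # the planner sees the trace of earlier attempts
        output = execute(step)    # acting yields an observable result
        ok = verify(output)       # verification is an executable check, not a guess
        trace.append(f"attempt={attempt} step={step!r} output={output!r} ok={ok}")
        if ok:
            return LoopResult(True, attempt, trace)
    return LoopResult(False, max_attempts, trace)


# Hypothetical stand-ins for a model-backed planner, a sandbox, and a test suite.
def toy_plan(trace: List[str]) -> str:
    return str(len(trace) + 1)    # propose a larger number after each failure


def toy_execute(step: str) -> str:
    return str(int(step) * 10)


def toy_verify(output: str) -> bool:
    return int(output) >= 20


result = plan_execute_verify(toy_plan, toy_execute, toy_verify)
print(result.succeeded, result.attempts)
```

In a real harness the three callables would wrap a model call, a sandboxed executor, and a test runner; the trace is what makes the loop inspectable.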

Harness Engineering: Leveraging Codex in an Agent-First World

OpenAI

OpenAI's field report on building a large-scale application with Codex. The key insight: when you move from one-shot code generation to a persistent agent working inside your codebase, the engineering shifts from "how to prompt" to "how to constrain." They describe using architectural constraints (enforced directory structure, forbidden patterns), repo-local instructions that persist across sessions, browser-based validation loops, and telemetry to understand where the agent struggles. The article demonstrates that building reliable agent systems is fundamentally an infrastructure problem -- the model is the easy part.

Key Takeaways

Architectural constraints beat prompt instructions for consistency
Repo-local instructions are the primary interface to the agent
Browser validation creates an automatic feedback loop
Telemetry reveals harness gaps faster than manual review
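
One way to picture "architectural constraints beat prompt instructions" is a check that fails mechanically when a forbidden pattern appears, instead of relying on the agent to remember a rule. The sketch below is an assumption-laden illustration, not OpenAI's tooling: the rules, the `app.db` module name, and the file path are invented for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rules (pattern -> reason); not taken from the OpenAI report.
FORBIDDEN_PATTERNS: Dict[str, str] = {
    r"\bprint\(": "use the project logger instead of print",
    r"^from\s+app\.db\s+import": "UI code must not import the database layer",
}


def check_source(path: str, source: str) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, reason) for every forbidden pattern found."""
    violations: List[Tuple[str, int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, reason in FORBIDDEN_PATTERNS.items():
            if re.search(pattern, line):
                violations.append((path, lineno, reason))
    return violations


sample = "from app.db import session\n\ndef render():\n    print('hello')\n"
violations = check_source("ui/view.py", sample)
for path, lineno, reason in violations:
    print(f"{path}:{lineno}: {reason}")
```

Run as a pre-commit hook or CI step, a check like this gives the agent the same feedback on every session, which a prompt instruction cannot guarantee.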

Effective Harnesses for Long-Running Agents

Anthropic

Anthropic's definitive guide to making agents work across multiple context windows. Introduces the concept of "initializer agents" that set up the working environment before the main agent begins. Covers feature lists as a structured decomposition format, init.sh scripts that establish the build/test/lint cycle, self-verification patterns where the agent checks its own work, and handoff artifacts that preserve critical state across context window boundaries. The article argues that the harness is what makes the difference between an agent that produces a demo and one that builds production software.

Key Takeaways

Initializer agents prepare the environment before work begins
Feature lists decompose work into tractable chunks
Self-verification catches errors before humans see them
Handoff artifacts preserve state across context windows
The harness, not the model, determines production readiness
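
A handoff artifact can be as simple as a file that the next session reads before doing anything else. The following sketch assumes a JSON file named `handoff.json` with a `features` list and a `passes` flag per feature; those names are illustrative and are not claimed to match Anthropic's format.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def write_handoff(path: Path, features: List[dict], notes: str) -> None:
    """Persist the feature list and free-form notes for the next session."""
    path.write_text(json.dumps({"features": features, "notes": notes}, indent=2))


def next_feature(path: Path) -> Optional[dict]:
    """Return the first feature that has not passed verification yet."""
    state = json.loads(path.read_text())
    for feature in state["features"]:
        if not feature["passes"]:
            return feature
    return None


def mark_done(path: Path, name: str) -> None:
    """Record that a feature passed, so a later session skips it."""
    state = json.loads(path.read_text())
    for feature in state["features"]:
        if feature["name"] == name:
            feature["passes"] = True
    path.write_text(json.dumps(state, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "handoff.json"  # hypothetical file name
    write_handoff(
        handoff,
        [{"name": "login", "passes": True}, {"name": "search", "passes": False}],
        notes="search index not built yet",
    )
    first = next_feature(handoff)["name"]  # what a fresh session would pick up
    mark_done(handoff, "search")
    remaining = next_feature(handoff)      # None once every feature passes
```

Because the state lives on disk rather than in the context window, a new session can resume from it without replaying the earlier conversation.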

Harness Design for Long-Running Application Development

Anthropic

A follow-up focused on generating complete applications autonomously. Introduces a GAN-inspired generator/evaluator architecture where one agent builds and another grades. The evaluator applies concrete criteria to turn subjective judgments ("is this design good?") into gradable terms. Covers task state management across long sessions and why decomposing builds into tractable chunks with structured handoff artifacts dramatically improves completion rates.

Key Takeaways

Generator/evaluator pattern inspired by GANs
Concrete evaluation criteria replace subjective judgments
Task state must persist across context windows
Decomposition + handoff artifacts improve completion rates
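
The generator/evaluator split can be sketched as a loop in which the evaluator returns the names of failed criteria and the generator receives them as feedback. Everything below is a toy under stated assumptions: the criteria, the HTML strings, and `toy_generate` are invented, and a real evaluator would be another agent or a test suite.

```python
from typing import Callable, Dict, List, Tuple

Criterion = Callable[[str], bool]

# Hypothetical concrete criteria standing in for "is this design good?".
CRITERIA: Dict[str, Criterion] = {
    "has_title": lambda html: "<title>" in html,
    "has_main_heading": lambda html: "<h1>" in html,
}


def evaluate(artifact: str) -> List[str]:
    """Return the names of the criteria the artifact fails."""
    return [name for name, check in CRITERIA.items() if not check(artifact)]


def build_until_accepted(
    generate: Callable[[List[str]], str], max_rounds: int = 3
) -> Tuple[str, List[str], int]:
    """Alternate generation and grading until nothing fails or rounds run out."""
    feedback: List[str] = []
    artifact = ""
    for round_number in range(1, max_rounds + 1):
        artifact = generate(feedback)   # generator sees the last round's failures
        feedback = evaluate(artifact)   # evaluator grades against fixed criteria
        if not feedback:
            return artifact, feedback, round_number
    return artifact, feedback, max_rounds


def toy_generate(feedback: List[str]) -> str:
    """Stand-in for a builder agent: adds a title only when told it is missing."""
    page = "<h1>Demo</h1>"
    if "has_title" in feedback:
        page = "<title>Demo</title>" + page
    return page


artifact, failures, rounds = build_until_accepted(toy_generate)
print(rounds, failures)
```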

The Anatomy of an Agent Harness

LangChain

LangChain's concise decomposition of what constitutes an agent harness. Defines an agent as "model + harness" where the harness includes prompts, tools, middleware, orchestration logic, and runtime infrastructure. Distinguishes between the framework (reusable components), the runtime (execution environment), and the harness (the application-specific configuration that ties everything together). This framing helps practitioners understand that most agent failures are harness failures, not model failures.

Key Takeaways

Agent = model + harness
Harness includes prompts, tools, middleware, orchestration, runtime
Framework vs. runtime vs. harness are distinct layers
Most agent failures trace back to harness configuration
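
The "agent = model + harness" framing can be made concrete with a small data structure. This is only a sketch of the idea, not LangChain's API: the `Harness` and `Agent` classes, the `CALL name arg` tool convention, and `echo_model` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Model = Callable[[str], str]  # a model is treated as a plain text-in, text-out function


@dataclass
class Harness:
    system_prompt: str
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    middleware: List[Callable[[str], str]] = field(default_factory=list)


@dataclass
class Agent:
    model: Model
    harness: Harness

    def run(self, user_input: str) -> str:
        text = user_input
        for transform in self.harness.middleware:  # e.g. trimming or redaction
            text = transform(text)
        reply = self.model(f"{self.harness.system_prompt}\n{text}")
        # Toy convention (an assumption): "CALL <tool> <arg>" requests a tool call.
        if reply.startswith("CALL "):
            _, name, arg = reply.split(" ", 2)
            return self.harness.tools[name](arg)
        return reply


def echo_model(prompt: str) -> str:
    """Stand-in model that always asks for the 'upper' tool on the last line."""
    return "CALL upper " + prompt.splitlines()[-1]


agent = Agent(
    model=echo_model,
    harness=Harness(
        system_prompt="Be brief.",
        tools={"upper": str.upper},
        middleware=[str.strip],
    ),
)
answer = agent.run("  hello  ")
print(answer)
```

Swapping `echo_model` for a different model leaves the prompt, tools, and middleware untouched, which is the separation the article describes.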

Harness Engineering

Thoughtworks

Thoughtworks divides harness engineering into three complementary activities: context engineering (what the agent knows), architectural constraints (what the agent is allowed to do), and "garbage collection" against entropy (cleaning up the mess that accumulates over long sessions). The article positions harness engineering as a new discipline that sits between traditional software engineering and AI/ML, one that requires both infrastructure skills and an understanding of model behavior.

Key Takeaways

Three activities: context engineering, constraints, entropy management
Harness engineering is a new discipline between SWE and ML
Architectural constraints define the boundaries of safe operation
"Garbage collection" prevents quality decay over long sessions

Building Effective Agents

Anthropic

Anthropic's broader guide covering the full spectrum from simple workflows to autonomous agents. Argues that structured workflows (chaining, routing, parallelization) should be preferred over unconstrained agents when the task is well-defined. Introduces patterns for tool use, handoff between specialized agents, and evaluation. The key message: start simple, add complexity only when needed, and always prefer deterministic control where possible.

Key Takeaways

Prefer structured workflows over unconstrained agents
Patterns: chaining, routing, parallelization, orchestrator-workers
Start simple, add complexity only when needed
Deterministic control > probabilistic autonomy when possible
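
Two of the workflow patterns listed above, chaining and routing, can be sketched with plain functions. The helper names `chain` and `route`, the classifier, and the handlers are assumptions made for this example; in practice some steps would be model calls, while the control flow stays deterministic.

```python
from typing import Callable, Dict, List

Step = Callable[[str], str]


def chain(steps: List[Step]) -> Step:
    """Chaining: feed each step's output into the next, in a fixed order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


def route(classify: Callable[[str], str], handlers: Dict[str, Step]) -> Step:
    """Routing: pick exactly one handler based on a classification label."""
    def run(text: str) -> str:
        return handlers[classify(text)](text)
    return run


def classify(text: str) -> str:
    """Hypothetical classifier; a real one might be a small model call."""
    return "question" if text.rstrip().endswith("?") else "statement"


normalize = chain([str.strip, str.lower])
workflow = chain([
    normalize,
    route(classify, {
        "question": lambda t: "answer: " + t,
        "statement": lambda t: "noted: " + t,
    }),
])

reply = workflow("  Is the build green?  ")
print(reply)
```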

Skill Issue: Harness Engineering for Coding Agents

HumanLayer

A provocative argument that when coding agents produce weak results, the problem is almost always the harness, not the model. Reviews common failure modes -- context overflow, missing guardrails, no validation loop -- and shows how each is solved by harness infrastructure rather than model upgrades. Makes the case that investing in harness engineering gives better ROI than waiting for the next model release.

Key Takeaways

Weak agent results are usually harness problems, not model problems
Context overflow, missing guardrails, and no validation are harness failures
Harness investment gives better ROI than waiting for better models
The harness is the highest-leverage improvement point

Your Agent Needs a Harness, Not a Framework

Inngest

Inngest argues that agent frameworks often abstract away the wrong things. What agents actually need is infrastructure for state management, automatic retries, trace collection, and concurrency control. Compares the framework approach (hiding complexity) with the harness approach (making infrastructure visible and controllable). The conclusion: frameworks help you start; harnesses help you ship.

Key Takeaways

Frameworks abstract the wrong things for production agents
State, retries, traces, and concurrency are first-class concerns
Frameworks help start; harnesses help ship
Infrastructure should be visible and controllable, not hidden
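
To make "state, retries, and traces as first-class concerns" concrete, here is a minimal sketch of a step runner that memoizes completed steps, retries failures, and records a trace. It is not Inngest's API; `RunState`, `run_step`, and `flaky_fetch` are hypothetical, and concurrency control is left out for brevity.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class RunState:
    results: Dict[str, Any] = field(default_factory=dict)  # outputs of finished steps
    trace: List[str] = field(default_factory=list)         # visible execution history


def run_step(
    state: RunState,
    name: str,
    fn: Callable[[], Any],
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> Any:
    """Run a named step once; retry on failure and skip it if already completed."""
    if name in state.results:
        state.trace.append(f"{name}: cached")
        return state.results[name]
    for attempt in range(1, max_retries + 1):
        try:
            value = fn()
        except Exception as exc:
            state.trace.append(f"{name}: attempt {attempt} failed ({exc})")
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds * attempt)  # linear backoff between retries
        else:
            state.results[name] = value
            state.trace.append(f"{name}: attempt {attempt} ok")
            return value


calls = {"count": 0}


def flaky_fetch() -> str:
    """Stand-in for a tool call that fails once before succeeding."""
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("timeout")
    return "payload"


state = RunState()
first = run_step(state, "fetch", flaky_fetch)
second = run_step(state, "fetch", flaky_fetch)  # served from state, not re-executed
print(state.trace)
```

Keeping the state and trace as explicit objects, rather than hiding them inside a framework, is what lets an operator inspect or replay a run.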
