This page catalogs agent SDKs, frameworks, runtimes, harnesses, and reference implementations relevant to harness engineering. It covers the conceptual distinction between frameworks (abstractions), runtimes (durable execution), and harnesses (batteries-included agents), then surveys specific tools: Claude Agent SDK, AgentKit, deepagents, SWE-agent, SWE-ReX, Harbor, and Harness Evolver. Key topics: multi-agent orchestration, sandboxed code execution, agent evaluation, harness evolution.

Tools, Runtimes & Reference Implementations

The practical building blocks for AI agent systems. From low-level execution runtimes to batteries-included harnesses, these projects define how agents interact with code, tools, and each other in production.

Understanding the Stack

Harrison Chase of LangChain proposes a three-layer taxonomy for the agent tooling ecosystem that clarifies what was previously a muddled landscape. The framework decomposes agent infrastructure into three distinct tiers, each serving a different developer need and operating at a different level of abstraction.

Agent Frameworks (e.g. LangChain, Vercel AI SDK, OpenAI Agents SDK, Google ADK, CrewAI) provide abstractions and mental models: standard interfaces for the agent loop, structured content blocks, middleware, and tool integrations. They make it easy to get started and create consistency across projects, though poor abstractions can obfuscate internals.

Agent Runtimes (e.g. LangGraph, Temporal, Inngest) operate below frameworks, handling infrastructure-level concerns for production deployments: durable execution, streaming, human-in-the-loop support, and thread-level and cross-thread persistence. LangChain 1.0 itself is built on top of the LangGraph runtime, demonstrating how runtimes power frameworks.

Agent Harnesses (e.g. DeepAgents, Claude Code / Claude Agent SDK) sit above frameworks. They are batteries-included packages that ship with default prompts, opinionated tool-call handling, planning tools, filesystem access, and context management. Chase describes DeepAgents as "a general-purpose version of Claude Code" and notes that most coding CLIs are effectively harnesses. This is the layer closest to what harness engineering is about: the full extra-model environment where the agent operates.

  • Three-layer taxonomy: Framework → Runtime → Harness
  • Frameworks provide abstractions; runtimes provide durability
  • Harnesses ship opinionated defaults and built-in tools
  • Runtimes can power frameworks (LangChain on LangGraph)
Read on LangChain Blog
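The layering can be sketched in miniature. This is an illustrative Python sketch, not code from any of the libraries named above -- every class and method here is hypothetical:

```python
# Illustrative three-layer stack: framework -> runtime -> harness.
# All names are hypothetical; real libraries differ substantially.

class Framework:
    """Abstractions: a standard agent loop over a tool catalog."""
    def __init__(self, tools):
        self.tools = tools

    def step(self, state):
        # One iteration of the agent loop: pick a tool, apply it.
        name, arg = state["next_action"]
        return self.tools[name](arg)

class Runtime:
    """Durability: persist state so a crashed run can resume."""
    def __init__(self):
        self.checkpoints = []

    def run(self, framework, state):
        self.checkpoints.append(dict(state))  # thread-level persistence
        return framework.step(state)

class Harness(Runtime):
    """Batteries included: opinionated defaults on top of the stack."""
    DEFAULT_TOOLS = {"echo": lambda x: x}

    def __init__(self, extra_tools=None):
        super().__init__()
        self.framework = Framework({**self.DEFAULT_TOOLS,
                                    **(extra_tools or {})})

    def run(self, state):
        return super().run(self.framework, state)

harness = Harness()
result = harness.run({"next_action": ("echo", "hi")})
```

The point of the sketch is the direction of dependency: the harness bundles a framework and inherits durability from the runtime, which matches the taxonomy's claim that runtimes power frameworks and harnesses sit on top.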

Agent SDKs

Claude Agent SDK

SDK

Anthropic's production agent SDK, evolved from the Claude Code SDK after the team discovered that the harness powering Claude Code was effective far beyond coding tasks -- research, video creation, note-taking, and other non-coding workflows. The core design principle is "give Claude a computer": instead of constraining agents to predefined tool catalogs, the SDK provides terminal access, file operations, and bash execution so agents can work the way human developers do.

The SDK is organized around a feedback loop: gather context (agentic file search, semantic search, subagents with isolated context windows, automatic compaction for long-running sessions), take action (custom tools, bash scripts, code generation, MCP integrations), and verify work (rules-based validation, visual feedback via screenshots, LLM-as-judge evaluation). Subagents are first-class citizens -- they can be spawned in parallel for context gathering, each operating with an isolated context and returning only relevant information to the orchestrator.

The SDK supports MCP (Model Context Protocol) for standardized external integrations -- Slack, GitHub, Google Drive, Asana -- without custom OAuth code. It particularly excels at code generation as an action primitive, since code is precise, composable, and reusable. Anthropic uses this pattern internally for file creation in Claude.ai, where Claude writes Python scripts to produce Excel, PowerPoint, and Word documents.

  • Terminal-first: agents get computer access, not just tools
  • Subagents with isolated contexts for parallel work
  • Automatic compaction for long-running sessions
  • MCP support for standardized external integrations
  • Code generation as a first-class action primitive
  • Visual feedback loop with screenshot verification
Read on Claude Blog
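The gather/act/verify cycle can be written down as a plain loop. This is a minimal sketch of the pattern, not the SDK's actual API -- all function names here are hypothetical stand-ins:

```python
# Minimal gather -> act -> verify feedback loop (hypothetical names;
# not the Claude Agent SDK's API).

def agent_loop(task, gather, act, verify, max_iters=5):
    """Repeat until the verifier accepts the work or turns run out."""
    context = gather(task)                # e.g. file search, subagent reports
    for _ in range(max_iters):
        artifact = act(task, context)     # e.g. run bash, generate code
        ok, feedback = verify(artifact)   # e.g. rules, screenshots, LLM judge
        if ok:
            return artifact
        context = context + [feedback]    # fold verifier feedback back in
    raise RuntimeError("verification never passed")

# Toy instantiation: "fix" a string until it is upper-case.
result = agent_loop(
    "hello",
    gather=lambda t: [],
    act=lambda t, ctx: t.upper() if ctx else t,
    verify=lambda a: (a.isupper(), "make it upper-case"),
)
```

The structural point is that verification output feeds back into context for the next action, which is what distinguishes a feedback loop from a one-shot pipeline.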

AgentKit

SDK

A TypeScript toolkit from Inngest for constructing multi-agent networks with deterministic routing and rich tooling via MCP. Unlike autonomous-first frameworks where you hope the LLM makes the right routing decisions, AgentKit uses state-based routing -- a bidirectional typed state machine accessible across system prompts, tools, lifecycle callbacks, and routing functions. This gives developers deterministic control over agent orchestration while still leveraging LLM intelligence within each agent.

The architecture revolves around three primitives: Agents (LLM-powered entities with prompts, tools, and MCP integrations), Networks (coordination layers enabling agent collaboration with shared state and handoff), and Routers (orchestration logic from simple code-based to ReAct-style LLM-based implementations). Agents update shared state through tool execution, and subsequent routers read that state to make intelligent dispatch decisions.

When deployed via Inngest's orchestration engine, agents gain fault tolerance and durable execution. The framework supports multiple model providers and has built-in tracing for debugging and optimization in both local and cloud environments.

  • Deterministic state-based routing, not pure LLM autonomy
  • Multi-agent networks with typed shared state
  • MCP-native tooling integration
  • Durable execution via Inngest orchestration engine
  • Built-in tracing for debugging and optimization
  • Supports multiple LLM providers
View on GitHub
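AgentKit itself is TypeScript; the state-based routing idea it implements is language-neutral, though, and can be sketched in a few lines of Python. Everything below is illustrative -- none of these functions correspond to AgentKit's API:

```python
# Deterministic state-based routing (illustrative; not AgentKit's API).
# Agents write to shared state; the router reads it to dispatch.

def triage(state):
    state["category"] = "billing" if "invoice" in state["input"] else "general"

def billing(state):
    state["answer"] = "routed to billing"

def general(state):
    state["answer"] = "routed to general support"

def router(state):
    """Read shared state and return the next agent deterministically."""
    if "category" not in state:
        return triage
    if "answer" not in state:
        return billing if state["category"] == "billing" else general
    return None  # done

state = {"input": "question about an invoice"}
while (agent := router(state)) is not None:
    agent(state)
```

Because routing decisions depend only on explicit state, the dispatch path is reproducible and testable -- the LLM's judgment stays inside each agent rather than in the orchestration layer.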

Coding Agents

deepagents

Framework

An opinionated, ready-to-run agent harness built on LangChain and LangGraph. Where most frameworks require developers to manually assemble prompts, tools, and context management, deepagents provides a functional agent immediately with create_deep_agent(). It eliminates the boilerplate of wiring together language models, tool integrations, conversation management, and execution logic by bundling tested patterns and sensible defaults.

The built-in tool suite includes planning (write_todos for task decomposition and progress monitoring), file operations (read_file, write_file, edit_file, ls, glob, grep), sandboxed shell execution, and sub-agents via a task tool for delegating work with isolated context. Context management is automatic -- lengthy conversations get summarized, and large outputs are stored on the filesystem rather than filling the context window. The framework ships with smart prompting that teaches models to use tools effectively.

Built on LangGraph's production-ready runtime, deepagents returns a compiled LangGraph graph, enabling integration with LangGraph features like streaming, persistence, and checkpointing. It is model-agnostic, supporting any LLM with tool-calling capabilities.

  • Batteries-included: planning, file ops, shell, sub-agents
  • Automatic context management and conversation summarization
  • Built on LangGraph for streaming, persistence, checkpointing
  • Model-agnostic: any LLM with tool-calling support
  • Smart prompting teaches models effective tool usage
  • Single call to create_deep_agent() for a working agent
View on GitHub
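The "large outputs go to the filesystem" pattern that deepagents automates can be sketched independently of the library. This is an illustration of the pattern only, with an arbitrary threshold -- it is not deepagents' implementation:

```python
# Offloading large tool outputs to files instead of the context window
# (illustrative pattern, not deepagents' implementation).

import os
import tempfile

MAX_INLINE_CHARS = 200  # threshold is arbitrary for this sketch

def record_tool_output(context, output, workdir):
    """Keep small outputs inline; spill large ones to a file and
    reference them by path so the context window stays small."""
    if len(output) <= MAX_INLINE_CHARS:
        context.append(output)
    else:
        path = os.path.join(workdir, f"output_{len(context)}.txt")
        with open(path, "w") as f:
            f.write(output)
        context.append(f"[stored {len(output)} chars at {path}]")
    return context

with tempfile.TemporaryDirectory() as d:
    ctx = []
    record_tool_output(ctx, "short result", d)
    record_tool_output(ctx, "x" * 10_000, d)
```

The agent later reads the file back with its file tools only if it actually needs the full output, trading one cheap pointer in context for on-demand retrieval.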

SWE-agent

Tool

An autonomous AI system from Princeton and Stanford that transforms language models into software engineering agents capable of fixing issues in real GitHub repositories, identifying cybersecurity vulnerabilities, and solving competitive coding challenges. SWE-agent achieves state-of-the-art performance on SWE-bench among open-source projects; SWE-agent 1.0 paired with Claude 3.7 Sonnet set records on both the full and Verified SWE-bench splits.

The design philosophy is "free-flowing and generalizable" -- maximizing language model agency rather than constraining it with rigid pipelines. The entire system is governed by a single YAML configuration file, making it both highly configurable and research-friendly. It supports multiple LLMs including GPT-4o and Claude Sonnet 4 through agent-computer interfaces that structure interaction between models and development tools.

  • State-of-the-art on SWE-bench (open source)
  • YAML-configured, research-friendly design
  • Supports GPT-4o, Claude, and other LLMs
  • Offensive security and competitive coding support
View on GitHub

SWE-ReX

Tool

A runtime framework for sandboxed shell environment interactions, born from the practical experience of building SWE-agent. SWE-ReX abstracts away infrastructure differences so agent code remains consistent whether commands run locally, in Docker containers, on AWS, or on Modal. This is especially critical for parallel execution -- the system supports running 100+ agents simultaneously.

Key capabilities include interactive shell session management with automatic command-completion detection, support for interactive tools (IPython, GDB), and simultaneous multi-session handling that mirrors how developers juggle multiple terminal windows. The architecture separates agent logic from infrastructure, supporting backends including local, Docker, AWS Fargate, and Modal.

  • Run any command on any environment transparently
  • 100+ parallel agents with consistent API
  • Interactive tool support (IPython, GDB)
  • Docker, AWS Fargate, Modal backends
View on GitHub
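One common way to detect command completion in a long-lived shell session -- the problem SWE-ReX's automatic detection solves -- is to append a unique sentinel after every command and read output until it appears. This sketch shows the technique in general; SWE-ReX's actual mechanism may differ:

```python
# Sentinel-based completion detection for a persistent shell session
# (illustrative technique; not SWE-ReX's implementation).

import subprocess
import uuid

class ShellSession:
    def __init__(self):
        self.proc = subprocess.Popen(
            ["/bin/sh"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1,
        )

    def run(self, command):
        """Send a command, then read until a unique sentinel line
        appears -- that is how we know the command has finished."""
        sentinel = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {sentinel}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.strip() == sentinel:
                break
            lines.append(line.rstrip("\n"))
        return "\n".join(lines)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

sess = ShellSession()
out = sess.run("echo hello; echo world")
sess.close()
```

Because the session stays alive between run() calls, shell state (cwd, environment variables, background jobs) persists across commands -- exactly what interactive tools like IPython and GDB require.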

Evaluation Harnesses

Harbor

Framework

A generalized framework from the creators of Terminal-Bench for evaluating and optimizing AI agents and language models. Harbor addresses the critical gap between building an agent and knowing whether it actually works: it provides standardized infrastructure for running agent evaluations and creating reinforcement learning environments.

The framework supports evaluating arbitrary agents -- Claude Code, OpenHands, Codex CLI, and more -- against standardized benchmarks. Developers can build and share their own benchmarks and environments, conduct experiments across thousands of parallel environments through cloud providers like Daytona and Modal, and generate rollouts for RL optimization. This makes Harbor both an evaluation platform and a training data pipeline.

Implemented primarily in Python with Docker integration, Harbor installs via pip install harbor or uv tool install harbor and provides a CLI for running evaluations, listing datasets, and configuring execution environments. The separation between agent code and evaluation infrastructure means you can swap agents without rewriting benchmarks.

  • Evaluate any agent against standardized benchmarks
  • Build and share custom benchmarks and environments
  • Thousands of parallel environments via Daytona/Modal
  • Generate rollouts for RL optimization
  • CLI-first: pip install harbor
  • Supports Claude Code, OpenHands, Codex CLI, and more
View on GitHub
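The separation Harbor enforces between agent code and evaluation infrastructure comes down to a shared interface: any agent that satisfies it can be scored by any benchmark. A toy sketch with entirely hypothetical names (Harbor's real API differs):

```python
# Swapping agents without rewriting the benchmark: both sides agree on
# a minimal interface (illustrative; not Harbor's API).

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]   # verifier for the agent's answer

def run_benchmark(agent, tasks):
    """Score any agent that maps a prompt string to an answer string."""
    passed = sum(task.check(agent(task.prompt)) for task in tasks)
    return passed / len(tasks)

tasks = [
    Task("2+2", lambda a: a == "4"),
    Task("capital of France", lambda a: a.lower() == "paris"),
]

echo_agent = lambda prompt: prompt          # trivially bad agent
calc_agent = lambda prompt: "4" if prompt == "2+2" else "Paris"

scores = {"echo": run_benchmark(echo_agent, tasks),
          "calc": run_benchmark(calc_agent, tasks)}
```

The same interface that makes agents swappable also makes the runner a rollout generator: log (prompt, answer, pass/fail) tuples and you have RL training data.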

Meta-Harnesses

Harness Evolver

Plugin

A Claude Code plugin that autonomously evolves LLM agent harnesses through iterative, data-driven optimization. Instead of manually tuning system prompts, routing logic, retrieval mechanisms, and orchestration code, Harness Evolver uses a multi-agent architecture where specialized agents propose mutations, evaluate results, detect regressions, and synthesize learnings across iterations.

The evolution loop follows seven stages: preflight validation, failure analysis, candidate generation (proposer agents modify actual codebase files within isolated git worktrees), evaluation (LLM-as-judge scoring via LangSmith), selection (Pareto optimization), learning (archive synthesis), and continuation gates (constraint validation, efficiency checks, regression detection, stagnation monitoring).

Winning changes are automatically merged back to the main branch, while regressions are rejected. In a demonstrated case, the system achieved a 74% improvement on a real RAG agent, reaching perfect scores after seven iterations while rejecting three regressions. This represents the frontier of harness engineering: harnesses that improve themselves.

  • Autonomous iterative evolution of agent harnesses
  • Real code mutations in isolated git worktrees
  • LangSmith-native evaluation with LLM-as-judge scoring
  • Pareto selection with regression detection
  • 74% improvement demonstrated on a real RAG agent
  • Multi-agent: proposers, evaluators, detectors, synthesizers
View on GitHub
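Stripped of the multi-agent machinery, the propose → evaluate → select → gate loop reduces to a standard hill-climbing skeleton. This toy sketch illustrates the pattern with a numeric fitness function; it is not Harness Evolver's code:

```python
# Toy evolution loop: propose mutations, score them, keep only
# candidates that beat the incumbent, reject regressions.
# (Illustrative of the pattern, not Harness Evolver's implementation.)

import random

def evolve(candidate, mutate, score, iterations=20, seed=0):
    rng = random.Random(seed)
    best, best_score = candidate, score(candidate)
    rejected = 0
    for _ in range(iterations):
        proposal = mutate(best, rng)   # proposer agent's mutation
        s = score(proposal)            # evaluator / LLM-as-judge score
        if s > best_score:             # selection: keep only improvements
            best, best_score = proposal, s
        else:
            rejected += 1              # regression gate: discard the rest
    return best, best_score, rejected

# Toy target: evolve a number toward 10.
best, best_score, rejected = evolve(
    candidate=0.0,
    mutate=lambda c, rng: c + rng.uniform(-1, 1),
    score=lambda c: -abs(c - 10),
)
```

The real system replaces the scalar candidate with codebase files in isolated git worktrees, the scalar score with multi-objective Pareto selection, and adds learning and stagnation gates -- but the accept-improvements, reject-regressions core is the same.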

Multi-Agent Systems

Anthropic's deep engineering post on the multi-agent architecture powering Claude's Research feature. The system uses an orchestrator-worker pattern: a lead agent (Claude Opus 4) analyzes queries, develops strategy, and spawns specialized subagents (Claude Sonnet 4) that explore different aspects simultaneously. Each subagent acts as an intelligent filter -- iteratively searching, evaluating, and returning compressed findings rather than raw data.

The results are striking: the multi-agent system outperformed single-agent Claude Opus 4 by 90.2% on Anthropic's internal research eval. Their analysis of BrowseComp revealed that token usage alone explains 80% of performance variance, with tool calls and model choice as secondary factors. Multi-agent architectures effectively scale token usage for tasks that exceed single-agent limits, though they burn through tokens fast -- about 15x more than standard chat interactions.

The article distills eight prompting principles for multi-agent systems:

  • Think like your agents: build simulations to watch them work
  • Teach the orchestrator how to delegate: specific objectives, output formats, tool guidance, clear task boundaries
  • Scale effort to query complexity: embedded heuristics for 1 vs 10+ subagents
  • Design tools carefully: tool descriptions are as critical as prompts
  • Let agents improve themselves: Claude 4 models are excellent prompt engineers -- a tool-testing agent achieved 40% faster task completion by rewriting tool descriptions
  • Start wide, then narrow down
  • Guide the thinking process: extended thinking as a controllable scratchpad
  • Parallelize tool calling: cut research time by up to 90%

Production challenges include: agents are stateful and errors compound, requiring durable execution with resume capabilities; debugging demands full production tracing of decision patterns; updates ship via rainbow deployments to avoid disrupting running agents; and synchronous subagent execution creates bottlenecks that asynchronous patterns could resolve.

  • Orchestrator-worker pattern: Opus 4 lead + Sonnet 4 subagents
  • 90.2% improvement over single-agent on internal eval
  • Token usage explains 80% of performance variance
  • 8 prompting principles for multi-agent systems
  • Rainbow deployments for stateful agent updates
  • Subagent filesystem output to avoid "telephone game"
  • Extended thinking as controllable agent scratchpad
  • 40% faster task completion from agent-rewritten tool descriptions
Read on Anthropic Engineering
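The orchestrator-worker pattern can be sketched with a thread pool: the lead agent fans the query out, and each worker returns a compressed finding rather than raw data. The functions below are stand-ins for the LLM calls in Anthropic's system, not their code:

```python
# Orchestrator-worker sketch: a lead agent fans a query out to
# subagents in parallel; each returns a compressed summary, not raw data.
# (Illustrative; these are stand-ins for the real LLM calls.)

from concurrent.futures import ThreadPoolExecutor

def subagent(aspect):
    """Explore one aspect and return only a compressed finding --
    the 'intelligent filter' role from the article."""
    raw = [f"{aspect}-doc-{i}" for i in range(100)]   # pretend search
    return f"{aspect}: {len(raw)} sources, top hit {raw[0]}"

def orchestrator(query, aspects):
    """Decompose the query, run subagents in parallel, merge findings."""
    with ThreadPoolExecutor(max_workers=len(aspects)) as pool:
        findings = list(pool.map(subagent, aspects))
    return {"query": query, "findings": findings}

report = orchestrator("state of agent harnesses",
                      ["frameworks", "runtimes", "harnesses"])
```

Compression at the subagent boundary is the load-bearing design choice: the orchestrator's context holds three one-line findings instead of three hundred raw documents, which is how the architecture scales token usage without drowning the lead agent.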

Comparison

|                | SWE-agent | deepagents | AgentKit | Harbor |
|----------------|-----------|------------|----------|--------|
| Type           | Coding agent harness | General-purpose harness | Multi-agent SDK | Evaluation framework |
| Language       | Python | Python | TypeScript | Python |
| Primary Use    | Autonomous bug fixing, security research, competitive coding | General-purpose long-running agents with planning and tools | Deterministic multi-agent networks with state-based routing | Agent evaluation, benchmark creation, RL data generation |
| Runtime        | SWE-ReX (sandboxed shell) | LangGraph (durable execution) | Inngest (fault-tolerant orchestration) | Docker / Daytona / Modal |
| Multi-Model    | Yes (GPT-4o, Claude, etc.) | Yes (any tool-calling LLM) | Yes (multiple providers) | Yes (evaluates any agent) |
| Sub-agents     | No | Yes (task tool) | Yes (network handoffs) | N/A (evaluator) |
| MCP Support    | No | Via LangChain | Native | No |
| Sandboxing     | SWE-ReX (Docker, AWS, Modal) | Sandboxed shell execution | Via Inngest isolation | Docker / cloud providers |
| Configuration  | Single YAML file | Programmatic (Python) | Programmatic (TypeScript) | CLI + config |
| Best For       | Research on coding agents, SWE-bench evaluation | Building Claude Code-like agents quickly | Production multi-agent systems with control | Benchmarking agents, generating RL training data |
