This page catalogs agent SDKs, frameworks, runtimes, harnesses, and reference implementations relevant to harness engineering. It covers the conceptual distinction between frameworks (abstractions), runtimes (durable execution), and harnesses (batteries-included agents), then surveys specific tools: Claude Agent SDK, AgentKit, deepagents, SWE-agent, SWE-ReX, Harbor, and Harness Evolver. Key topics: multi-agent orchestration, sandboxed code execution, agent evaluation, harness evolution.

Tools, Runtimes & Reference Implementations

The practical building blocks for AI agent systems. From low-level execution runtimes to batteries-included harnesses, these projects define how agents interact with code, tools, and each other in production.

Understanding the Stack

Harrison Chase of LangChain proposes a three-layer taxonomy for the agent tooling ecosystem that clarifies what was previously a muddled landscape. The framework decomposes agent infrastructure into three distinct tiers, each serving a different developer need and operating at a different level of abstraction.

Agent Frameworks (e.g. LangChain, Vercel AI SDK, OpenAI Agents SDK, Google ADK, CrewAI) provide abstractions and mental models: standard interfaces for the agent loop, structured content blocks, middleware, and tool integrations. They make it easy to get started and create consistency across projects, though poor abstractions can obfuscate internals.

Agent Runtimes (e.g. LangGraph, Temporal, Inngest) operate below frameworks, handling infrastructure-level concerns for production deployments: durable execution, streaming, human-in-the-loop support, and thread-level and cross-thread persistence. LangChain 1.0 itself is built on top of the LangGraph runtime, demonstrating how runtimes power frameworks.

Agent Harnesses (e.g. DeepAgents, Claude Code / Claude Agent SDK) sit above frameworks. They are batteries-included packages that ship with default prompts, opinionated tool-call handling, planning tools, filesystem access, and context management. Chase describes DeepAgents as "a general-purpose version of Claude Code" and notes that most coding CLIs are effectively harnesses. This is the layer closest to what harness engineering is about: the full extra-model environment where the agent operates.

  • Three-layer taxonomy: Framework → Runtime → Harness
  • Frameworks provide abstractions; runtimes provide durability
  • Harnesses ship opinionated defaults and built-in tools
  • Runtimes can power frameworks (LangChain on LangGraph)
Read on LangChain Blog
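The layering can be sketched in miniature. This is an illustrative Python sketch, not code from any of the libraries named above -- every class and method here is hypothetical:

```python
# Illustrative three-layer stack: framework -> runtime -> harness.
# All names are hypothetical; real libraries differ substantially.

class Framework:
    """Abstractions: a standard agent loop over a tool catalog."""
    def __init__(self, tools):
        self.tools = tools

    def step(self, state):
        # One iteration of the agent loop: pick a tool, apply it.
        name, arg = state["next_action"]
        return self.tools[name](arg)

class Runtime:
    """Durability: persist state so a crashed run can resume."""
    def __init__(self):
        self.checkpoints = []

    def run(self, framework, state):
        self.checkpoints.append(dict(state))  # thread-level persistence
        return framework.step(state)

class Harness(Runtime):
    """Batteries included: opinionated defaults on top of the stack."""
    DEFAULT_TOOLS = {"echo": lambda x: x}

    def __init__(self, extra_tools=None):
        super().__init__()
        self.framework = Framework({**self.DEFAULT_TOOLS,
                                    **(extra_tools or {})})

    def run(self, state):
        return super().run(self.framework, state)

harness = Harness()
result = harness.run({"next_action": ("echo", "hi")})
```

The point of the sketch is the direction of dependency: the harness bundles a framework and inherits durability from the runtime, which matches the taxonomy's claim that runtimes power frameworks and harnesses sit on top.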

Agent SDKs

Claude Agent SDK

SDK

Anthropic's production agent SDK, evolved from the Claude Code SDK after the team discovered that the harness powering Claude Code was effective far beyond coding tasks -- research, video creation, note-taking, and other non-coding workflows. The core design principle is "give Claude a computer": instead of constraining agents to predefined tool catalogs, the SDK provides terminal access, file operations, and bash execution so agents can work the way human developers do.

The SDK is organized around a feedback loop: gather context (agentic file search, semantic search, subagents with isolated context windows, automatic compaction for long-running sessions), take action (custom tools, bash scripts, code generation, MCP integrations), and verify work (rules-based validation, visual feedback via screenshots, LLM-as-judge evaluation). Subagents are first-class citizens -- they can be spawned in parallel for context gathering, each operating with an isolated context and returning only relevant information to the orchestrator.

The SDK supports MCP (Model Context Protocol) for standardized external integrations -- Slack, GitHub, Google Drive, Asana -- without custom OAuth code. It particularly excels at code generation as an action primitive, since code is precise, composable, and reusable. Anthropic uses this pattern internally for file creation in Claude.ai, where Claude writes Python scripts to produce Excel, PowerPoint, and Word documents.

  • Terminal-first: agents get computer access, not just tools
  • Subagents with isolated contexts for parallel work
  • Automatic compaction for long-running sessions
  • MCP support for standardized external integrations
  • Code generation as a first-class action primitive
  • Visual feedback loop with screenshot verification
Read on Claude Blog
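The gather/act/verify cycle can be written down as a plain loop. This is a minimal sketch of the pattern, not the SDK's actual API -- all function names here are hypothetical stand-ins:

```python
# Minimal gather -> act -> verify feedback loop (hypothetical names;
# not the Claude Agent SDK's API).

def agent_loop(task, gather, act, verify, max_iters=5):
    """Repeat until the verifier accepts the work or turns run out."""
    context = gather(task)                # e.g. file search, subagent reports
    for _ in range(max_iters):
        artifact = act(task, context)     # e.g. run bash, generate code
        ok, feedback = verify(artifact)   # e.g. rules, screenshots, LLM judge
        if ok:
            return artifact
        context = context + [feedback]    # fold verifier feedback back in
    raise RuntimeError("verification never passed")

# Toy instantiation: "fix" a string until it is upper-case.
result = agent_loop(
    "hello",
    gather=lambda t: [],
    act=lambda t, ctx: t.upper() if ctx else t,
    verify=lambda a: (a.isupper(), "make it upper-case"),
)
```

The structural point is that verification output feeds back into context for the next action, which is what distinguishes a feedback loop from a one-shot pipeline.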

AgentKit

SDK

A TypeScript toolkit from Inngest for constructing multi-agent networks with deterministic routing and rich tooling via MCP. Unlike autonomous-first frameworks where you hope the LLM makes the right routing decisions, AgentKit uses state-based routing -- a bidirectional typed state machine accessible across system prompts, tools, lifecycle callbacks, and routing functions. This gives developers deterministic control over agent orchestration while still leveraging LLM intelligence within each agent.

The architecture revolves around three primitives: Agents (LLM-powered entities with prompts, tools, and MCP integrations), Networks (coordination layers enabling agent collaboration with shared state and handoff), and Routers (orchestration logic from simple code-based to ReAct-style LLM-based implementations). Agents update shared state through tool execution, and subsequent routers read that state to make intelligent dispatch decisions.

When deployed via Inngest's orchestration engine, agents gain fault tolerance and durable execution. The framework supports multiple model providers and has built-in tracing for debugging and optimization in both local and cloud environments.

  • Deterministic state-based routing, not pure LLM autonomy
  • Multi-agent networks with typed shared state
  • MCP-native tooling integration
  • Durable execution via Inngest orchestration engine
  • Built-in tracing for debugging and optimization
  • Supports multiple LLM providers
View on GitHub
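AgentKit itself is TypeScript; the state-based routing idea it implements is language-neutral, though, and can be sketched in a few lines of Python. Everything below is illustrative -- none of these functions correspond to AgentKit's API:

```python
# Deterministic state-based routing (illustrative; not AgentKit's API).
# Agents write to shared state; the router reads it to dispatch.

def triage(state):
    state["category"] = "billing" if "invoice" in state["input"] else "general"

def billing(state):
    state["answer"] = "routed to billing"

def general(state):
    state["answer"] = "routed to general support"

def router(state):
    """Read shared state and return the next agent deterministically."""
    if "category" not in state:
        return triage
    if "answer" not in state:
        return billing if state["category"] == "billing" else general
    return None  # done

state = {"input": "question about an invoice"}
while (agent := router(state)) is not None:
    agent(state)
```

Because routing decisions depend only on explicit state, the dispatch path is reproducible and testable -- the LLM's judgment stays inside each agent rather than in the orchestration layer.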

Coding Agents

deepagents

Framework

An opinionated, ready-to-run agent harness built on LangChain and LangGraph. Where most frameworks require developers to manually assemble prompts, tools, and context management, deepagents provides a functional agent immediately with create_deep_agent(). It eliminates the boilerplate of wiring together language models, tool integrations, conversation management, and execution logic by bundling tested patterns and sensible defaults.

The built-in tool suite includes planning (write_todos for task decomposition and progress monitoring), file operations (read_file, write_file, edit_file, ls, glob, grep), sandboxed shell execution, and sub-agents via a task tool for delegating work with isolated context. Context management is automatic -- lengthy conversations get summarized, and large outputs are stored on the filesystem rather than filling the context window. The framework ships with smart prompting that teaches models to use tools effectively.

Built on LangGraph's production-ready runtime, deepagents returns a compiled LangGraph graph, enabling integration with LangGraph features like streaming, persistence, and checkpointing. It is model-agnostic, supporting any LLM with tool-calling capabilities.

  • Batteries-included: planning, file ops, shell, sub-agents
  • Automatic context management and conversation summarization
  • Built on LangGraph for streaming, persistence, checkpointing
  • Model-agnostic: any LLM with tool-calling support
  • Smart prompting teaches models effective tool usage
  • Single call to create_deep_agent() for a working agent
View on GitHub
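The "large outputs go to the filesystem" pattern that deepagents automates can be sketched independently of the library. This is an illustration of the pattern only, with an arbitrary threshold -- it is not deepagents' implementation:

```python
# Offloading large tool outputs to files instead of the context window
# (illustrative pattern, not deepagents' implementation).

import os
import tempfile

MAX_INLINE_CHARS = 200  # threshold is arbitrary for this sketch

def record_tool_output(context, output, workdir):
    """Keep small outputs inline; spill large ones to a file and
    reference them by path so the context window stays small."""
    if len(output) <= MAX_INLINE_CHARS:
        context.append(output)
    else:
        path = os.path.join(workdir, f"output_{len(context)}.txt")
        with open(path, "w") as f:
            f.write(output)
        context.append(f"[stored {len(output)} chars at {path}]")
    return context

with tempfile.TemporaryDirectory() as d:
    ctx = []
    record_tool_output(ctx, "short result", d)
    record_tool_output(ctx, "x" * 10_000, d)
```

The agent later reads the file back with its file tools only if it actually needs the full output, trading one cheap pointer in context for on-demand retrieval.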

SWE-agent

Tool

An autonomous AI system from Princeton and Stanford that transforms language models into software engineering agents capable of fixing issues in real GitHub repositories, identifying cybersecurity vulnerabilities, and solving competitive coding challenges. SWE-agent achieves state-of-the-art performance on SWE-bench among open-source projects; SWE-agent 1.0 paired with Claude 3.7 Sonnet set records on both the full and Verified SWE-bench splits.

The design philosophy is "free-flowing and generalizable" -- maximizing language model agency rather than constraining it with rigid pipelines. The entire system is governed by a single YAML configuration file, making it both highly configurable and research-friendly. It supports multiple LLMs including GPT-4o and Claude Sonnet 4 through agent-computer interfaces that structure interaction between models and development tools.

  • State-of-the-art on SWE-bench (open source)
  • YAML-configured, research-friendly design
  • Supports GPT-4o, Claude, and other LLMs
  • Offensive security and competitive coding support
View on GitHub

SWE-ReX

Tool

A runtime framework for sandboxed shell environment interactions, born from the practical experience of building SWE-agent. SWE-ReX abstracts away infrastructure differences so agent code remains consistent whether commands run locally, in Docker containers, on AWS, or on Modal. This is especially critical for parallel execution -- the system supports running 100+ agents simultaneously.

Key capabilities include interactive shell session management with automatic command-completion detection, support for interactive tools (IPython, GDB), and simultaneous multi-session handling that mirrors how developers juggle multiple terminal windows. The architecture separates agent logic from infrastructure, supporting backends including local, Docker, AWS Fargate, and Modal.

  • Run any command on any environment transparently
  • 100+ parallel agents with consistent API
  • Interactive tool support (IPython, GDB)
  • Docker, AWS Fargate, Modal backends
View on GitHub
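One common way to detect command completion in a long-lived shell session -- the problem SWE-ReX's automatic detection solves -- is to append a unique sentinel after every command and read output until it appears. This sketch shows the technique in general; SWE-ReX's actual mechanism may differ:

```python
# Sentinel-based completion detection for a persistent shell session
# (illustrative technique; not SWE-ReX's implementation).

import subprocess
import uuid

class ShellSession:
    def __init__(self):
        self.proc = subprocess.Popen(
            ["/bin/sh"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1,
        )

    def run(self, command):
        """Send a command, then read until a unique sentinel line
        appears -- that is how we know the command has finished."""
        sentinel = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {sentinel}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.strip() == sentinel:
                break
            lines.append(line.rstrip("\n"))
        return "\n".join(lines)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

sess = ShellSession()
out = sess.run("echo hello; echo world")
sess.close()
```

Because the session stays alive between run() calls, shell state (cwd, environment variables, background jobs) persists across commands -- exactly what interactive tools like IPython and GDB require.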

Evaluation Harnesses

Harbor

Framework

A generalized framework from the creators of Terminal-Bench for evaluating and optimizing AI agents and language models. Harbor addresses the critical gap between building an agent and knowing whether it actually works: it provides standardized infrastructure for running agent evaluations and creating reinforcement learning environments.

The framework supports evaluating arbitrary agents -- Claude Code, OpenHands, Codex CLI, and more -- against standardized benchmarks. Developers can build and share their own benchmarks and environments, conduct experiments across thousands of parallel environments through cloud providers like Daytona and Modal, and generate rollouts for RL optimization. This makes Harbor both an evaluation platform and a training data pipeline.

Implemented primarily in Python with Docker integration, Harbor installs via pip install harbor or uv tool install harbor and provides a CLI for running evaluations, listing datasets, and configuring execution environments. The separation between agent code and evaluation infrastructure means you can swap agents without rewriting benchmarks.

  • Evaluate any agent against standardized benchmarks
  • Build and share custom benchmarks and environments
  • Thousands of parallel environments via Daytona/Modal
  • Generate rollouts for RL optimization
  • CLI-first: pip install harbor
  • Supports Claude Code, OpenHands, Codex CLI, and more
View on GitHub
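The separation Harbor enforces between agent code and evaluation infrastructure comes down to a shared interface: any agent that satisfies it can be scored by any benchmark. A toy sketch with entirely hypothetical names (Harbor's real API differs):

```python
# Swapping agents without rewriting the benchmark: both sides agree on
# a minimal interface (illustrative; not Harbor's API).

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]   # verifier for the agent's answer

def run_benchmark(agent, tasks):
    """Score any agent that maps a prompt string to an answer string."""
    passed = sum(task.check(agent(task.prompt)) for task in tasks)
    return passed / len(tasks)

tasks = [
    Task("2+2", lambda a: a == "4"),
    Task("capital of France", lambda a: a.lower() == "paris"),
]

echo_agent = lambda prompt: prompt          # trivially bad agent
calc_agent = lambda prompt: "4" if prompt == "2+2" else "Paris"

scores = {"echo": run_benchmark(echo_agent, tasks),
          "calc": run_benchmark(calc_agent, tasks)}
```

The same interface that makes agents swappable also makes the runner a rollout generator: log (prompt, answer, pass/fail) tuples and you have RL training data.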

Meta-Harnesses

Harness Evolver

Plugin

A Claude Code plugin that autonomously evolves LLM agent harnesses through iterative, data-driven optimization. Instead of manually tuning system prompts, routing logic, retrieval mechanisms, and orchestration code, Harness Evolver uses a multi-agent architecture where specialized agents propose mutations, evaluate results, detect regressions, and synthesize learnings across iterations.

The evolution loop follows seven stages: preflight validation, failure analysis, candidate generation (proposer agents modify actual codebase files within isolated git worktrees), evaluation (LLM-as-judge scoring via LangSmith), selection (Pareto optimization), learning (archive synthesis), and continuation gates (constraint validation, efficiency checks, regression detection, stagnation monitoring).

Winning changes are automatically merged back to the main branch, while regressions are rejected. In a demonstrated case, the system achieved a 74% improvement on a real RAG agent, reaching perfect scores after seven iterations while rejecting three regressions. This represents the frontier of harness engineering: harnesses that improve themselves.

  • Autonomous iterative evolution of agent harnesses
  • Real code mutations in isolated git worktrees
  • LangSmith-native evaluation with LLM-as-judge scoring
  • Pareto selection with regression detection
  • 74% improvement demonstrated on a real RAG agent
  • Multi-agent: proposers, evaluators, detectors, synthesizers
View on GitHub
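Stripped of the multi-agent machinery, the propose → evaluate → select → gate loop reduces to a standard hill-climbing skeleton. This toy sketch illustrates the pattern with a numeric fitness function; it is not Harness Evolver's code:

```python
# Toy evolution loop: propose mutations, score them, keep only
# candidates that beat the incumbent, reject regressions.
# (Illustrative of the pattern, not Harness Evolver's implementation.)

import random

def evolve(candidate, mutate, score, iterations=20, seed=0):
    rng = random.Random(seed)
    best, best_score = candidate, score(candidate)
    rejected = 0
    for _ in range(iterations):
        proposal = mutate(best, rng)   # proposer agent's mutation
        s = score(proposal)            # evaluator / LLM-as-judge score
        if s > best_score:             # selection: keep only improvements
            best, best_score = proposal, s
        else:
            rejected += 1              # regression gate: discard the rest
    return best, best_score, rejected

# Toy target: evolve a number toward 10.
best, best_score, rejected = evolve(
    candidate=0.0,
    mutate=lambda c, rng: c + rng.uniform(-1, 1),
    score=lambda c: -abs(c - 10),
)
```

The real system replaces the scalar candidate with codebase files in isolated git worktrees, the scalar score with multi-objective Pareto selection, and adds learning and stagnation gates -- but the accept-improvements, reject-regressions core is the same.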

Multi-Agent Systems

Anthropic's deep engineering post on the multi-agent architecture powering Claude's Research feature. The system uses an orchestrator-worker pattern: a lead agent (Claude Opus 4) analyzes queries, develops strategy, and spawns specialized subagents (Claude Sonnet 4) that explore different aspects simultaneously. Each subagent acts as an intelligent filter -- iteratively searching, evaluating, and returning compressed findings rather than raw data.

The results are striking: the multi-agent system outperformed single-agent Claude Opus 4 by 90.2% on Anthropic's internal research eval. Their analysis of BrowseComp revealed that token usage alone explains 80% of performance variance, with tool calls and model choice as secondary factors. Multi-agent architectures effectively scale token usage for tasks that exceed single-agent limits, though they burn through tokens fast -- about 15x more than standard chat interactions.

The article distills eight prompting principles for multi-agent systems:

  • Think like your agents: build simulations to watch them work
  • Teach the orchestrator how to delegate: specific objectives, output formats, tool guidance, clear task boundaries
  • Scale effort to query complexity: embedded heuristics for 1 vs 10+ subagents
  • Design tools carefully: tool descriptions are as critical as prompts
  • Let agents improve themselves: Claude 4 models are excellent prompt engineers -- a tool-testing agent achieved 40% faster task completion by rewriting tool descriptions
  • Start wide, then narrow down
  • Guide the thinking process: extended thinking as a controllable scratchpad
  • Parallelize tool calling: cut research time by up to 90%

Production challenges include: agents are stateful and errors compound, requiring durable execution with resume capabilities; debugging demands full production tracing of decision patterns; updates ship via rainbow deployments to avoid disrupting running agents; and synchronous subagent execution creates bottlenecks that asynchronous patterns could resolve.

  • Orchestrator-worker pattern: Opus 4 lead + Sonnet 4 subagents
  • 90.2% improvement over single-agent on internal eval
  • Token usage explains 80% of performance variance
  • 8 prompting principles for multi-agent systems
  • Rainbow deployments for stateful agent updates
  • Subagent filesystem output to avoid "telephone game"
  • Extended thinking as controllable agent scratchpad
  • 40% faster task completion from agent-rewritten tool descriptions
Read on Anthropic Engineering
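The orchestrator-worker pattern can be sketched with a thread pool: the lead agent fans the query out, and each worker returns a compressed finding rather than raw data. The functions below are stand-ins for the LLM calls in Anthropic's system, not their code:

```python
# Orchestrator-worker sketch: a lead agent fans a query out to
# subagents in parallel; each returns a compressed summary, not raw data.
# (Illustrative; these are stand-ins for the real LLM calls.)

from concurrent.futures import ThreadPoolExecutor

def subagent(aspect):
    """Explore one aspect and return only a compressed finding --
    the 'intelligent filter' role from the article."""
    raw = [f"{aspect}-doc-{i}" for i in range(100)]   # pretend search
    return f"{aspect}: {len(raw)} sources, top hit {raw[0]}"

def orchestrator(query, aspects):
    """Decompose the query, run subagents in parallel, merge findings."""
    with ThreadPoolExecutor(max_workers=len(aspects)) as pool:
        findings = list(pool.map(subagent, aspects))
    return {"query": query, "findings": findings}

report = orchestrator("state of agent harnesses",
                      ["frameworks", "runtimes", "harnesses"])
```

Compression at the subagent boundary is the load-bearing design choice: the orchestrator's context holds three one-line findings instead of three hundred raw documents, which is how the architecture scales token usage without drowning the lead agent.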

Comparison

|                | SWE-agent | deepagents | AgentKit | Harbor |
|----------------|-----------|------------|----------|--------|
| Type           | Coding agent harness | General-purpose harness | Multi-agent SDK | Evaluation framework |
| Language       | Python | Python | TypeScript | Python |
| Primary Use    | Autonomous bug fixing, security research, competitive coding | General-purpose long-running agents with planning and tools | Deterministic multi-agent networks with state-based routing | Agent evaluation, benchmark creation, RL data generation |
| Runtime        | SWE-ReX (sandboxed shell) | LangGraph (durable execution) | Inngest (fault-tolerant orchestration) | Docker / Daytona / Modal |
| Multi-Model    | Yes (GPT-4o, Claude, etc.) | Yes (any tool-calling LLM) | Yes (multiple providers) | Yes (evaluates any agent) |
| Sub-agents     | No | Yes (task tool) | Yes (network handoffs) | N/A (evaluator) |
| MCP Support    | No | Via LangChain | Native | No |
| Sandboxing     | SWE-ReX (Docker, AWS, Modal) | Sandboxed shell execution | Via Inngest isolation | Docker / cloud providers |
| Configuration  | Single YAML file | Programmatic (Python) | Programmatic (TypeScript) | CLI + config |
| Best For       | Research on coding agents, SWE-bench evaluation | Building Claude Code-like agents quickly | Production multi-agent systems with control | Benchmarking agents, generating RL training data |
