Everything you need to build reliable AI agent systems. Curated articles, tools, benchmarks, and practical guides.
Built from the awesome-harness-engineering collection and beyond.
What harness engineering is, why it matters, and the foundational thinking from OpenAI, Anthropic, and Thoughtworks.
Managing the context window as working memory. KV-cache locality, CLAUDE.md, context condensation, and backpressure.
Sandboxing, tool boundaries, prompt injection defense, quality checks, and safe autonomous operation.
AGENTS.md, agent.md, spec-driven development, 12-factor agents, and workflow design patterns.
Testing agent skills, trace grading, eval best practices, and measuring what matters.
The definitive catalogue of agent benchmarks, from SWE-bench to Terminal-Bench, WebArena to OSWorld.
Agent SDKs, coding agent frameworks, sandboxed execution, and reference harness implementations.
This knowledge base is the central hub for harness engineering resources, curated from the awesome-harness-engineering collection and original research. It covers everything needed to build reliable AI agent systems.
Core concepts: What harness engineering is, the CAR framework (Control, Agency, Runtime), foundational articles from Martin Fowler/Thoughtworks, Anthropic's agent design patterns, OpenAI's agent research, and the evolution from prompt engineering to context engineering to harness engineering.
Managing AI context windows effectively: KV-cache locality optimization, CLAUDE.md and AGENTS.md as context documents, context condensation techniques, backpressure patterns, progressive disclosure of instructions, and sub-agent context firewalls.
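Context condensation and backpressure can be sketched in a few lines: when the conversation history exceeds a token budget, fold older messages into a summary and keep only the recent tail. This is a minimal illustrative sketch, not any SDK's API; `Message`, `estimate_tokens`, and `summarize` are hypothetical names, and the real summarization step would be an LLM call.

```python
# Sketch of context condensation under a token budget (backpressure).
# All names here are illustrative, not part of any specific agent SDK.
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def summarize(messages: list[Message]) -> str:
    # Stand-in for an LLM summarization call.
    return f"Summary of {len(messages)} earlier messages."

def condense(history: list[Message], budget: int, keep_recent: int = 4) -> list[Message]:
    """If history exceeds the token budget, fold older messages into one summary note."""
    total = sum(estimate_tokens(m.content) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [Message("system", summarize(older))] + recent
```

Keeping the recent tail verbatim while summarizing the rest also preserves KV-cache locality for the most recently appended turns.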
Securing agent execution: sandboxed execution environments, tool permission boundaries, PreToolUse hook patterns for blocking dangerous commands, prompt injection defense, quality-gate Stop hooks, file system protection, and safe autonomous operation patterns.
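The idea behind a PreToolUse-style guard is to inspect each tool call before it runs and block anything matching a denylist. The dict-in/dict-out hook contract below is illustrative only; real harnesses (Claude Code hooks, for example) define their own interfaces.

```python
# Sketch of a PreToolUse-style guard that blocks dangerous shell commands
# before a tool call executes. The hook contract shown is hypothetical.
import re

DENY_PATTERNS = [
    r"\brm\s+-rf\s+/",          # recursive delete from root
    r"\bgit\s+push\s+--force",  # force-push over remote history
    r"\bcurl\b.*\|\s*sh\b",     # pipe a remote script into a shell
]

def pre_tool_use(tool_name: str, tool_input: dict) -> dict:
    """Return a block/allow decision for a pending tool call."""
    if tool_name != "bash":
        return {"decision": "allow"}
    command = tool_input.get("command", "")
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            return {"decision": "block", "reason": f"matched {pattern!r}"}
    return {"decision": "allow"}
```

Denylists are a last line of defense, not a sandbox; they pair with, rather than replace, sandboxed execution and permission boundaries.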
Structuring agent work: AGENTS.md specification format, agent.md protocol, spec-driven development methodology, the 12-factor agent principles, workflow design patterns, and multi-agent orchestration approaches.
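To make the AGENTS.md format concrete, here is a minimal illustrative example (the project name and commands are hypothetical, and real files vary widely in structure):

```markdown
# AGENTS.md

## Project overview
A TypeScript monorepo; packages live under packages/.

## Build and test
- Install: `pnpm install`
- Test a single package: `pnpm --filter <pkg> test`

## Conventions
- Run the linter before committing.
- Never edit generated files under dist/.
```

The file sits at the repository root and is read by coding agents as plain markdown instructions; there is no enforced schema.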
Measuring agent performance: eval best practices, trace-based grading, skill testing methodologies, LLM-as-judge patterns, observability instrumentation, cost tracking, latency monitoring, success rate measurement, and continuous improvement loops.
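Trace-based grading with an LLM-as-judge reduces to: collect traces, ask a judge to score each against a rubric, and aggregate. A minimal sketch, with the judge call stubbed out; `Trace`, `judge`, and the rubric convention are all hypothetical, and a real judge would prompt a model with the full trace and rubric.

```python
# Sketch of trace-based grading with an LLM-as-judge (judge call stubbed).
from dataclasses import dataclass

@dataclass
class Trace:
    task: str
    steps: list[str]
    final_answer: str

def judge(trace: Trace, rubric: str) -> bool:
    # Stand-in for a model call returning pass/fail against the rubric.
    # Toy rule: pass if the rubric's expected substring appears in the answer.
    expected = rubric.removeprefix("answer contains: ")
    return expected in trace.final_answer

def grade(traces: list[Trace], rubric: str) -> float:
    """Fraction of traces that pass the rubric (a success rate)."""
    if not traces:
        return 0.0
    return sum(judge(t, rubric) for t in traces) / len(traces)
```

The same aggregation shape extends naturally to cost and latency: attach per-step metadata to each trace and average it alongside the pass rate.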
Comprehensive benchmark catalogue: SWE-bench and SWE-bench Verified for software engineering, Terminal-Bench for CLI tasks, WebArena and VisualWebArena for web navigation, OSWorld for desktop operation, HumanEval and MBPP for code generation, MATH and GSM8K for reasoning, and dozens more specialized agent benchmarks.
Implementation resources: Claude Code SDK, OpenAI Agents SDK, LangChain/LangGraph, CrewAI, Anthropic's computer use, E2B sandboxed execution, Modal for serverless agent runtimes, and reference harness implementations.
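Whatever SDK is underneath, a reference harness shares one shape: the harness owns the loop, dispatches tool calls, and feeds observations back until the model signals completion. A minimal sketch with a stubbed model call; nothing here is a specific SDK's API.

```python
# Minimal agent-harness loop sketch. `call_model` is a stub standing in
# for any provider SDK; the message format is illustrative.
from typing import Callable

def call_model(messages: list[dict]) -> dict:
    # Stub: a real harness would call an LLM API here. This one requests
    # a single tool call, then finishes once it sees the result.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "final", "content": "done"}
    return {"type": "tool_call", "name": "read_file", "args": {"path": "README"}}

def run_harness(task: str, tools: dict[str, Callable], max_steps: int = 8) -> str:
    """Loop: model -> tool dispatch -> observation, until final or budget hit."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["type"] == "final":
            return reply["content"]
        result = tools[reply["name"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"
```

The step budget is the simplest control point in the CAR sense: the harness, not the model, decides when to stop.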