Harness Engineering
Knowledge Base

Everything you need to build reliable AI agent systems. Curated articles, tools, benchmarks, and practical guides.

Built from the awesome-harness-engineering collection and beyond.

Foundations

8 articles

What harness engineering is, why it matters, and the foundational thinking from OpenAI, Anthropic, and Thoughtworks.

Context Engineering

7 articles

Managing the context window as working memory. KV-cache locality, CLAUDE.md, context condensation, and backpressure.

Safety & Guardrails

8 articles

Sandboxing, tool boundaries, prompt injection defense, quality checks, and safe autonomous operation.

Specs & Workflows

6 articles

AGENTS.md, agent.md, spec-driven development, 12-factor agents, and workflow design patterns.

Evals & Observability

11 articles

Testing agent skills, trace grading, eval best practices, and measuring what matters.

Benchmarks

45 benchmarks

A comprehensive catalogue of agent benchmarks, from SWE-bench and Terminal-Bench to WebArena and OSWorld.

Tools & Runtimes

9 resources

Agent SDKs, coding agent frameworks, sandboxed execution, and reference harness implementations.

92+ Resources | 8 Categories | 45 Benchmarks | Updated April 2026

Harness Engineering Knowledge Base — Full Structure

This knowledge base is the central hub for harness engineering resources, curated from the awesome-harness-engineering collection and original research. It covers everything needed to build reliable AI agent systems.

Foundations (8 articles)

Core concepts: What harness engineering is, the CAR framework (Control, Agency, Runtime), foundational articles from Martin Fowler/Thoughtworks, Anthropic's agent design patterns, OpenAI's agent research, and the evolution from prompt engineering to context engineering to harness engineering.

Context Engineering (7 articles)

Managing AI context windows effectively: KV-cache locality optimization, CLAUDE.md and AGENTS.md as context documents, context condensation techniques, backpressure patterns, progressive disclosure of instructions, and sub-agent context firewalls.
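The KV-cache locality idea above can be sketched in a few lines. This is an illustrative helper, not any SDK's API: the point is simply that ordering context from most-stable to most-volatile lets repeated turns share the longest possible cached token prefix.

```python
# Sketch of KV-cache-friendly prompt assembly (hypothetical helper):
# stable parts first, volatile parts last, so each new turn reuses the
# cached prefix and only the tail needs to be recomputed.

def assemble_context(system_prompt: str, project_doc: str,
                     history: list[str], latest_input: str) -> str:
    """Order context from most-stable to most-volatile.

    The system prompt and project doc (e.g. CLAUDE.md contents) do not
    change mid-session, so placing them first keeps the token prefix
    identical across turns; only the appended history invalidates cache.
    """
    stable = [system_prompt, project_doc]   # identical every turn
    volatile = history + [latest_input]     # grows each turn
    return "\n\n".join(stable + volatile)

turn1 = assemble_context("You are a coding agent.",
                         "# CLAUDE.md\nRun tests before committing.",
                         [], "Fix the bug.")
turn2 = assemble_context("You are a coding agent.",
                         "# CLAUDE.md\nRun tests before committing.",
                         ["Fix the bug.", "Done."], "Now add tests.")
# turn2 begins with turn1's stable prefix, so its KV cache is reusable.
assert turn2.startswith(
    "You are a coding agent.\n\n# CLAUDE.md\nRun tests before committing.")
```

The same ordering argument is why appending to the end of the context (rather than editing its middle) is the cache-friendly default.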

Safety and Guardrails (8 articles)

Securing AI agent operation: sandboxed execution environments, tool permission boundaries, PreToolUse hook patterns for blocking dangerous commands, prompt injection defense, quality gate Stop hooks, file system protection, and safe autonomous operation patterns.
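The PreToolUse hook pattern mentioned above can be sketched as a small guard script. Field names (`tool_name`, `tool_input`) and the exit-code convention follow Claude Code's hook style, where the pending tool call arrives as JSON on stdin and a blocking exit code rejects it; treat the details as assumptions and the blocklist as illustrative, not exhaustive.

```python
import json  # used by the real stdin wiring shown in the comment below
import re
import sys

# Illustrative patterns only; a real guard would be far more thorough.
DANGEROUS = [
    r"\brm\s+-rf\s+/",          # recursive delete from the filesystem root
    r"\bgit\s+push\s+--force",  # history rewrite on a shared branch
    r">\s*/dev/sd",             # raw writes to a block device
]

def is_dangerous(command: str) -> bool:
    """Return True if the shell command matches any blocked pattern."""
    return any(re.search(p, command) for p in DANGEROUS)

def handle_tool_call(call: dict) -> int:
    """Return 0 to allow the tool call, 2 to block it.

    Claude Code-style hooks treat exit code 2 as a rejection whose
    stderr message is fed back to the model (an assumption here).
    """
    if call.get("tool_name") == "Bash":
        cmd = call.get("tool_input", {}).get("command", "")
        if is_dangerous(cmd):
            print(f"blocked dangerous command: {cmd!r}", file=sys.stderr)
            return 2
    return 0

# In an actual hook script this would be driven from stdin:
#   sys.exit(handle_tool_call(json.load(sys.stdin)))
```

Deny-by-pattern like this is a backstop, not a substitute for sandboxing; the two patterns compose.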

Specs and Workflows (6 articles)

Structuring agent work: AGENTS.md specification format, agent.md protocol, spec-driven development methodology, the 12-factor agent principles, workflow design patterns, and multi-agent orchestration approaches.
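An AGENTS.md file is plain markdown that the agent reads for project-specific instructions. A minimal illustrative example (the content is invented; the format is just ordinary markdown sections):

```markdown
# AGENTS.md

## Setup
- Install dependencies with `npm ci` (Node 20 assumed).

## Testing
- Run `npm test` before committing; all tests must pass.

## Conventions
- TypeScript strict mode; avoid `any`.
- Commit messages follow Conventional Commits.
```

Because the format is free-form markdown rather than a schema, the practical discipline is keeping it short, imperative, and verifiable.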

Evals and Observability (11 articles)

Measuring agent performance: eval best practices, trace-based grading, skill testing methodologies, LLM-as-judge patterns, observability instrumentation, cost tracking, latency monitoring, success rate measurement, and continuous improvement loops.
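The success-rate measurement mentioned above reduces to simple bookkeeping once each trace has been graded pass/fail. A minimal sketch (names invented for illustration) that aggregates per-task pass rates across repeated runs:

```python
from collections import defaultdict

def success_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map task id -> fraction of graded runs that passed.

    `results` holds (task_id, passed) pairs, one per eval run; running
    each task several times separates flaky tasks from hard ones.
    """
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for task_id, ok in results:
        total[task_id] += 1
        passed[task_id] += ok
    return {t: passed[t] / total[t] for t in total}

runs = [("fix-bug", True), ("fix-bug", True), ("fix-bug", False),
        ("add-test", True)]
rates = success_rates(runs)
assert abs(rates["fix-bug"] - 2 / 3) < 1e-9
assert rates["add-test"] == 1.0
```

Tracking these rates over time, per task, is what turns one-off evals into the continuous improvement loop the section describes.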

Benchmarks (45 benchmarks)

Comprehensive benchmark catalogue: SWE-bench and SWE-bench Verified for software engineering, Terminal-Bench for CLI tasks, WebArena and VisualWebArena for web navigation, OSWorld for desktop operation, HumanEval and MBPP for code generation, MATH and GSM8K for reasoning, and dozens more specialized agent benchmarks.

Tools and Runtimes (9 resources)

Implementation resources: Claude Code SDK, OpenAI Agents SDK, LangChain/LangGraph, CrewAI, Anthropic's computer use, E2B sandboxed execution, Modal for serverless agent runtimes, and reference harness implementations.

Total: 92+ resources across 8 categories, with 45 dedicated benchmark entries. Last updated April 2026.