Effective Context Engineering for AI Agents
Anthropic
Anthropic's Applied AI team frames context engineering as the natural evolution of prompt engineering. Where prompt engineering focused on crafting the right words for single-shot tasks, context engineering addresses the broader challenge of curating the entire information state available to an LLM at each inference step -- system prompts, tools, MCP servers, message history, and retrieved data.
The article introduces the concept of an "attention budget": as context length grows, the model's ability to capture pairwise token relationships degrades due to the quadratic nature of transformer attention. This creates a performance gradient rather than a hard cliff. The practical implication is that good context engineering means finding the smallest possible set of high-signal tokens that maximize the desired outcome.
For long-horizon tasks, the team details three complementary techniques: compaction (summarizing conversation history and reinitializing with compressed context), structured note-taking (persisting progress outside the context window, as Claude Code does with to-do lists), and sub-agent architectures (delegating deep exploration to focused child agents that return condensed results). The article also advocates a "just in time" context strategy -- maintaining lightweight references like file paths and URLs rather than pre-loading all data, letting the agent retrieve information on demand through tools.
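The just-in-time strategy can be sketched in a few lines: the agent keeps only labeled pointers in context and dereferences them through a tool call when it actually needs the content. (The ReferenceStore class and its method names are illustrative, not from the article.)

```python
from pathlib import Path

# Illustrative sketch: keep lightweight references (paths, URLs) in
# context instead of pre-loading full content.
class ReferenceStore:
    def __init__(self):
        self.refs = {}  # label -> file path or URL

    def remember(self, label: str, ref: str) -> str:
        """Store a pointer; only this short stub enters the context."""
        self.refs[label] = ref
        return f"[ref:{label}] {ref}"

    def load(self, label: str, max_chars: int = 4000) -> str:
        """Retrieve content on demand, when the agent decides it needs it."""
        ref = self.refs[label]
        if ref.startswith("http"):
            return f"(would fetch {ref} via a tool call)"
        return Path(ref).read_text()[:max_chars]
```

The stub costs a handful of tokens per reference; the full content is only paid for if and when it is loaded.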
Key Takeaways
- Treat context as a finite resource with diminishing marginal returns -- every token depletes the attention budget
- System prompts should hit the "right altitude": specific enough to guide behavior, flexible enough to provide heuristics
- Prefer just-in-time retrieval via tools over pre-loading all data; maintain lightweight references (paths, queries, URLs)
- Compaction, structured note-taking, and sub-agent architectures each suit different long-horizon task profiles
- Tool sets should be minimal and unambiguous -- if a human can't tell which tool to use, neither can the model

Context Engineering for AI Agents: Lessons from Building Manus
Manus
Yichao "Peak" Ji shares hard-won lessons from four complete rebuilds of the Manus agent framework, which the team affectionately calls "Stochastic Graduate Descent." The article is grounded in production metrics: with a 100:1 input-to-output token ratio, the KV-cache hit rate becomes the single most important optimization lever, affecting both latency and cost (cached tokens on Claude Sonnet cost 10x less than uncached ones).
Three architectural principles stand out. First, keep prompt prefixes stable and context append-only -- even a single-token difference (like a timestamp at the start of a system prompt) invalidates the entire cache from that point forward. Second, mask rather than remove tools: instead of dynamically adding/removing tool definitions (which breaks KV-cache and confuses the model), Manus uses a context-aware state machine that constrains action selection via token logit masking during decoding. Tool names are deliberately designed with consistent prefixes (e.g., browser_*, shell_*) to enable group-level constraints.
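The masking idea can be sketched as a toy state machine (an illustration of the technique, not Manus's implementation; the tool names, states, and greedy selection are hypothetical):

```python
# Toy sketch of tool masking via logit constraints: tool definitions
# stay in context (cache intact), but disallowed tools get -inf logits.
ALL_TOOLS = ["browser_open", "browser_click", "shell_exec", "shell_kill"]

def allowed_mask(state: str) -> list[bool]:
    """Per-tool boolean mask; a real decoder would apply this to the
    logits of the tool-name tokens during constrained decoding."""
    prefix_by_state = {
        "browsing": "browser_",   # only browser_* tools allowed
        "terminal": "shell_",     # only shell_* tools allowed
        "any": "",                # everything allowed
    }
    prefix = prefix_by_state[state]
    return [name.startswith(prefix) for name in ALL_TOOLS]

def pick(state: str, logits: list[float]) -> str:
    """Greedy selection with masked logits: disallowed tools get -inf."""
    mask = allowed_mask(state)
    masked = [l if ok else float("-inf") for l, ok in zip(logits, mask)]
    return ALL_TOOLS[masked.index(max(masked))]
```

The consistent prefixes are what make the group-level constraint a cheap string test rather than a per-tool allowlist.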
Third, use the filesystem as unbounded context. When observation data blows past window limits, Manus writes to and reads from files on demand -- treating the filesystem as structured, externalized memory. Context compression is designed to be restorable: a URL or file path is preserved even when the content itself is dropped. The article also explains Manus's todo.md technique -- by rewriting a to-do list at each step, the agent recites its objectives into the tail of context, exploiting recency bias in attention to prevent goal drift. Finally, failed actions are deliberately kept in context rather than cleaned up, because error traces shift the model's posterior away from repeating mistakes.
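Restorable compression might look like this in miniature (the function names and stub format are invented for illustration; the point is that the path survives even when the content is dropped):

```python
from pathlib import Path

# Sketch of restorable compression: the full observation is written to
# disk, and only a short stub containing the path stays in context, so
# the agent can re-read the content later if it turns out to matter.
def externalize(observation: str, store_dir: str, step: int) -> str:
    path = Path(store_dir) / f"obs_{step:04d}.txt"
    path.write_text(observation)
    preview = observation[:80].replace("\n", " ")
    return f"[stored at {path}] {preview}..."

def restore(stub: str) -> str:
    """Recover the dropped content from the path preserved in the stub."""
    path = stub.split("]", 1)[0].removeprefix("[stored at ").strip()
    return Path(path).read_text()
```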
Key Takeaways
- KV-cache hit rate is the top metric for production agents -- keep prefixes stable, context append-only, serialization deterministic
- Mask tools via logit constraints rather than removing definitions mid-session to preserve cache and prevent schema hallucination
- Use the filesystem as externalized memory: unlimited, persistent, and agent-operable with restorable compression
- Recite goals (e.g., rewriting todo.md) at the end of context to exploit recency bias and prevent goal drift
- Keep failed actions in context -- error traces are evidence that helps the model avoid repeating mistakes
Context Engineering for Coding Agents
Thoughtworks
Published on martinfowler.com, this primer from Thoughtworks maps the full landscape of context configuration features available in modern coding agents, using Claude Code as a detailed case study. The author establishes a useful taxonomy: context splits into reusable prompts (instructions that tell the agent what to do, and guidance/rules that set conventions) and context interfaces (tools, MCP servers, and skills that let the agent pull additional context on demand).
A key dimension is who decides to load context: the LLM (non-deterministic, needed for unsupervised operation), the human (controlled but reduces automation), or the agent software itself (deterministic lifecycle triggers like hooks). The article walks through Claude Code's full feature set -- CLAUDE.md, path-scoped rules, slash commands (now deprecated in favor of skills), skills with lazy-loading, subagents with isolated context windows, MCP servers, hooks, and plugins for distribution.
The strongest guidance is on size management: even though context windows are technically large, agent effectiveness degrades with excess context. The recommendation is to build configuration gradually rather than front-loading, and to leverage the agent's built-in compaction. The article closes with an honest warning about the "illusion of control" -- context engineering increases the probability of useful results, but as long as LLMs are involved, outcomes remain probabilistic and human oversight remains essential.
Key Takeaways
- Distinguish instructions (task-specific prompts) from guidance (general conventions) and context interfaces (tools, MCP, skills)
- Three actors decide when to load context: the LLM, the human, or deterministic agent lifecycle events (hooks)
- Build context configuration gradually -- models have gotten powerful enough that much old scaffolding is unnecessary
- Subagents are fundamentally about context isolation, not role-playing; they enable parallel work with clean windows
- Context engineering is probabilistic, not deterministic -- choose the right level of human oversight for the job
Advanced Context Engineering for Coding Agents
HumanLayer
HumanLayer's deep-dive argues that AI coding tools fail in production codebases not because models are too dumb, but because practitioners feed them poorly structured context. Drawing on two pivotal talks from AI Engineer 2025 -- Sean Grove's "Specs are the new code" and a Stanford study showing AI tools often cause rework in brownfield codebases -- the article introduces Frequent Intentional Compaction (FIC) as a core workflow.
FIC means designing your entire development process around context management, keeping context utilization in the 40-60% range. The workflow splits into three phases: research (understand the codebase and information flow via subagent exploration), plan (outline precise implementation steps with testing criteria), and implement (step through the plan phase by phase, compacting status back into the plan after each verified phase). The article demonstrates this on a 300k LOC Rust codebase (BAML), where an amateur Rust developer produced a merged PR fixing a real bug, and later shipped 35k LOC of new features in 7 hours.
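The FIC trigger reduces to a simple utilization check (the 40-60% band is from the article; the helper names and the plain-string plan format are hypothetical):

```python
# Minimal sketch of the Frequent Intentional Compaction trigger.
# TARGET_LOW/TARGET_HIGH bound the 40-60% utilization band.
TARGET_LOW, TARGET_HIGH = 0.40, 0.60

def utilization(used_tokens: int, window: int) -> float:
    return used_tokens / window

def should_compact(used_tokens: int, window: int) -> bool:
    """Compact once utilization drifts above the top of the band."""
    return utilization(used_tokens, window) > TARGET_HIGH

def compact(plan: str, progress_notes: list[str]) -> str:
    """Fold verified-phase status back into the plan document, which
    then seeds a fresh, low-utilization context for the next phase."""
    status = "\n".join(f"- DONE: {note}" for note in progress_notes)
    return f"{plan}\n\n## Status\n{status}"
```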
The most important insight is about human leverage: a bad line of research leads to a bad plan, which in turn leads to hundreds of bad lines of code. Therefore, human review should focus on the highest-leverage artifacts -- research documents and plans -- rather than reviewing code line by line. The article also reframes code review as primarily about mental alignment across the team, not just correctness. Specs and plans serve as readable artifacts that keep everyone oriented even when AI writes most of the code.
Key Takeaways
- Frequent Intentional Compaction: keep context utilization at 40-60% by splitting work into research, plan, implement phases
- Human review has highest leverage on research and plans -- a bad line of research cascades into hundreds of bad lines of code
- Subagents are context control mechanisms, not role-play; use them for searching/summarizing to keep the parent context clean
- Spec-driven development makes AI output reviewable: you can read 200 lines of plan instead of 2000 lines of Go
- This is deeply technical craft, not magic -- you must engage with the work or it will not produce quality results
Context-Efficient Backpressure for Coding Agents
HumanLayer
This focused post tackles a specific but widespread waste pattern: agents burning context on verbose tool output that adds no decision-making value. A passing test suite might dump 200+ lines of output, consuming 2-3% of the context window just to convey "all good" -- information expressible in fewer than 10 tokens. The fix is a deterministic backpressure wrapper that swallows output on success and only surfaces it on failure.
The core pattern is a run_silent shell function: run the command, capture output to a temp file, print a single checkmark on success or dump the full output on failure. This means the agent sees ✓ Auth tests instead of 50 lines of passing assertions, but gets full stack traces when something actually breaks. The article recommends layering additional optimizations: enable --bail/-x/--failfast flags to stop at the first failure (don't make the agent context-switch between five bugs), filter generic stack frames, and strip timing information.
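The article presents run_silent as a shell function; a Python equivalent of the same backpressure pattern might look like this (a sketch of the idea, not the article's code):

```python
import subprocess
import sys

# Python analogue of the run_silent shell wrapper: swallow output on
# success, surface everything on failure.
def run_silent(label: str, cmd: list[str]) -> int:
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"\u2713 {label}")  # e.g. "✓ Auth tests" -- a few tokens
    else:
        # Full stdout/stderr only when something actually broke.
        sys.stdout.write(result.stdout)
        sys.stderr.write(result.stderr)
        print(f"\u2717 {label} (exit {result.returncode})")
    return result.returncode
```

The agent's transcript stays nearly empty across green runs, while a red run still carries the complete stack trace.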
The post also identifies an ironic counter-pattern in current models: RL-trained models have become so context-anxious that they pipe output to /dev/null or use head -n 50 on test suites, which can actually waste more tokens (the truncation scaffolding costs more than the output it replaces) and forces re-runs when truncated output hides the actual failure. The solution is to take deterministic control of output so the model doesn't have to guess what to truncate.
Key Takeaways
- Wrap test/build/lint output: print a single checkmark on success, dump full output only on failure
- Stay in the "smart zone" (~75k tokens for Claude models) -- every line of passing test output is waste
- Use --bail / --failfast flags: one failure at a time prevents agents from context-switching between bugs
- Deterministic output control beats model-driven truncation: models using head -n 50 often waste more tokens and force re-runs
- Human time wasted on wrangling an agent in the "dumb zone" is 10x more expensive than token costs
OpenHands Context Condensation for More Efficient AI Agents
OpenHands
OpenHands introduces an intelligent context condenser that maintains bounded conversation memory while preserving the essential information needed to continue work effectively. The problem it solves is familiar: as conversations grow, agents become slower, costlier, and less effective. Starting a new chat sacrifices continuity and forces manual context management.
The condenser works by monitoring conversation size against a threshold. When exceeded, it summarizes older interactions while keeping recent exchanges intact, creating a concise memory of earlier work. The summarization is goal-aware: it encodes the user's objectives, progress made, and remaining work, plus technical details like critical files and failing tests for software engineering tasks. A key design choice is that condensation only triggers at size thresholds rather than every turn, which preserves prompt cache efficiency -- rebuilding costs are amortized across multiple turns.
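In miniature, the threshold-triggered design might look like this (the constants and function names are illustrative, not the OpenHands API; the summarizer is a stand-in for a goal-aware LLM call):

```python
# Sketch of threshold-triggered condensation: summarize older turns,
# keep the most recent ones verbatim.
MAX_EVENTS = 40   # condensation threshold (illustrative value)
KEEP_RECENT = 10  # always-preserved tail of recent exchanges

def summarize(events: list[str]) -> str:
    # Stand-in for a goal-aware LLM summary covering objectives,
    # progress, remaining work, critical files, and failing tests.
    return f"[summary of {len(events)} earlier events]"

def condense(history: list[str]) -> list[str]:
    if len(history) <= MAX_EVENTS:
        return history  # below threshold: leave the prompt cache intact
    head, tail = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(head)] + tail
```

Because condense is a no-op below the threshold, the cached prefix survives most turns and the rebuild cost is amortized, as the article describes.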
The results on SWE-bench Verified are compelling: context condensation achieves up to 2x per-turn API cost reduction, consistent response times in long sessions, and equivalent or slightly better task completion (54% vs 53% baseline). The baseline agent's costs scale quadratically over time as context grows, while the condensed approach scales linearly. The only trade-off is occasional extra turns for the condensation step itself. This validates the core insight that aggressive context pruning, when done thoughtfully, does not sacrifice performance.
Key Takeaways
- Bounded conversation memory: summarize older interactions while preserving recent context, goals, and technical state
- Condensation only at thresholds, not every turn, to preserve prompt cache efficiency and amortize rebuild costs
- 2x per-turn cost reduction with equivalent task completion on SWE-bench Verified (54% vs 53% baseline)
- Baseline context scales quadratically; condensation makes it linear -- enormous gains in long sessions
- Goal-aware summarization preserves what matters: user objectives, progress, remaining work, critical files, failing tests
Writing a Good CLAUDE.md
HumanLayer
This practical guide addresses the highest-leverage single file in any coding agent workflow: CLAUDE.md (or its open-source equivalent AGENTS.md). Since this file goes into every single conversation, it functions as the onboarding document that tells the agent what the project is, why it exists, and how to work on it -- stack, structure, build commands, test workflows, and verification steps.
The central finding is that less is more. Research indicates frontier thinking LLMs can reliably follow roughly 150-200 instructions, with performance decaying linearly as count increases (exponentially for smaller models). Since Claude Code's own system prompt already consumes about 50 instructions, that leaves limited budget for user instructions. Crucially, Claude Code injects the CLAUDE.md with a note saying "this context may or may not be relevant" -- so overstuffed files get partially ignored by design.
The recommended approach is progressive disclosure: keep the root file concise (HumanLayer's own is under 60 lines) and store task-specific instructions in separate files (agent_docs/building_the_project.md, agent_docs/running_tests.md, etc.) that the agent can load on demand. The article warns against three anti-patterns: using CLAUDE.md as a linter (use deterministic tools instead), auto-generating it with /init (it's too high-leverage for auto-generation), and including non-universal instructions that dilute the signal. Prefer pointers over copies -- reference file:line locations rather than pasting code snippets that go stale.
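A progressive-disclosure root file might look roughly like this (an invented skeleton for a hypothetical project, not HumanLayer's actual file):

```markdown
# CLAUDE.md (illustrative skeleton)

Acme API: a Go service exposing billing endpoints.

## Commands
- Build: `make build`
- Test: `make test` (flags and fixtures: agent_docs/running_tests.md)

## Conventions
- HTTP handlers live in internal/http/ -- see internal/http/router.go:12

## Task-specific docs (load on demand)
- agent_docs/building_the_project.md
- agent_docs/running_tests.md
```

Note the file:line pointer instead of a pasted snippet, and the agent_docs/ references the agent only reads when the task calls for them.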
Key Takeaways
- CLAUDE.md is the highest leverage point of the harness -- it goes into every session, so every line must earn its place
- Frontier LLMs follow ~150-200 instructions reliably; Claude Code's system prompt already uses ~50 of that budget
- Use progressive disclosure: concise root file + separate on-demand docs for task-specific guidance
- Never use CLAUDE.md as a linter or code style guide -- use deterministic tools and hooks instead
- Prefer file:line pointers over code copies; auto-generating CLAUDE.md wastes its potential