Evals & Observability

If you can't measure it, you can't improve it. These articles cover how to evaluate agent performance, grade execution traces, and use observability to drive harness improvements.

Building Evals

Testing Agent Skills Systematically with Evals

OpenAI

OpenAI's guide to converting agent execution traces into repeatable evaluation tests. Shows how to capture JSONL logs from agent runs, define deterministic checks against expected outcomes, and build regression suites that catch when harness changes break existing capabilities. The key insight: agent evals are just software tests applied to non-deterministic systems -- the same discipline applies, with statistical thresholds instead of exact matches.
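The trace-to-regression-test pattern the guide describes can be sketched in a few lines of Python. The JSONL record fields (`task`, `output`, `expected`) are hypothetical illustrations, not OpenAI's actual schema:

```python
import json

# Hypothetical trace format: one JSON object per agent run, recording
# the task id, the agent's final output, and the expected outcome.
SAMPLE_TRACES = [
    '{"task": "rename-fn", "output": "done", "expected": "done"}',
    '{"task": "rename-fn", "output": "done", "expected": "done"}',
    '{"task": "rename-fn", "output": "error", "expected": "done"}',
]

def pass_rate(jsonl_lines):
    """Deterministic per-run check, aggregated into a pass rate."""
    results = [json.loads(line) for line in jsonl_lines]
    passed = sum(1 for r in results if r["output"] == r["expected"])
    return passed / len(results)

# Non-deterministic system, so the regression gate is a statistical
# threshold, not an exact match on every individual run.
assert pass_rate(SAMPLE_TRACES) >= 0.6
```

A harness change that drops the pass rate below the threshold fails the suite, which is exactly the regression signal the guide is after.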

How to Evaluate Agent Skills (And Why You Should)

OpenHands

Hands-on playbook for measuring whether a specific skill actually improves agent performance. Uses bounded tasks (clear start/end), deterministic verifiers (not LLM-as-judge), no-skill baselines for comparison, and trace review to understand why things fail. Makes the important distinction between "the agent completed the task" and "the agent completed the task reliably."
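The playbook's core comparison can be sketched as follows, assuming a hypothetical run-record format where each run carries an `exit_code` from a deterministic verifier (for example, the task's test suite):

```python
def success_rate(runs, verifier):
    """Fraction of runs accepted by a deterministic verifier."""
    return sum(1 for r in runs if verifier(r)) / len(runs)

def skill_lift(runs_with_skill, runs_baseline, verifier):
    """'Completed the task' vs. 'completed it reliably': compare
    success rates over repeated runs against a no-skill baseline."""
    return success_rate(runs_with_skill, verifier) - success_rate(runs_baseline, verifier)

# Hypothetical verifier: a run passes iff its recorded exit code is 0
# (e.g. the test suite exited cleanly). Same input, same verdict --
# no LLM-as-judge involved.
def verifier(run):
    return run["exit_code"] == 0

with_skill = [{"exit_code": 0}] * 9 + [{"exit_code": 1}]
baseline = [{"exit_code": 0}] * 6 + [{"exit_code": 1}] * 4
lift = skill_lift(with_skill, baseline, verifier)
```

Running the task many times with and without the skill is what turns "it worked once" into a reliability claim.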

Agent Evals

OpenAI

Product documentation for measuring agent quality with reproducible task-level and workflow-level evaluations. Covers metric selection, test harness setup, and how to separate model quality from harness quality in eval results.

Evaluation Best Practices

OpenAI

Building eval suites that match real-world distributions and catch regressions early. Covers sample selection, avoiding distribution mismatch between evals and production, and iterative refinement of eval criteria as the system evolves.

Trace Analysis

Trace Grading

OpenAI

Grading agent traces directly rather than just final outcomes. Especially valuable for long multi-step tasks where the final result might be correct but the path was inefficient or dangerous. Covers trajectory scoring, step-level grading, and how to identify harness improvements from trace analysis.
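Step-level grading alongside outcome grading might look like the sketch below. The trajectory format (a list of `(action, ok)` pairs) and the scoring rubric are assumptions for illustration, not the article's actual method:

```python
def grade_trajectory(steps, max_steps=10):
    """Grade the path, not just the destination: a correct final result
    reached via failed or redundant steps still signals a harness problem."""
    step_scores = [1.0 if ok else 0.0 for _, ok in steps]
    step_quality = sum(step_scores) / len(step_scores)  # step-level grade
    efficiency = min(1.0, max_steps / len(steps))       # penalize long paths
    final_ok = steps[-1][1]                             # outcome-only grade
    return {"final_ok": final_ok, "step_quality": step_quality, "efficiency": efficiency}

run = [("read_file", True), ("edit_file", False), ("edit_file", True), ("run_tests", True)]
grade = grade_trajectory(run)
# The outcome grade alone says "pass"; the step-level grade of 0.75
# surfaces the failed edit that an outcome-only eval would hide.
```

Aggregating step-level scores across many traces is what points to specific harness fixes (e.g. a tool whose calls fail disproportionately often).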

Learning to Verify AI-Generated Code

OpenHands

A layered verification stack using trajectory critics trained on production traces. These critics enable reranking (pick the best among multiple attempts), early stopping (abort doomed runs), and review-time quality control (flag suspicious patterns before human review). Represents the cutting edge of automated verification for code-producing agents.
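The reranking and early-stopping uses of a critic can be sketched as below, with a hypothetical critic that scores a trace by its fraction of successful steps (the real critics are trained models, not heuristics):

```python
def rerank(attempts, critic):
    """Best-of-n: score each complete trajectory with the critic,
    keep the highest-scoring attempt."""
    return max(attempts, key=critic)

def should_stop_early(partial_score, floor=0.2):
    """Early stopping: abort a run the critic already scores as doomed."""
    return partial_score < floor

# Hypothetical stand-in critic: fraction of steps that succeeded.
def critic(trace):
    return sum(1 for ok in trace if ok) / len(trace)

attempts = [[True, False, True], [True, True, True], [False, False, True]]
best = rerank(attempts, critic)  # picks the all-passing attempt
```

The third use, review-time flagging, is the same critic applied once more to the chosen trajectory before a human ever sees it.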

Demystifying Evals for AI Agents

Anthropic

What to measure when agents have many possible successful trajectories. Unlike traditional software tests with one correct answer, agent evals must handle path diversity. Covers outcome-based vs. trajectory-based evaluation, the role of human judgment in eval design, and how to avoid gaming metrics with superficially correct but brittle solutions.

Harness Impact

Quantifying Infrastructure Noise in Agentic Coding Evals

Anthropic

A critical finding: runtime configuration (timeout values, memory limits, network latency, container image versions) can shift coding benchmark scores by margins larger than the gaps separating models on public leaderboards. This means apparent model differences are often harness differences in disguise. The implication: when comparing agents, you must control for infrastructure variables, or your conclusions are meaningless.
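One practical response is to fingerprint the runtime configuration and store it alongside every score, so results produced under different infrastructure are never pooled. A minimal sketch, with hypothetical config fields:

```python
import hashlib
import json

def eval_fingerprint(config):
    """Hash a canonical serialization of the runtime configuration so
    scores from different harness setups can't be silently compared."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

config_a = {"timeout_s": 300, "memory_mb": 4096, "image": "runner:1.2"}
config_b = {"timeout_s": 600, "memory_mb": 4096, "image": "runner:1.2"}

# Changing only the timeout yields a different fingerprint -- these two
# runs belong to different experiments, not the same leaderboard row.
assert eval_fingerprint(config_a) != eval_fingerprint(config_b)
```

`sort_keys=True` makes the serialization order-independent, so two dicts with the same settings always fingerprint identically.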

Evaluating Deep Agents: Our Learnings

LangChain

Practical breakdown of three eval levels: single-step (did this one action work?), full-run (did the entire task complete?), and multi-turn (does the agent maintain quality across a conversation?). Each level requires different infrastructure and reveals different types of failures.
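The three levels can be sketched as separate check functions; the record fields here are hypothetical, not LangChain's actual schema:

```python
def single_step_eval(step):
    """Single-step: did this one tool call produce the expected result?"""
    return step["result"] == step["expected"]

def full_run_eval(run):
    """Full-run: did the entire task reach the goal state?"""
    return run["final_state"] == run["goal_state"]

def multi_turn_eval(runs, threshold=0.8):
    """Multi-turn: does quality hold up across a whole conversation,
    not just on any individual task?"""
    passes = sum(1 for r in runs if full_run_eval(r))
    return passes / len(runs) >= threshold

conversation = [{"final_state": "done", "goal_state": "done"}] * 4 \
             + [{"final_state": "error", "goal_state": "done"}]
```

A failure at one level with passes at another is itself diagnostic: single-step failures point at tools, full-run failures at planning, multi-turn failures at context management.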

Improving Deep Agents with Harness Engineering

LangChain

Evidence that harness changes alone -- without model upgrades -- can significantly improve benchmark performance. Shows specific harness modifications (better tool descriptions, structured output formats, retry logic, context management) and their measured impact on coding benchmarks. The strongest argument for investing in harness engineering over waiting for better models.

Evals & Observability Knowledge Base Summary

This page curates 10 key resources on evaluating AI coding agents, organized into three sections: Building Evals, Trace Analysis, and Harness Impact.

Key principles: use deterministic verifiers over LLM-as-judge, separate model quality from harness quality, control for infrastructure variables, invest in harness engineering rather than waiting for better models, grade trajectories not just outcomes.