Evals & Observability

If you can't measure it, you can't improve it. These articles cover how to evaluate agent performance, grade execution traces, and use observability to drive harness improvements.

Building Evals

Testing Agent Skills Systematically with Evals

OpenAI

OpenAI's guide to converting agent execution traces into repeatable evaluation tests. Shows how to capture JSONL logs from agent runs, define deterministic checks against expected outcomes, and build regression suites that catch when harness changes break existing capabilities. The key insight: agent evals are just software tests applied to non-deterministic systems -- the same discipline applies, with statistical thresholds instead of exact matches.
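The trace-to-regression-test pattern the guide describes can be sketched in a few lines of Python. The JSONL record fields (`task`, `output`, `expected`) are hypothetical illustrations, not OpenAI's actual schema:

```python
import json

# Hypothetical trace format: one JSON object per agent run, recording
# the task id, the agent's final output, and the expected outcome.
SAMPLE_TRACES = [
    '{"task": "rename-fn", "output": "done", "expected": "done"}',
    '{"task": "rename-fn", "output": "done", "expected": "done"}',
    '{"task": "rename-fn", "output": "error", "expected": "done"}',
]

def pass_rate(jsonl_lines):
    """Deterministic per-run check, aggregated into a pass rate."""
    results = [json.loads(line) for line in jsonl_lines]
    passed = sum(1 for r in results if r["output"] == r["expected"])
    return passed / len(results)

# Non-deterministic system, so the regression gate is a statistical
# threshold, not an exact match on every individual run.
assert pass_rate(SAMPLE_TRACES) >= 0.6
```

A harness change that drops the pass rate below the threshold fails the suite, which is exactly the regression signal the guide is after.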

How to Evaluate Agent Skills (And Why You Should)

OpenHands

Hands-on playbook for measuring whether a specific skill actually improves agent performance. Uses bounded tasks (clear start/end), deterministic verifiers (not LLM-as-judge), no-skill baselines for comparison, and trace review to understand why things fail. Makes the important distinction between "the agent completed the task" and "the agent completed the task reliably."
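The playbook's core comparison can be sketched as follows, assuming a hypothetical run-record format where each run carries an `exit_code` from a deterministic verifier (for example, the task's test suite):

```python
def success_rate(runs, verifier):
    """Fraction of runs accepted by a deterministic verifier."""
    return sum(1 for r in runs if verifier(r)) / len(runs)

def skill_lift(runs_with_skill, runs_baseline, verifier):
    """'Completed the task' vs. 'completed it reliably': compare
    success rates over repeated runs against a no-skill baseline."""
    return success_rate(runs_with_skill, verifier) - success_rate(runs_baseline, verifier)

# Hypothetical verifier: a run passes iff its recorded exit code is 0
# (e.g. the test suite exited cleanly). Same input, same verdict --
# no LLM-as-judge involved.
def verifier(run):
    return run["exit_code"] == 0

with_skill = [{"exit_code": 0}] * 9 + [{"exit_code": 1}]
baseline = [{"exit_code": 0}] * 6 + [{"exit_code": 1}] * 4
lift = skill_lift(with_skill, baseline, verifier)
```

Running the task many times with and without the skill is what turns "it worked once" into a reliability claim.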

Agent Evals

OpenAI

Product documentation for measuring agent quality with reproducible task-level and workflow-level evaluations. Covers metric selection, test harness setup, and how to separate model quality from harness quality in eval results.

Evaluation Best Practices

OpenAI

Building eval suites that match real-world distributions and catch regressions early. Covers sample selection, avoiding distribution mismatch between evals and production, and iterative refinement of eval criteria as the system evolves.

Trace Analysis

Trace Grading

OpenAI

Grading agent traces directly rather than just final outcomes. Especially valuable for long multi-step tasks where the final result might be correct but the path was inefficient or dangerous. Covers trajectory scoring, step-level grading, and how to identify harness improvements from trace analysis.
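Step-level grading alongside outcome grading might look like the sketch below. The trajectory format (a list of `(action, ok)` pairs) and the scoring rubric are assumptions for illustration, not the article's actual method:

```python
def grade_trajectory(steps, max_steps=10):
    """Grade the path, not just the destination: a correct final result
    reached via failed or redundant steps still signals a harness problem."""
    step_scores = [1.0 if ok else 0.0 for _, ok in steps]
    step_quality = sum(step_scores) / len(step_scores)  # step-level grade
    efficiency = min(1.0, max_steps / len(steps))       # penalize long paths
    final_ok = steps[-1][1]                             # outcome-only grade
    return {"final_ok": final_ok, "step_quality": step_quality, "efficiency": efficiency}

run = [("read_file", True), ("edit_file", False), ("edit_file", True), ("run_tests", True)]
grade = grade_trajectory(run)
# The outcome grade alone says "pass"; the step-level grade of 0.75
# surfaces the failed edit that an outcome-only eval would hide.
```

Aggregating step-level scores across many traces is what points to specific harness fixes (e.g. a tool whose calls fail disproportionately often).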

Learning to Verify AI-Generated Code

OpenHands

A layered verification stack using trajectory critics trained on production traces. These critics enable reranking (pick the best among multiple attempts), early stopping (abort doomed runs), and review-time quality control (flag suspicious patterns before human review). Represents the cutting edge of automated verification for code-producing agents.
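The reranking and early-stopping uses of a critic can be sketched as below, with a hypothetical critic that scores a trace by its fraction of successful steps (the real critics are trained models, not heuristics):

```python
def rerank(attempts, critic):
    """Best-of-n: score each complete trajectory with the critic,
    keep the highest-scoring attempt."""
    return max(attempts, key=critic)

def should_stop_early(partial_score, floor=0.2):
    """Early stopping: abort a run the critic already scores as doomed."""
    return partial_score < floor

# Hypothetical stand-in critic: fraction of steps that succeeded.
def critic(trace):
    return sum(1 for ok in trace if ok) / len(trace)

attempts = [[True, False, True], [True, True, True], [False, False, True]]
best = rerank(attempts, critic)  # picks the all-passing attempt
```

The third use, review-time flagging, is the same critic applied once more to the chosen trajectory before a human ever sees it.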

Demystifying Evals for AI Agents

Anthropic

What to measure when agents have many possible successful trajectories. Unlike traditional software tests with one correct answer, agent evals must handle path diversity. Covers outcome-based vs. trajectory-based evaluation, the role of human judgment in eval design, and how to avoid gaming metrics with superficially correct but brittle solutions.

Harness Impact

Quantifying Infrastructure Noise in Agentic Coding Evals

Anthropic

A critical finding: runtime configuration (timeout values, memory limits, network latency, container image versions) can shift coding benchmark scores by margins larger than the gaps separating models on public leaderboards. This means apparent model differences are often harness differences in disguise. The implication: when comparing agents, you must control for infrastructure variables, or your conclusions are meaningless.
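One practical response is to fingerprint the runtime configuration and store it alongside every score, so results produced under different infrastructure are never pooled. A minimal sketch, with hypothetical config fields:

```python
import hashlib
import json

def eval_fingerprint(config):
    """Hash a canonical serialization of the runtime configuration so
    scores from different harness setups can't be silently compared."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

config_a = {"timeout_s": 300, "memory_mb": 4096, "image": "runner:1.2"}
config_b = {"timeout_s": 600, "memory_mb": 4096, "image": "runner:1.2"}

# Changing only the timeout yields a different fingerprint -- these two
# runs belong to different experiments, not the same leaderboard row.
assert eval_fingerprint(config_a) != eval_fingerprint(config_b)
```

`sort_keys=True` makes the serialization order-independent, so two dicts with the same settings always fingerprint identically.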

Evaluating Deep Agents: Our Learnings

LangChain

Practical breakdown of three eval levels: single-step (did this one action work?), full-run (did the entire task complete?), and multi-turn (does the agent maintain quality across a conversation?). Each level requires different infrastructure and reveals different types of failures.
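The three levels can be sketched as separate check functions; the record fields here are hypothetical, not LangChain's actual schema:

```python
def single_step_eval(step):
    """Single-step: did this one tool call produce the expected result?"""
    return step["result"] == step["expected"]

def full_run_eval(run):
    """Full-run: did the entire task reach the goal state?"""
    return run["final_state"] == run["goal_state"]

def multi_turn_eval(runs, threshold=0.8):
    """Multi-turn: does quality hold up across a whole conversation,
    not just on any individual task?"""
    passes = sum(1 for r in runs if full_run_eval(r))
    return passes / len(runs) >= threshold

conversation = [{"final_state": "done", "goal_state": "done"}] * 4 \
             + [{"final_state": "error", "goal_state": "done"}]
```

A failure at one level with passes at another is itself diagnostic: single-step failures point at tools, full-run failures at planning, multi-turn failures at context management.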

Improving Deep Agents with Harness Engineering

LangChain

Evidence that harness changes alone -- without model upgrades -- can significantly improve benchmark performance. Shows specific harness modifications (better tool descriptions, structured output formats, retry logic, context management) and their measured impact on coding benchmarks. The strongest argument for investing in harness engineering over waiting for better models.

Evals & Observability Knowledge Base Summary

This page curates 10 key resources on evaluating AI coding agents, organized into three sections: Building Evals, Trace Analysis, and Harness Impact.

Key principles: use deterministic verifiers over LLM-as-judge, separate model quality from harness quality, control for infrastructure variables, invest in harness engineering rather than waiting for better models, grade trajectories not just outcomes.