Testing Agent Skills Systematically with Evals
OpenAI's guide to converting agent execution traces into repeatable evaluation tests. It shows how to capture JSONL logs from agent runs, define deterministic checks against expected outcomes, and build regression suites that catch when harness changes break existing capabilities. The key insight: agent evals are just software tests applied to non-deterministic systems; the same discipline applies, with statistical thresholds in place of exact matches.
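A minimal sketch of the trace-to-test idea: a JSONL trace is one JSON object per line, and a deterministic check asserts a property of the recorded run. The field names (`action`, `output`) and the `check_trace` helper are assumptions for illustration, not the guide's actual schema.

```python
import json

# Hypothetical JSONL trace: one JSON object per line, each recording an agent step.
trace_lines = [
    '{"step": 1, "action": "tool_call", "tool": "search"}',
    '{"step": 2, "action": "final_answer", "output": "42"}',
]

def check_trace(lines, expected_final="42"):
    """Deterministic check: the run must end in a final_answer with the expected output."""
    events = [json.loads(line) for line in lines]
    final = events[-1]
    return final.get("action") == "final_answer" and final.get("output") == expected_final

assert check_trace(trace_lines)
```

Because the check reads only the logged trace, it can be replayed against any future harness version as a regression test.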
Key Takeaways
- Capture JSONL traces from agent runs
- Define deterministic checks against expected outcomes
- Build regression suites for harness changes
- Statistical thresholds replace exact match assertions