Agent Benchmarks Catalogue

The model is only half the story. These benchmarks measure what agents can actually do — and how the harness around them shapes performance. Use this catalogue to pick the right eval for your use case.

40 benchmarks, 9 categories

A searchable catalogue of 40 agent benchmarks organized by category: Coding, Web, Desktop, Multi-Agent, MCP, Security, Planning, and General.

Coding Benchmarks

Web Benchmarks

Desktop Benchmarks

Multi-Agent Benchmarks

MCP Benchmarks

Security Benchmarks

Planning Benchmarks

General Benchmarks
