Agent Benchmarks Catalogue
The model is only half the story. These benchmarks measure what agents can actually do — and how the harness around them shapes performance. Use this catalogue to pick the right eval for your use case.
A catalogue of 39 agent benchmarks organized by category: Coding, Web, Desktop, Multi-Agent, MCP, Security, Planning, and General, plus a short list of learning resources.
Coding Benchmarks
SWE-bench Verified - Real GitHub issues resolved by agents, with test verification (swebench.com); see the harness sketch after this list
EvoClaw - Continuous software evolution benchmark (openhands.dev)
LeetCode-Hard Gym - RL environment for code generation agents (github.com/GammaTauAI)
Terminal-Bench - Terminal-native agent evaluation (tbench.ai)
Terminal-Bench 2.0 - Harder terminal tasks with Harbor integration (tbench.ai)
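Most of these ship a reference harness. As a concrete example, here is how agent patches are scored with the SWE-bench evaluation harness: a minimal sketch, assuming `pip install swebench` and a running Docker daemon; the instance id, agent name, and patch below are placeholders.

```python
# Minimal sketch of scoring agent patches with the SWE-bench harness.
# Assumes `pip install swebench` and a running Docker daemon.
import json

# One prediction per resolved issue; all three values here are placeholders.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",        # a SWE-bench Verified instance
        "model_name_or_path": "my-agent",               # hypothetical agent name
        "model_patch": "diff --git a/x.py b/x.py\n...", # the agent's unified diff
    }
]
with open("preds.json", "w") as f:
    json.dump(predictions, f)

# The harness then replays each patch in a container and runs the tests:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.json \
#       --max_workers 4 --run_id demo
```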
Web Benchmarks
WebArena - Self-hostable web agent evaluation (webarena.dev)
WebArena-Verified - Verified subset of WebArena tasks
VisualWebArena - Multimodal web agent tasks
BrowserGym - Web navigation environments and leaderboard by ServiceNow; see the loop sketch after this list
BrowseComp - Hard-to-find information retrieval benchmark
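BrowserGym exposes web tasks through a standard Gym-style observe/act loop. A minimal sketch, assuming `pip install browsergym` plus the MiniWoB++ setup described in the BrowserGym README; the task name and action string are illustrative.

```python
# Sketch of the observe/act loop used by BrowserGym-style web benchmarks.
# Assumes `pip install browsergym` and the MiniWoB++ setup from the README.
import gymnasium as gym
import browsergym.miniwob  # noqa: F401  (import registers browsergym/miniwob.* envs)

env = gym.make("browsergym/miniwob.click-test")
obs, info = env.reset()

# Actions are strings in BrowserGym's high-level action language; the
# element id below is illustrative and would come from the observation.
obs, reward, terminated, truncated, info = env.step('click("2")')
env.close()
```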
Desktop Benchmarks
OSWorld - 369 real desktop tasks across operating systems; see the interaction sketch after this list
OSWorld-MCP - OSWorld extended with MCP protocol support
AgentStudio - Realistic virtual agent evaluation
Computer Agent Arena - Real-world computer tasks
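Desktop benchmarks like OSWorld wrap a full VM behind a Gym-like interface. A rough sketch following the interaction loop shown in the OSWorld README (details may differ by version); it assumes the repo's `desktop_env` package, a configured VM image, and one of OSWorld's task JSON files, with an illustrative file path.

```python
# Rough sketch of an OSWorld-style loop, following the interface shown in
# the OSWorld README; details may differ by version. Assumes the repo's
# `desktop_env` package and a configured VM image.
import json
from desktop_env.desktop_env import DesktopEnv

# Illustrative path: any of OSWorld's JSON task definitions.
with open("path/to/osworld_task.json") as f:
    task_config = json.load(f)

env = DesktopEnv(action_space="pyautogui")
obs = env.reset(task_config=task_config)

# Agents act by emitting executable pyautogui snippets.
obs, reward, done, info = env.step("pyautogui.rightClick()")
env.close()
```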
Multi-Agent Benchmarks
MAgIC - Multi-agent cognition and collaboration
CharacterEval - Role-playing conversational agents
LLM Colosseum - Street Fighter III agent battles
MCP Benchmarks
MCP Bench - MCP server interaction evaluation; see the server sketch after this list
MCP Universe - MCP task leaderboard
MCPMark - MCP stress-testing with real tools
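All three drive agents against live MCP servers. For orientation, here is a minimal server of the kind such harnesses connect to: a sketch using the official `mcp` Python SDK (`pip install mcp`), with an illustrative server name and tool.

```python
# Minimal MCP server sketch, using the official `mcp` Python SDK
# (`pip install mcp`). The server name and tool are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves over stdio, the transport MCP clients spawn by default
```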
Security Benchmarks
SEC-bench - Security vulnerability detection tasks
Planning Benchmarks
TravelPlanner - Multi-constraint travel planning
Olas Predict - Prediction market agent benchmark
General Benchmarks
GAIA - General AI assistant benchmark hosted on Hugging Face; see the loading sketch after this list
Agent Arena - ELO-style ratings from head-to-head battles
AgentBench - Cross-environment evaluation: OS, databases, KGs, web
AgentBoard - Multi-turn agents with partial-progress visibility
AppWorld - Controllable world with state-based unit tests
AssistantBench - Realistic multi-step research tasks
ClawBench - Search, reasoning, coding, safety, conversation
ClawWork - Economic benchmark across 44 occupations
GTA - Tool-use with real tools and multimodal inputs
HAL - Holistic agent leaderboard measuring reliability and cost
Galileo Agent Leaderboard - Enterprise agent evaluation
VAB - Visual agent benchmark tasks
WildClawBench - Wild environment benchmark with 60 tasks
WorkArena - Enterprise knowledge-work tasks
tau-Bench - Dynamic conversations with API tools
tau2-bench - Multi-step tool use quality
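Several of the general benchmarks distribute their tasks as plain datasets. GAIA, for instance, can be pulled from Hugging Face as sketched below; the dataset is gated, so this assumes you have accepted its terms and run `huggingface-cli login`, and the config, split, and field names follow the dataset card.

```python
# Sketch: loading GAIA tasks. The dataset is gated on Hugging Face, so this
# assumes accepted terms and `huggingface-cli login`. Config, split, and
# field names follow the dataset card.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
for task in gaia.select(range(3)):
    print(task["task_id"], task["Level"], task["Question"][:80])
```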
Learning Resources
learn-harness-engineering - Project-based course on harness engineering (github.com/walkinglabs)