Agent Benchmarks Catalogue
The model is only half the story. These benchmarks measure what agents can actually do — and how the harness around them shapes performance. Use this catalogue to pick the right eval for your use case.
A catalogue of 39 agent benchmarks organized by category: Coding, Web, Desktop, Multi-Agent, MCP, Security, Planning, and General, plus a short list of learning resources.
Coding Benchmarks
SWE-bench Verified - Real GitHub issues resolved by agents, with test verification (swebench.com); see the harness sketch after this list
EvoClaw - Continuous software evolution benchmark (openhands.dev)
LeetCode-Hard Gym - RL environment for code generation agents (github.com/GammaTauAI)
Terminal-Bench - Terminal-native agent evaluation (tbench.ai)
Terminal-Bench 2.0 - Harder terminal tasks with Harbor integration (tbench.ai)
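Most of these ship a reference harness. As a concrete example, here is how agent patches are scored with the SWE-bench evaluation harness: a minimal sketch, assuming `pip install swebench` and a running Docker daemon; the instance id, agent name, and patch below are placeholders.

```python
# Minimal sketch of scoring agent patches with the SWE-bench harness.
# Assumes `pip install swebench` and a running Docker daemon.
import json

# One prediction per resolved issue; all three values here are placeholders.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",        # a SWE-bench Verified instance
        "model_name_or_path": "my-agent",               # hypothetical agent name
        "model_patch": "diff --git a/x.py b/x.py\n...", # the agent's unified diff
    }
]
with open("preds.json", "w") as f:
    json.dump(predictions, f)

# The harness then replays each patch in a container and runs the tests:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.json \
#       --max_workers 4 --run_id demo
```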
Web Benchmarks
WebArena - Self-hostable web agent evaluation (webarena.dev)
WebArena-Verified - Verified subset of WebArena tasks
VisualWebArena - Multimodal web agent tasks
BrowserGym - Web navigation environments and leaderboard by ServiceNow; see the loop sketch after this list
BrowseComp - Hard-to-find information retrieval benchmark
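BrowserGym exposes web tasks through a standard Gym-style observe/act loop. A minimal sketch, assuming `pip install browsergym` plus the MiniWoB++ setup described in the BrowserGym README; the task name and action string are illustrative.

```python
# Sketch of the observe/act loop used by BrowserGym-style web benchmarks.
# Assumes `pip install browsergym` and the MiniWoB++ setup from the README.
import gymnasium as gym
import browsergym.miniwob  # noqa: F401  (import registers browsergym/miniwob.* envs)

env = gym.make("browsergym/miniwob.click-test")
obs, info = env.reset()

# Actions are strings in BrowserGym's high-level action language; the
# element id below is illustrative and would come from the observation.
obs, reward, terminated, truncated, info = env.step('click("2")')
env.close()
```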
Desktop Benchmarks
OSWorld - 369 real desktop tasks across operating systems; see the interaction sketch after this list
OSWorld-MCP - OSWorld extended with MCP protocol support
AgentStudio - Realistic virtual agent evaluation
Computer Agent Arena - Real-world computer tasks
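Desktop benchmarks like OSWorld wrap a full VM behind a Gym-like interface. A rough sketch following the interaction loop shown in the OSWorld README (details may differ by version); it assumes the repo's `desktop_env` package, a configured VM image, and one of OSWorld's task JSON files, with an illustrative file path.

```python
# Rough sketch of an OSWorld-style loop, following the interface shown in
# the OSWorld README; details may differ by version. Assumes the repo's
# `desktop_env` package and a configured VM image.
import json
from desktop_env.desktop_env import DesktopEnv

# Illustrative path: any of OSWorld's JSON task definitions.
with open("path/to/osworld_task.json") as f:
    task_config = json.load(f)

env = DesktopEnv(action_space="pyautogui")
obs = env.reset(task_config=task_config)

# Agents act by emitting executable pyautogui snippets.
obs, reward, done, info = env.step("pyautogui.rightClick()")
env.close()
```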
Multi-Agent Benchmarks
MAgIC - Multi-agent cognition and collaboration
CharacterEval - Role-playing conversational agents
LLM Colosseum - Street Fighter III agent battles
MCP Benchmarks
MCP Bench - MCP server interaction evaluation; see the server sketch after this list
MCP Universe - MCP task leaderboard
MCPMark - MCP stress-testing with real tools
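All three drive agents against live MCP servers. For orientation, here is a minimal server of the kind such harnesses connect to: a sketch using the official `mcp` Python SDK (`pip install mcp`), with an illustrative server name and tool.

```python
# Minimal MCP server sketch, using the official `mcp` Python SDK
# (`pip install mcp`). The server name and tool are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves over stdio, the transport MCP clients spawn by default
```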
Security Benchmarks
SEC-bench - Security vulnerability detection tasks
Planning Benchmarks
TravelPlanner - Multi-constraint travel planning
Olas Predict - Prediction market agent benchmark
General Benchmarks
GAIA - General AI assistant benchmark hosted on Hugging Face; see the loading sketch after this list
Agent Arena - ELO-style ratings from head-to-head battles
AgentBench - Cross-environment evaluation: OS, databases, KGs, web
AgentBoard - Multi-turn agents with partial-progress visibility
AppWorld - Controllable world with state-based unit tests
AssistantBench - Realistic multi-step research tasks
ClawBench - Search, reasoning, coding, safety, conversation
ClawWork - Economic benchmark across 44 occupations
GTA - Tool-use with real tools and multimodal inputs
HAL - Holistic agent leaderboard measuring reliability and cost
Galileo Agent Leaderboard - Enterprise agent evaluation
VAB - Visual agent benchmark tasks
WildClawBench - Wild environment benchmark with 60 tasks
WorkArena - Enterprise knowledge-work tasks
tau-Bench - Dynamic conversations with API tools
tau2-bench - Multi-step tool use quality
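Several of the general benchmarks distribute their tasks as plain datasets. GAIA, for instance, can be pulled from Hugging Face as sketched below; the dataset is gated, so this assumes you have accepted its terms and run `huggingface-cli login`, and the config, split, and field names follow the dataset card.

```python
# Sketch: loading GAIA tasks. The dataset is gated on Hugging Face, so this
# assumes accepted terms and `huggingface-cli login`. Config, split, and
# field names follow the dataset card.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
for task in gaia.select(range(3)):
    print(task["task_id"], task["Level"], task["Question"][:80])
```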
Learning Resources
learn-harness-engineering - Project-based course on harness engineering (github.com/walkinglabs)