Agent Evaluation
Agent evaluation tests an AI system that plans and calls tools across many steps, where one wrong action can cause real-world harm.

Agent evaluation is the practice of testing an AI system that can plan, decide, and call tools to act in the world. It goes beyond checking whether a single model reply is correct. It checks whether the agent chose the right actions, in the right order, with the right inputs, and reached the goal without causing harm along the way. Because an agent can book a flight, send an email, or run a command, one wrong tool call can produce a real consequence, not a wrong sentence.
A real-world analogy
Testing a plain language model is like grading a driving exam on paper. You ask questions and score the answers. Testing an agent is like putting a new driver on real roads in traffic. You watch the whole route: every lane change, every signal, every stop. A driver can still arrive at the destination while running a red light. An agent can still return the right final answer while deleting a file, calling the wrong endpoint, or spending money it should not have. The route matters as much as the arrival.
What makes agents harder to evaluate
Agents raise problems that do not appear when you score a single response. See AI evaluation for the broader parent discipline.
- Multi-step trajectories. An agent runs a loop of reason, act, observe. You need to judge the path it took: which tools it called, the arguments it passed, the intermediate reasoning, and how it recovered from a failed step. This path is the agentic loop .
- Tool access with real effects. A model that writes a wrong sentence is annoying. An agent that calls the wrong tool can move money or change data. Evaluation must catch unsafe actions, not just wrong words.
- Non-determinism. Agents behave probabilistically. Running a task once and passing it does not mean it will pass again. Reliability testing repeats the same task many times and asks how often the agent succeeds across all attempts, not on a lucky run.
- Cascading errors. A single early misstep can propagate. A wrong tool output feeds the next decision, and the agent drifts further from the goal at each step. Good evaluation traces where the failure started.
How it works
Agent evaluation usually combines several views of the same run.
Teams run these checks against benchmarks built for agents. Public suites include tau-bench, which measures whether an agent completes multi-step retail and airline tasks and repeats them reliably, AgentBench, which spans environments such as web browsing and databases, WebArena for web navigation, and the Berkeley Function-Calling Leaderboard, which standardises how well a model calls functions. A recent survey frames the shift well: agent evaluation assesses the whole car under many driving conditions, not just the engine.
Adversarial and red-team angles
Because agents act, they are a target. An attacker can hide instructions in a web page, a document, or a tool response to hijack the agent’s plan. This is a prompt-injection risk unique to systems that read untrusted content and then take action. Red teaming an agent means probing for these paths on purpose: can a malicious input make the agent exfiltrate data, call a destructive tool, or ignore its guardrails. Reliability testing asks how often the agent fails. Red teaming asks how it can be made to fail.
How it connects to related concepts
Agent evaluation is a branch of AI evaluation , the parent discipline that also covers model and system testing. It focuses on AI agents , the systems that plan and act, and on the agentic loop they run through. When an agent follows a fixed, scripted sequence rather than free-form planning, the narrower discipline of workflow evaluation applies. And because agents take real actions, red teaming is a core part of evaluating them safely.
Further reading
- AI evaluation : the parent discipline that covers model, system, and agent testing
- AI agents : what an AI agent is and how it plans and acts
- Agentic loops : the reason, act, observe cycle an agent runs
- Workflow evaluation : testing agents that follow a fixed, scripted sequence
- Red teaming : probing an AI system for unsafe or exploitable behaviour
- A Survey on Evaluation of LLM-based Agents : academic overview of agent evaluation dimensions and methods
- Evaluation and Benchmarking of LLM Agents: A Survey : survey covering benchmarks such as tau-bench, AgentBench, and WebArena
- Berkeley Function-Calling Leaderboard : standard benchmark for measuring how well models call tools