Where LLM Agents Fail and How They Can Learn From Failures
TL;DR
Large-language-model agents can fail in a cascade: one mistake triggers a chain of errors that ruins the whole task. The authors built AgentErrorTaxonomy to classify these failures, collected a benchmark of real-world error traces (AgentErrorBench), and released AgentDebug, a lightweight framework that pinpoints the root cause and feeds corrective suggestions back to the agent. On three diverse domains, the researchers report that AgentDebug lifts overall task success by up to 26% and improves step-level accuracy by 17%, making it a practical tool for any engineer who wants reliable, self-healing LLM agents.
What's New
Traditional LLM agents combine planning, memory, reflection, and tool use, but they lack a systematic way to diagnose why a single slip turns into a full-blown failure. Existing debugging attempts treat errors as black boxes, offering little insight into the underlying module that went wrong.
The authors introduce a modular failure taxonomy — memory, reflection, planning, action, and system-level errors — that lets engineers trace a failure back to its origin. [Figure 1] shows how a single error can cascade through an agent's workflow and how AgentDebug corrects the flow. They then annotate thousands of agent rollouts from ALFWorld, GAIA, and WebShop to create AgentErrorBench, the first dataset that links concrete error trajectories to specific taxonomy labels. [Figure 2] outlines the benchmark construction pipeline. Finally, AgentDebug wraps an agent's execution loop: it monitors each step, matches observed symptoms to taxonomy entries, and generates targeted feedback that the agent can use in subsequent iterations.
How It Works
- Error Taxonomy — The taxonomy defines five orthogonal failure modes. Memory errors capture lost or corrupted state; reflection errors involve mis-judging past actions; planning errors occur when the agent's future-step generator goes awry; action errors arise during tool invocation; system errors cover infrastructure or resource failures. This structure lets developers pinpoint whether a problem is, for example, "the agent forgot the target location" (memory) or "it mis-interpreted the tool's output" (action). [Figure 3] maps how these error types distribute across agent execution steps.
- Benchmark Construction — Agents are run on three simulated environments: ALFWorld (household tasks), GAIA (general-assistant tasks involving web and tool use), and WebShop (e-commerce). Every step is logged, and human annotators tag the root cause using the taxonomy. The resulting dataset contains thousands of failure trajectories that reflect realistic, multi-step errors.
- Debugging Loop — AgentDebug hooks into the agent's planning-reflection-action cycle. At each step, it checks for signatures of taxonomy-defined failures (e.g., missing memory keys, inconsistent tool outputs). When a failure is detected, AgentDebug crafts a concise corrective message — such as "reload the last known position" or "re-parse the tool response" — and injects it into the agent's next prompt. [Figure 4] illustrates the three-stage process: fine-grained analysis, critical error detection, and iterative debugging with actionable feedback.
The modularity of the taxonomy means that adding a new failure type only requires extending the detector, not redesigning the whole system.
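The detect-and-inject pattern described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the detector signatures, hint phrasings, and function names (`detect_failure`, `debug_step`) are assumptions chosen to mirror the examples in the text.

```python
from enum import Enum, auto
from typing import Optional

class FailureMode(Enum):
    """The five module-level failure modes from the taxonomy."""
    MEMORY = auto()
    REFLECTION = auto()
    PLANNING = auto()
    ACTION = auto()
    SYSTEM = auto()

# Illustrative mapping from failure mode to the corrective hint that
# gets injected into the agent's next prompt.
CORRECTIVE_HINTS = {
    FailureMode.MEMORY: "Reload the last known state before acting.",
    FailureMode.REFLECTION: "Re-evaluate the outcome of the previous action.",
    FailureMode.PLANNING: "Revise the plan from the last verified step.",
    FailureMode.ACTION: "Re-parse the tool response and retry the call.",
    FailureMode.SYSTEM: "Retry after checking tool and resource availability.",
}

def detect_failure(step: dict) -> Optional[FailureMode]:
    """Toy detector: match step symptoms to taxonomy entries.

    A real detector would inspect the whole trajectory; here we only
    check the two illustrative signatures mentioned in the text.
    """
    if step.get("memory_key_missing"):
        return FailureMode.MEMORY
    if step.get("tool_output_inconsistent"):
        return FailureMode.ACTION
    return None

def debug_step(step: dict, next_prompt: str) -> str:
    """If a failure is detected, append a corrective hint to the next prompt."""
    mode = detect_failure(step)
    if mode is None:
        return next_prompt
    return f"{next_prompt}\n[AgentDebug] {CORRECTIVE_HINTS[mode]}"
```

Extending the taxonomy in this sketch means adding one enum member, one hint, and one detector branch — the loop itself is untouched, which is the modularity argument above.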
Results
- All-Correct Accuracy — According to the paper, AgentDebug raises overall task success by 24% over the strongest baseline, as shown in [Figure 5].
- Step Accuracy — Per-step correctness improves by 17%, indicating that the agent corrects itself more often during execution.
- Domain-Specific Gains — Across ALFWorld, GAIA, and WebShop, the framework delivers up to 26% relative improvement in final task success ([Figure 6]).
These gains come without any architectural overhaul of the underlying LLM; AgentDebug is a lightweight wrapper that can be applied to existing agents.
Try It
- Clone the Repository — The code and benchmark data are available at https://github.com/ulab-uiuc/AgentDebug.
- Environment Setup — Install the required Python packages listed in requirements.txt.
- Run a Baseline Agent — Use the provided scripts to launch an LLM agent on ALFWorld, GAIA, or WebShop.
- Wrap with AgentDebug — Import the AgentDebug wrapper and pass your agent instance to it. The wrapper automatically injects failure detection and corrective prompts.
- Evaluate — Compare the all_correct and step_accuracy metrics against the baseline logs. The repository includes evaluation scripts that reproduce the paper's results.
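Conceptually, wrapping an agent looks like the sketch below. The class and method names here (`AgentDebugWrapper`, `act`) are hypothetical stand-ins, not the repository's actual API — check the repo's README for the real entry points.

```python
class EchoAgent:
    """Stand-in for a real LLM agent: just acts on the prompt it receives."""
    def act(self, prompt: str) -> str:
        return f"acting on: {prompt}"

class AgentDebugWrapper:
    """Hypothetical wrapper illustrating the monitor-and-inject pattern.

    The real AgentDebug interface may differ; this only shows where
    failure detection and corrective prompts would slot into the loop.
    """
    def __init__(self, agent):
        self.agent = agent
        self.feedback = None  # corrective hint carried into the next step

    def act(self, prompt: str) -> str:
        # Inject any pending corrective feedback into the prompt.
        if self.feedback:
            prompt = f"{prompt}\n[AgentDebug] {self.feedback}"
            self.feedback = None
        result = self.agent.act(prompt)
        # A real implementation would run taxonomy-based detectors on
        # `result` here and set self.feedback when a signature matches.
        return result

agent = AgentDebugWrapper(EchoAgent())
print(agent.act("pick up the mug"))  # → acting on: pick up the mug
```

Because the wrapper only intercepts prompts and responses, it can sit around any agent with a compatible step interface — which is the sense in which the paper calls AgentDebug a lightweight wrapper.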
Limitations & Open Questions
- Scope of Taxonomy — The current taxonomy covers five failure modes; more nuanced errors (e.g., subtle semantic misunderstandings) may slip through.
- Generalization to Other Domains — While the benchmark spans three domains, it remains unclear how well AgentDebug transfers to completely new tasks or tool sets.
- Feedback Loop Stability — Repeated corrective prompts could, in theory, lead to oscillation or over-correction; the authors note that careful tuning of feedback strength is needed.