Where LLM Agents Fail and How They Can Learn From Failures
TL;DR
Large-language-model agents can fail in a cascade: one mistake triggers a chain of errors that ruins the whole task. The authors built AgentErrorTaxonomy to classify these failures, collected a benchmark of real-world error traces (AgentErrorBench), and released AgentDebug, a lightweight framework that pinpoints the root cause and feeds corrective suggestions back to the agent. On three diverse domains, the researchers report that AgentDebug lifts overall task success by up to 26% and improves step-level accuracy by 17%, making it a practical tool for any engineer who wants reliable, self-healing LLM agents.
What's New
Traditional LLM agents combine planning, memory, reflection, and tool use, but they lack a systematic way to diagnose why a single slip turns into a full-blown failure. Existing debugging attempts treat errors as black boxes, offering little insight into the underlying module that went wrong.
The authors introduce a modular failure taxonomy — memory, reflection, planning, action, and system-level errors — that lets engineers trace a failure back to its origin. [Figure 1] shows how a single error can cascade through an agent's workflow and how AgentDebug corrects the flow. They then annotate thousands of agent rollouts from ALFWorld, GAIA, and WebShop to create AgentErrorBench, the first dataset that links concrete error trajectories to specific taxonomy labels. [Figure 2] outlines the benchmark construction pipeline. Finally, AgentDebug wraps an agent's execution loop: it monitors each step, matches observed symptoms to taxonomy entries, and generates targeted feedback that the agent can use in subsequent iterations.
How It Works
- Error Taxonomy — The taxonomy defines five orthogonal failure modes. Memory errors capture lost or corrupted state; reflection errors involve mis-judging past actions; planning errors occur when the agent's future-step generator goes awry; action errors arise during tool invocation; system errors cover infrastructure or resource failures. This structure lets developers pinpoint whether a problem is, for example, "the agent forgot the target location" (memory) or "it mis-interpreted the tool's output" (action). [Figure 3] maps how these error types distribute across agent execution steps.
- Benchmark Construction — Agents are run on three simulated environments: ALFWorld (household tasks), GAIA (general-assistant tasks involving web and tool use), and WebShop (e-commerce). Every step is logged, and human annotators tag the root cause using the taxonomy. The resulting dataset contains thousands of failure trajectories that reflect realistic, multi-step errors.
- Debugging Loop — AgentDebug hooks into the agent's planning-reflection-action cycle. At each step, it checks for signatures of taxonomy-defined failures (e.g., missing memory keys, inconsistent tool outputs). When a failure is detected, AgentDebug crafts a concise corrective message — such as "reload the last known position" or "re-parse the tool response" — and injects it into the agent's next prompt. [Figure 4] illustrates the three-stage process: fine-grained analysis, critical error detection, and iterative debugging with actionable feedback.
The modularity of the taxonomy means that adding a new failure type only requires extending the detector, not redesigning the whole system.
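The detect-and-inject pattern described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the detector signatures, hint phrasings, and function names (`detect_failure`, `debug_step`) are assumptions chosen to mirror the examples in the text.

```python
from enum import Enum, auto
from typing import Optional

class FailureMode(Enum):
    """The five module-level failure modes from the taxonomy."""
    MEMORY = auto()
    REFLECTION = auto()
    PLANNING = auto()
    ACTION = auto()
    SYSTEM = auto()

# Illustrative mapping from failure mode to the corrective hint that
# gets injected into the agent's next prompt.
CORRECTIVE_HINTS = {
    FailureMode.MEMORY: "Reload the last known state before acting.",
    FailureMode.REFLECTION: "Re-evaluate the outcome of the previous action.",
    FailureMode.PLANNING: "Revise the plan from the last verified step.",
    FailureMode.ACTION: "Re-parse the tool response and retry the call.",
    FailureMode.SYSTEM: "Retry after checking tool and resource availability.",
}

def detect_failure(step: dict) -> Optional[FailureMode]:
    """Toy detector: match step symptoms to taxonomy entries.

    A real detector would inspect the whole trajectory; here we only
    check the two illustrative signatures mentioned in the text.
    """
    if step.get("memory_key_missing"):
        return FailureMode.MEMORY
    if step.get("tool_output_inconsistent"):
        return FailureMode.ACTION
    return None

def debug_step(step: dict, next_prompt: str) -> str:
    """If a failure is detected, append a corrective hint to the next prompt."""
    mode = detect_failure(step)
    if mode is None:
        return next_prompt
    return f"{next_prompt}\n[AgentDebug] {CORRECTIVE_HINTS[mode]}"
```

Extending the taxonomy in this sketch means adding one enum member, one hint, and one detector branch — the loop itself is untouched, which is the modularity argument above.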
Results
- All-Correct Accuracy — According to the paper, AgentDebug raises overall task success by 24% over the strongest baseline, as shown in [Figure 5].
- Step Accuracy — Per-step correctness improves by 17%, indicating that the agent corrects itself more often during execution.
- Domain-Specific Gains — Across ALFWorld, GAIA, and WebShop, the framework delivers up to 26% relative improvement in final task success ([Figure 6]).
These gains come without any architectural overhaul of the underlying LLM; AgentDebug is a lightweight wrapper that can be applied to existing agents.
Try It
- Clone the Repository — The code and benchmark data are available at https://github.com/ulab-uiuc/AgentDebug.
- Environment Setup — Install the required Python packages listed in requirements.txt.
- Run a Baseline Agent — Use the provided scripts to launch an LLM agent on ALFWorld, GAIA, or WebShop.
- Wrap with AgentDebug — Import the AgentDebug wrapper and pass your agent instance to it. The wrapper automatically injects failure detection and corrective prompts.
- Evaluate — Compare the all_correct and step_accuracy metrics against the baseline logs. The repository includes evaluation scripts that reproduce the paper's results.
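Conceptually, wrapping an agent looks like the sketch below. The class and method names here (`AgentDebugWrapper`, `act`) are hypothetical stand-ins, not the repository's actual API — check the repo's README for the real entry points.

```python
class EchoAgent:
    """Stand-in for a real LLM agent: just acts on the prompt it receives."""
    def act(self, prompt: str) -> str:
        return f"acting on: {prompt}"

class AgentDebugWrapper:
    """Hypothetical wrapper illustrating the monitor-and-inject pattern.

    The real AgentDebug interface may differ; this only shows where
    failure detection and corrective prompts would slot into the loop.
    """
    def __init__(self, agent):
        self.agent = agent
        self.feedback = None  # corrective hint carried into the next step

    def act(self, prompt: str) -> str:
        # Inject any pending corrective feedback into the prompt.
        if self.feedback:
            prompt = f"{prompt}\n[AgentDebug] {self.feedback}"
            self.feedback = None
        result = self.agent.act(prompt)
        # A real implementation would run taxonomy-based detectors on
        # `result` here and set self.feedback when a signature matches.
        return result

agent = AgentDebugWrapper(EchoAgent())
print(agent.act("pick up the mug"))  # → acting on: pick up the mug
```

Because the wrapper only intercepts prompts and responses, it can sit around any agent with a compatible step interface — which is the sense in which the paper calls AgentDebug a lightweight wrapper.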
Limitations & Open Questions
- Scope of Taxonomy — The current taxonomy covers five failure modes; more nuanced errors (e.g., subtle semantic misunderstandings) may slip through.
- Generalization to Other Domains — While the benchmark spans three domains, it remains unclear how well AgentDebug transfers to completely new tasks or tool sets.
- Feedback Loop Stability — Repeated corrective prompts could, in theory, lead to oscillation or over-correction; the authors note that careful tuning of feedback strength is needed.