Demystifying Failures in LLM Multi-Agent Systems: Who Dropped the Ball and When?
LLM-powered multi-agent systems promise efficient collaboration on complex tasks, but they frequently fail—and diagnosing the root cause can feel like searching for a needle in a haystack. Developers are left sifting through massive interaction logs, trying to figure out which agent caused the failure and at what step. To tackle this, researchers from Penn State University, Duke University, and collaborators including Google DeepMind introduced the problem of automated failure attribution. They built the first benchmark dataset, called Who&When, and developed several methods to pinpoint blame automatically. This work, accepted as a Spotlight at ICML 2025, is fully open-source. Below, we explore the key questions around this research.
What Is Automated Failure Attribution and Why Is It Needed?
Automated failure attribution is the task of identifying which agent in a multi-agent system was responsible for a task failure and at which step the error occurred. In LLM-driven multi-agent systems, agents communicate, delegate subtasks, and synthesize results. A single miscommunication, a faulty reasoning step, or an incorrect output from one agent can cascade into a complete system failure. Currently, developers rely on manual log archaeology—reading through lengthy conversational histories—and deep domain expertise to find the culprit. This process is slow, error-prone, and does not scale as systems grow. Automated attribution aims to replace this tedious manual debugging with a systematic, efficient solution, accelerating system iteration and improving reliability. Without it, developers waste hours or days on each failure, hindering the deployment of multi-agent systems in real-world applications.
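To make the problem statement concrete, here is a minimal Python sketch of the task's inputs and outputs. The types and field names are our illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One entry in a multi-agent conversation log (illustrative schema)."""
    index: int    # position in the interaction history
    agent: str    # name of the agent that produced this message
    content: str  # the message or action itself

@dataclass
class Attribution:
    """The output of failure attribution: who failed, and when."""
    agent: str    # the responsible agent ("Who")
    step: int     # the step where the error originated ("When")

def attribute_failure(log: list[Step]) -> Attribution:
    """Placeholder: map a failure log to the responsible agent
    and the decisive error step."""
    raise NotImplementedError
```

Any attribution method, however sophisticated, ultimately implements this mapping from a failure log to a (Who, When) pair.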

How Does the Who&When Benchmark Work?
The Who&When dataset is the first benchmark designed specifically for automated failure attribution in LLM multi-agent systems. It contains 702 failure cases across 18 diverse multi-agent architectures performing tasks like question answering, code generation, and creative writing. Each failure case comes with ground-truth labels: the responsible agent (Who) and the step (When) where the failure originated. Researchers compiled these cases by running multi-agent systems and manually annotating the logs. The dataset spans different agent topologies (e.g., hierarchical, decentralized) and failure types (e.g., logical errors, factual mistakes, miscommunications). This variety ensures that attribution methods are tested on realistic and challenging scenarios. The benchmark also includes evaluation metrics to measure how accurately a method identifies both the agent and the step, providing a standard for comparing future attribution techniques.
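To give a feel for the data, here is a hedged sketch of reading one annotated failure case. The file name and the field names (history, mistake_agent, mistake_step) are plausible guesses rather than the repository's confirmed schema, so check the actual files on Hugging Face before relying on them.

```python
import json

# Hypothetical schema for one annotated failure case; the actual
# field names in the Who&When files may differ.
with open("failure_case_001.json") as f:
    case = json.load(f)

for step in case["history"]:             # the interaction log
    print(step["agent"], "->", step["content"][:60])

print("Who:", case["mistake_agent"])     # ground-truth responsible agent
print("When:", case["mistake_step"])     # ground-truth error step
```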
What Challenges Did Researchers Face in Creating the Dataset?
Building the Who&When dataset involved several obstacles. First, generating realistic failures required carefully designing multi-agent interactions that would go wrong in interpretable ways. The team had to ensure that failures were not trivial (e.g., a single agent outputting gibberish) but reflected common pitfalls like ambiguous delegation or overconfident reasoning. Second, manual annotation demanded deep scrutiny of each conversation log—sometimes hundreds of steps—to pinpoint the exact moment and agent that set off the chain of errors. To maintain consistency, multiple annotators cross-checked labels, which was time-consuming. Third, because multi-agent systems can vary wildly in architecture, the dataset had to cover a broad spectrum of topologies and failure modes to be useful. Balancing coverage with annotator effort was a constant challenge. Despite these difficulties, the final dataset is a rigorous, high-quality resource that lets researchers train and evaluate attribution methods in a controlled setting.
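Cross-checking labels between annotators is, at its core, an agreement computation. The following sketch is our own illustration of that bookkeeping, not the authors' annotation pipeline: it flags cases where two annotators disagree on either the agent or the step, so those cases can be adjudicated.

```python
def find_disagreements(labels_a, labels_b):
    """Compare two annotators' (agent, step) labels per failure case.

    labels_a, labels_b: dict mapping case_id -> (agent_name, step_index).
    Returns the case ids needing adjudication and the raw agreement rate.
    """
    disagreements = [
        case_id
        for case_id, label in labels_a.items()
        if labels_b.get(case_id) != label
    ]
    agreement = 1 - len(disagreements) / len(labels_a)
    return disagreements, agreement
```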
What Methods Were Developed for Automated Attribution?
The researchers proposed and evaluated several approaches to automate failure attribution. Baseline methods include simple heuristics, such as blaming the last agent that spoke before the failure or the agent with the most output tokens. LLM-based methods involve prompting a powerful language model to analyze the entire conversation log and output the responsible agent and step. More advanced graph-based methods model the agent interactions as a directed graph, tracing information flow and identifying the nodes where errors propagate. The team also explored contrastive reasoning, where an LLM compares the failed interaction to a similar successful run to highlight deviations. The best-performing method combined multiple signals, analyzing both the context of each agent's messages and the causal dependencies between them. All methods were evaluated on the Who&When benchmark using accuracy metrics for both Who and When. The results show that while no single method is perfect, the graph-based and contrastive approaches significantly outperform the simple baselines.
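As one concrete illustration, here is a minimal sketch of the LLM-based approach: feed the whole failure log to a model and ask it to name the responsible agent and step. The prompt wording and the response parsing are ours, not the paper's exact implementation, and the parsing optimistically assumes the model returns bare JSON.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def attribute_with_llm(log_text: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM to read a failure log and name the responsible
    agent and the step where the decisive error occurred."""
    prompt = (
        "The following is the full log of a multi-agent system that "
        "failed its task. Identify the single agent most responsible "
        "for the failure and the step where the decisive error occurred. "
        'Answer only with JSON: {"agent": "<name>", "step": <index>}.\n\n'
        + log_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Production code should parse more defensively than this.
    return json.loads(response.choices[0].message.content)
```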
How Effective Are These Attribution Methods?
In experiments on the Who&When dataset, the most effective attribution methods achieved around 70-80% accuracy in identifying the failing agent, and slightly lower accuracy (60-70%) for pinpointing the exact step. Graph-based methods that reason about information flow performed consistently best, especially in failures caused by miscommunication between agents. LLM-based prompting approaches worked well for simple failures but struggled when the root cause was subtle—like an agent making a plausible but wrong assumption. The contrastive method that compared failed and successful runs added a 5-10% boost in accuracy. Importantly, human annotators, when given the same logs, achieved around 85-90% accuracy, meaning there is still room for improvement. The researchers noted that attribution becomes harder as the number of agents increases and as tasks become more open-ended. These results set a baseline for future work and highlight that automated attribution is both feasible and challenging.
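The two headline metrics are simple to compute. Here is a minimal sketch using our own dictionary layout; it is not the benchmark's official evaluation script.

```python
def attribution_accuracy(predictions, ground_truth):
    """Compute agent-level ("Who") and step-level ("When") accuracy.

    predictions, ground_truth: dicts mapping case_id -> (agent, step).
    """
    n = len(ground_truth)
    who = sum(predictions[c][0] == ground_truth[c][0] for c in ground_truth) / n
    when = sum(predictions[c][1] == ground_truth[c][1] for c in ground_truth) / n
    return {"who_accuracy": who, "when_accuracy": when}
```

Note that step-level accuracy is the stricter metric: a method can blame the right agent while still missing the exact step where the error entered the conversation.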
What Are the Future Directions for This Research?
The team outlines several promising avenues. One is to integrate attribution methods into real-time monitoring of multi-agent systems, so failures can be caught and explained as they happen. Another is to use attribution signals to automatically repair failures, for example by re-running the responsible agent with corrected context. The researchers also plan to extend the benchmark to include more complex agent roles (e.g., agents that manage subtasks dynamically) and longer interaction chains. Additionally, they hope to explore human-in-the-loop attribution, where the automated method suggests likely culprits and a developer quickly verifies them. Finally, as multi-agent systems are deployed in safety-critical domains like healthcare or autonomous driving, robust failure attribution becomes essential for trust and accountability. The open-source release of the code and dataset invites the community to build on this foundation, accelerating progress toward reliable multi-agent collaboration.
How Can Developers Access the Open-Source Code and Data?
All resources are publicly available. The research paper (accepted as a Spotlight at ICML 2025) can be found on arXiv at https://arxiv.org/pdf/2505.00212. The code repository is on GitHub at https://github.com/mingyin1/Agents_Failure_Attribution, where you'll find implementations of the attribution methods and instructions for reproducing the experiments. The Who&When dataset is hosted on Hugging Face at https://huggingface.co/datasets/Kevin355/Who_and_When; it includes all logs with failure labels, ready for download and use. Developers and researchers can immediately experiment with their own attribution techniques, test new architectures, or integrate the dataset into their tooling. The team encourages contributions and feedback to help improve multi-agent system debugging.
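For example, the dataset can be fetched locally with the official huggingface_hub package; the file-listing step below assumes nothing about the repository's internal layout.

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Download a local copy of the Who&When dataset repository.
local_dir = snapshot_download(
    repo_id="Kevin355/Who_and_When",
    repo_type="dataset",
)

print("Downloaded to:", local_dir)
# Peek at the first few files to see how the failure cases are organized.
files = [p for p in Path(local_dir).rglob("*") if p.is_file()]
print("Files:", [p.name for p in files[:10]])
```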