New Benchmark Automatically Attributes Failures in LLM Multi-Agent Systems

Introduction

Large language model (LLM) multi-agent systems, in which teams of specialized AI agents work together on complex tasks, have become a hot topic in artificial intelligence. The collaborative approach shows great promise, but it also introduces a critical vulnerability: when the system fails, pinpointing which agent caused the failure, and at what stage, is a daunting challenge. Developers often end up manually combing through extensive interaction logs, a time-consuming and error-prone process akin to finding a needle in a haystack. To address this pain point, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced a new research problem: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, presents the first benchmark dataset for this task, called Who&When, along with several automated attribution methods. The code and dataset are fully open-source, offering a new path to improving the reliability of LLM multi-agent systems.

Source: syncedreview.com

The Debugging Challenge

LLM-driven multi-agent systems exhibit impressive capabilities, yet they remain fragile. A single agent’s error, a misunderstanding between agents, or a mistake in information transmission can cascade into a full task failure. When that happens, developers typically fall back on manual, inefficient debugging: a kind of log archaeology in which they sift through lengthy interaction logs to locate the source of the problem. The process also depends heavily on the developer’s intimate knowledge of the system architecture and agent behavior. That dependence makes troubleshooting labor-intensive and hard to scale as systems grow more complex.

The Who&When Benchmark

Dataset Construction

To provide a standardized evaluation ground for automated failure attribution, the team constructed Who&When, the first benchmark dataset dedicated to this task. The dataset comprises a diverse set of multi-agent interaction scenarios, where each scenario includes a recorded sequence of agent actions and communications, a final task outcome (success or failure), and ground-truth labels identifying which agent was responsible for the failure and at what specific step the mistake occurred. The researchers carefully designed the dataset to cover various types of failures, including incorrect reasoning, miscommunication, and missing information, ensuring broad coverage of real-world challenges.
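To make the shape of one scenario concrete, here is a minimal sketch of how such a record could be represented. The class and field names (Step, FailureLog, mistake_agent, mistake_step, failure_type) are hypothetical and chosen for illustration; the released dataset may use a different schema, and the example log is invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int    # position of this turn in the log (0-based here, by assumption)
    agent: str    # name of the agent that produced the turn
    content: str  # the action taken or the message sent


@dataclass
class FailureLog:
    task: str                           # the task given to the system
    steps: List[Step]                   # recorded actions and communications, in order
    succeeded: bool                     # final task outcome
    mistake_agent: Optional[str]        # ground-truth responsible agent (None on success)
    mistake_step: Optional[int]         # ground-truth step of the decisive mistake
    failure_type: Optional[str] = None  # hypothetical label, e.g. "incorrect_reasoning"


# An invented toy scenario, not taken from the dataset.
example = FailureLog(
    task="Report the capital of France.",
    steps=[
        Step(0, "planner", "Ask the searcher for the capital of France."),
        Step(1, "searcher", "The capital of France is Lyon."),
        Step(2, "writer", "Final answer: Lyon."),
    ],
    succeeded=False,
    mistake_agent="searcher",
    mistake_step=1,
    failure_type="incorrect_reasoning",
)

# Sanity check: the labeled step should be a turn taken by the labeled agent.
labeled_turn = example.steps[example.mistake_step]
print(labeled_turn.agent == example.mistake_agent)
```

Keeping the ground-truth agent and step as separate fields mirrors the two-part label described above and lets each part be checked on its own.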

Task Definition

Formally, the automated failure attribution task asks: given the interaction log of a multi-agent system that failed a task, identify both the responsible agent and the failure time step. This two-part output provides developers with precise, actionable information for debugging. The task is framed as a supervised learning problem, with the dataset serving as training and evaluation data for new attribution methods.
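One plausible way to score the two-part output is to measure agent-level and step-level accuracy separately. The sketch below assumes that scoring scheme; the function names and the toy predictions are illustrative, not taken from the paper's code.

```python
from typing import List, Tuple

# An attribution is a pair: (responsible agent, failure step).
Attribution = Tuple[str, int]


def agent_accuracy(preds: List[Attribution], truths: List[Attribution]) -> float:
    """Fraction of failed logs where the predicted agent matches the label."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(p[0] == t[0] for p, t in zip(preds, truths))
    return hits / len(truths)


def step_accuracy(preds: List[Attribution], truths: List[Attribution]) -> float:
    """Fraction of failed logs where the predicted step matches the label."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(p[1] == t[1] for p, t in zip(preds, truths))
    return hits / len(truths)


# Invented toy labels and predictions for four failed runs.
truths = [("searcher", 1), ("planner", 0), ("writer", 4), ("coder", 2)]
preds = [("searcher", 1), ("planner", 3), ("coder", 5), ("coder", 2)]

print(agent_accuracy(preds, truths))  # 0.75
print(step_accuracy(preds, truths))   # 0.5
```

In this sketch the two scores are computed independently, so a method can name the right agent while missing the exact step. A stricter variant could count a step as correct only when the agent also matches.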

Automated Attribution Methods

The researchers developed and evaluated several automated attribution methods, ranging from simple baselines to more sophisticated approaches. Each method was tested on the benchmark, providing insight into its strengths and limitations.

Results and Findings

The experiments revealed several key findings. First, automated failure attribution is inherently challenging: even the best methods achieved only moderate accuracy. Second, methods that explicitly model the sequence of agent interactions (such as pointer networks and chain-of-thought LLM prompting) outperformed simpler heuristics, suggesting that temporal context is crucial for accurate attribution. Third, performance varied significantly across failure types. Some errors, such as clear reasoning mistakes by a single agent, were easier to attribute, while others, such as cascading failures involving multiple agents, remained difficult. The benchmark provides a clear baseline for future research to improve on.
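A per-failure-type breakdown like the one described can be computed by grouping attribution outcomes by type. The sketch below is illustrative only: the type labels and the toy records are invented for the example and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (failure type, whether the attribution was correct).
Record = Tuple[str, bool]


def accuracy_by_type(records: List[Record]) -> Dict[str, float]:
    """Group outcomes by failure type and return the accuracy for each type."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for failure_type, correct in records:
        totals[failure_type] += 1
        hits[failure_type] += int(correct)
    return {t: hits[t] / totals[t] for t in totals}


# Invented toy outcomes, chosen only to show the computation.
records = [
    ("single_agent_reasoning", True),
    ("single_agent_reasoning", True),
    ("single_agent_reasoning", True),
    ("single_agent_reasoning", False),
    ("cascading", True),
    ("cascading", False),
    ("cascading", False),
    ("cascading", False),
]

for failure_type, acc in accuracy_by_type(records).items():
    print(failure_type, acc)
```

Reporting accuracy per type rather than as a single number makes it visible when a method does well on isolated mistakes but poorly on cascading ones.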

Future Directions

The Who&When dataset opens up several promising avenues for future work. One direction is to develop more robust attribution models that can handle ambiguous or multi-factorial failures. Another is to integrate attribution into the system design itself, enabling agents to self-report errors or to have a monitoring agent that performs real-time attribution. Additionally, expanding the benchmark to cover more complex multi-agent architectures, including those with dynamic roles and interleaved communications, would increase real-world applicability. The researchers hope that their work will inspire further research into making multi-agent systems more transparent and easier to debug.
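As a rough illustration of the monitoring-agent idea, the sketch below wraps a pluggable checker that inspects each turn as it arrives and records the first one it flags. Everything here is hypothetical: the Monitor class, the checker signature, and the keyword-based toy checker are assumptions for the example, and a real checker would more likely be an LLM call or a task-specific validator.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (agent name, message)
# A checker receives the prior history and the new turn; True means "looks faulty".
Checker = Callable[[List[Turn], Turn], bool]


class Monitor:
    """Observes turns in order and remembers the first one the checker flags."""

    def __init__(self, checker: Checker) -> None:
        self.checker = checker
        self.history: List[Turn] = []
        self.flag: Optional[Tuple[str, int]] = None  # (agent, step) once set

    def observe(self, agent: str, message: str) -> Optional[Tuple[str, int]]:
        turn = (agent, message)
        step = len(self.history)  # 0-based index of the incoming turn
        if self.flag is None and self.checker(self.history, turn):
            self.flag = (agent, step)
        self.history.append(turn)
        return self.flag


def toy_checker(history: List[Turn], turn: Turn) -> bool:
    # Toy rule for demonstration: flag any message that admits it is unverified.
    return "unverified" in turn[1]


monitor = Monitor(toy_checker)
monitor.observe("planner", "Ask the searcher for the figure.")
monitor.observe("searcher", "Here is an unverified figure from memory.")
monitor.observe("writer", "Final answer uses the figure.")

print(monitor.flag)  # ('searcher', 1)
```

Recording only the first flagged turn reflects the goal of locating where things first went wrong; a variant could keep every flagged turn to help with multi-factor failures.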

Conclusion

Automated failure attribution represents a critical step toward building reliable and maintainable LLM multi-agent systems. By introducing the Who&When benchmark and evaluating a range of attribution methods, the research team from Penn State, Duke, and collaborating institutions has laid a solid foundation for future progress. Developers can now begin to move away from manual log analysis and toward automated debugging tools that quickly identify the root cause of failures. The open-source release of code and data will accelerate research in this area, ultimately helping multi-agent systems become more robust and trustworthy. For more details, refer to the paper, code, and dataset.
