How to Automatically Diagnose Failures in LLM Multi-Agent Systems: A Step-by-Step Guide Using the Who&When Framework
What You Need
Before you begin, ensure you have the following:

- Basic understanding of Large Language Models (LLMs) and multi-agent systems
- Familiarity with Python programming and machine learning concepts
- Access to the Who&When dataset and codebase (both open-source, linked below)
- A development environment with Python 3.8+ and necessary libraries (e.g., PyTorch, transformers)
- Interaction logs from your multi-agent system (or use the provided sample logs for practice)
- A task failure scenario you wish to analyze (e.g., a question-answering pipeline that returned wrong results)
Introduction
LLM-driven multi-agent systems are powerful but fragile. A single miscommunication or error by one agent can cascade into a complete task failure, leaving developers to sift through long logs. Debugging has typically been a manual, time-consuming hunt. Researchers from Penn State, Duke, Google DeepMind, and other institutions have proposed automated failure attribution as a research problem and released the Who&When benchmark to study it. This step-by-step guide walks you through applying their framework to identify which agent caused a failure and at what step it happened. By the end, you’ll have a repeatable workflow for narrowing down the root cause of a system failure.
Step 1: Define Your Multi-Agent System and Task
Start by clearly specifying your system’s architecture and the task it performs. For example, you might have three agents: a planner, an executor, and a verifier working together to answer a user query. Document the roles, communication protocols, and expected outputs. This step is crucial because automated attribution methods rely on understanding agent interactions. Without a clear definition, you cannot map failures to specific agents. List the agents and their responsibilities, and note any shared memory or message-passing mechanisms.
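One way to document the architecture is as a small, checkable data structure. The sketch below is only an illustration: the class names `AgentSpec` and `SystemSpec`, their fields, and the three example agents are assumptions made for this guide, not part of the Who&When codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentSpec:
    """One agent in the system (field names are illustrative assumptions)."""
    agent_id: str
    role: str
    responsibilities: List[str]
    sends_to: List[str] = field(default_factory=list)  # message-passing targets


@dataclass
class SystemSpec:
    """The whole system: task, agents, and whether they share memory."""
    task: str
    agents: List[AgentSpec]
    shared_memory: bool = False

    def validate(self) -> None:
        """Check that IDs are unique and every message target is a known agent."""
        ids = [a.agent_id for a in self.agents]
        if len(ids) != len(set(ids)):
            raise ValueError("duplicate agent_id in system spec")
        for agent in self.agents:
            unknown = set(agent.sends_to) - set(ids)
            if unknown:
                raise ValueError(
                    f"{agent.agent_id} sends to unknown agents: {sorted(unknown)}"
                )


system = SystemSpec(
    task="answer a user query",
    agents=[
        AgentSpec("planner", "plans sub-tasks", ["decompose query"], ["executor"]),
        AgentSpec("executor", "runs sub-tasks", ["call tools"], ["verifier"]),
        AgentSpec("verifier", "checks results", ["validate answer"], ["planner"]),
    ],
)
system.validate()  # raises ValueError if the spec is inconsistent
```

Keeping the spec in code means the same agent IDs can later be checked against the IDs that appear in your logs.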
Step 2: Collect and Prepare Interaction Logs
Run your system on a set of tasks and capture all inter-agent communication. Typical logs include timestamps, sender, receiver, message content, and internal state changes. The Who&When dataset provides sample logs for benchmark tasks, but for your own system, ensure logs are saved in a structured format (e.g., JSON or CSV). Each log entry should include: agent ID, time step, action taken, and any errors or warnings. Clean the logs by removing irrelevant entries (e.g., system heartbeat messages) and, if needed, normalize timestamps to a single uniform unit.
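The cleaning step might look like the following minimal sketch. The field names (`agent_id`, `step`, `action`, `timestamp`, `error`), the `heartbeat` action label, and the ISO 8601 timestamp format are assumptions for illustration; adapt them to whatever your system actually logs.

```python
from datetime import datetime, timezone

# Assumed schema for one log entry; rename to match your own logs.
REQUIRED_FIELDS = ("agent_id", "step", "action", "timestamp")
NOISE_ACTIONS = {"heartbeat"}  # entries to drop during cleaning


def clean_logs(entries):
    """Drop noise entries, check required fields, and convert
    ISO 8601 timestamps to epoch seconds (UTC)."""
    cleaned = []
    for entry in entries:
        if entry.get("action") in NOISE_ACTIONS:
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"log entry missing fields: {missing}")
        ts = datetime.fromisoformat(entry["timestamp"])
        if ts.tzinfo is None:
            # Treat timestamps without an offset as UTC (an assumption).
            ts = ts.replace(tzinfo=timezone.utc)
        cleaned.append({**entry, "timestamp": ts.timestamp()})
    # Order by step first, then by time within a step.
    return sorted(cleaned, key=lambda e: (e["step"], e["timestamp"]))


raw_entries = [
    {"agent_id": "executor", "step": 2, "action": "call_tool",
     "timestamp": "2024-01-01T12:00:01+02:00", "error": "missing parameter"},
    {"agent_id": "system", "step": 1, "action": "heartbeat",
     "timestamp": "2024-01-01T10:00:00+00:00"},
    {"agent_id": "planner", "step": 1, "action": "send_plan",
     "timestamp": "2024-01-01T10:00:00+00:00", "error": None},
]
cleaned = clean_logs(raw_entries)
```

After cleaning, the two remaining entries share one time unit even though they were logged with different UTC offsets.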
Step 3: Identify Task Failures
Define what constitutes a failure for your task. Common criteria include: final answer incorrect, task incomplete, timeout, or violation of constraints. For each run, label it as success or failure. The Who&When dataset includes pre-annotated failures, but for custom systems, you’ll need to manually check a subset to create ground truth. This step is essential for training or evaluating attribution models. Create a list of failure instances with a pointer to the log segment where the failure became evident.
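The failure criteria above can be encoded as a small labeling function. This is a sketch under assumptions: the run fields (`run_id`, `final_answer`, `expected`, `elapsed_s`, `violations`, `last_step`), the 60-second limit, and the use of the last step as the initial pointer are all hypothetical choices to be replaced with your own.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    WRONG_ANSWER = "wrong_answer"
    INCOMPLETE = "incomplete"
    TIMEOUT = "timeout"
    CONSTRAINT_VIOLATION = "constraint_violation"


def label_run(run, time_limit_s=60.0):
    """Label one run using the criteria listed in this step."""
    if run["elapsed_s"] > time_limit_s:
        return Outcome.TIMEOUT
    if run.get("violations"):
        return Outcome.CONSTRAINT_VIOLATION
    if run["final_answer"] is None:
        return Outcome.INCOMPLETE
    # Case- and whitespace-insensitive exact match; swap in your own checker.
    if run["final_answer"].strip().lower() != run["expected"].strip().lower():
        return Outcome.WRONG_ANSWER
    return Outcome.SUCCESS


def collect_failures(runs, time_limit_s=60.0):
    """Return failure records with a pointer into the log.
    The pointer defaults to the last step; refine it during manual review."""
    failures = []
    for run in runs:
        outcome = label_run(run, time_limit_s)
        if outcome is not Outcome.SUCCESS:
            failures.append({
                "run_id": run["run_id"],
                "outcome": outcome.value,
                "evident_at_step": run["last_step"],
            })
    return failures


runs = [
    {"run_id": "r1", "final_answer": " Paris", "expected": "paris",
     "elapsed_s": 12.0, "last_step": 6},
    {"run_id": "r2", "final_answer": "Lyon", "expected": "Paris",
     "elapsed_s": 15.0, "last_step": 7},
    {"run_id": "r3", "final_answer": None, "expected": "Paris",
     "elapsed_s": 20.0, "last_step": 4},
    {"run_id": "r4", "final_answer": "Paris", "expected": "Paris",
     "elapsed_s": 90.0, "last_step": 9},
]
failures = collect_failures(runs)
```

Automatic labels like these only say that a run failed; the responsible agent and step still come from manual annotation or from the attribution method.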
Step 4: Download and Set Up the Who&When Benchmark
Access the open-source resources:
- Paper: arXiv PDF
- Code: GitHub repository
- Dataset: Hugging Face
Clone the repository and install dependencies. The dataset contains logs from multiple multi-agent tasks (e.g., question answering, code generation) with ground-truth failure labels specifying the responsible agent and the time step. Use the provided notebooks to explore the data structure. This benchmark will serve as both a training set and a test bed for your attribution methods.
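Before relying on any particular schema, it can help to inspect which fields the downloaded files actually contain. The sketch below assumes the data has been saved locally as one JSON object per file; that layout, the function name `summarize_json_keys`, and the placeholder keys in the demo are assumptions, so check the dataset card for the real format.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_json_keys(directory):
    """Count how often each top-level key appears across *.json files."""
    key_counts = Counter()
    n_files = 0
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        if isinstance(record, dict):
            key_counts.update(record.keys())
            n_files += 1
    return n_files, key_counts


# Demo on two throwaway files with placeholder keys (not the real schema).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(
        json.dumps({"field_a": [], "field_b": "x"}), encoding="utf-8")
    Path(tmp, "b.json").write_text(
        json.dumps({"field_a": []}), encoding="utf-8")
    n_files, key_counts = summarize_json_keys(tmp)

for key, count in key_counts.most_common():
    print(f"{key}: present in {count} of {n_files} files")
```

Point `summarize_json_keys` at your local copy of the dataset to see which fields are always present and which are optional.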
Step 5: Choose or Develop an Attribution Method
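The researchers evaluated several automated attribution approaches on Who&When; the paper’s own methods (All-at-Once, Step-by-Step, and Binary Search) prompt an LLM to judge the failure log. You can adopt one of these or develop your own. Broader families of approaches to consider include: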

- Post-hoc analysis: Analyze logs after a failure using causal reasoning or attention-based mechanisms.
- Online monitoring: Embed a lightweight detector that triggers attribution when a failure is imminent.
- Contrastive learning: Train a model to distinguish between successful and failed runs by agent behavior.
As a starting point, try a two-step causal-tracing procedure: first identify the time window with anomalies, then isolate the agent whose actions deviate most. Check the codebase for the scripts it provides for running methods on Who&When. Configure hyperparameters such as window size and noise thresholds, and validate on a held-out set before trusting the results.
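The two-step idea (find the anomalous window, then the most deviant agent inside it) can be illustrated with a toy heuristic. This is not the paper's method or the repository's code: it assumes each step already carries a numeric `anomaly_score` that you supply (for example from error flags or a separate scorer), and the function names are hypothetical.

```python
from collections import defaultdict


def find_anomalous_window(scores, window_size):
    """Step 1: return (start, end) indices of the contiguous window
    whose scores sum to the largest value."""
    if not 1 <= window_size <= len(scores):
        raise ValueError("window_size must be between 1 and len(scores)")
    starts = range(len(scores) - window_size + 1)
    best = max(starts, key=lambda s: sum(scores[s:s + window_size]))
    return best, best + window_size


def attribute_failure(events, window_size=3, noise_threshold=0.05):
    """Step 2: inside the window, pick the agent with the largest
    total score, then that agent's highest-scoring step."""
    # Scores at or below the noise threshold are treated as zero.
    clipped = [max(e["anomaly_score"] - noise_threshold, 0.0) for e in events]
    start, end = find_anomalous_window(clipped, window_size)
    in_window = list(zip(events[start:end], clipped[start:end]))

    totals = defaultdict(float)
    for event, score in in_window:
        totals[event["agent_id"]] += score
    agent_id = max(totals, key=totals.get)

    _, step = max(
        (score, event["step"])
        for event, score in in_window
        if event["agent_id"] == agent_id
    )
    return {
        "agent_id": agent_id,
        "step": step,
        "window": (events[start]["step"], events[end - 1]["step"]),
    }


agents = ["planner", "executor", "verifier"]
raw_scores = [0.1, 0.0, 0.1, 0.2, 0.9, 0.3, 0.1, 0.0]
events = [
    {"step": i, "agent_id": agents[i % 3], "anomaly_score": s}
    for i, s in enumerate(raw_scores)
]
result = attribute_failure(events)
```

On this toy trace the window covers steps 3 to 5 and the executor's action at step 4 is flagged. The quality of the output depends entirely on the anomaly scores, which is why window size and noise threshold need tuning on held-out data.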
Step 6: Apply the Attribution Model to Your System’s Logs
Once your model is trained (or ready in a few-shot setting), run it on the failure instances you collected in Step 3. For each failure, the model should output the responsible agent (e.g., “agent_1”) and the critical time step (e.g., “step 7”). Compare these predictions with your manual ground truth to assess accuracy. For logs not in the benchmark, you may need to adapt the input format. If the code exposes a call such as attribution_model.attribute(log_data), expect it to return a dictionary of failures; check the repository README for the actual entry point. Review the results and note any false positives or missed attributions.
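Comparing predictions with ground truth comes down to two numbers: how often the agent is right and how often the step is right. The sketch below assumes predictions and labels are dictionaries keyed by run ID with `agent_id` and `step` fields; that shape and the `step_tolerance` parameter are illustrative assumptions.

```python
def attribution_accuracy(predictions, ground_truth, step_tolerance=0):
    """Return agent-level and step-level accuracy over all labeled runs.
    A run with no prediction counts as a miss on both metrics."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    agent_hits = 0
    step_hits = 0
    for run_id, truth in ground_truth.items():
        pred = predictions.get(run_id)
        if pred is None:
            continue  # missed attribution
        if pred["agent_id"] == truth["agent_id"]:
            agent_hits += 1
        if abs(pred["step"] - truth["step"]) <= step_tolerance:
            step_hits += 1
    total = len(ground_truth)
    return {
        "agent_accuracy": agent_hits / total,
        "step_accuracy": step_hits / total,
    }


ground_truth = {
    "r1": {"agent_id": "agent_1", "step": 7},
    "r2": {"agent_id": "agent_2", "step": 5},
    "r3": {"agent_id": "agent_1", "step": 3},
    "r4": {"agent_id": "agent_3", "step": 9},
}
predictions = {
    "r1": {"agent_id": "agent_1", "step": 7},  # agent and step correct
    "r2": {"agent_id": "agent_2", "step": 6},  # agent correct, step off by one
    "r3": {"agent_id": "agent_2", "step": 3},  # wrong agent, step correct
    # r4 has no prediction
}
exact = attribution_accuracy(predictions, ground_truth)
lenient = attribution_accuracy(predictions, ground_truth, step_tolerance=1)
```

Reporting the two metrics separately matters because identifying the exact step is usually harder than identifying the agent, and a tolerance of one or two steps may be enough to guide debugging.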
Step 7: Interpret Results and Debug the System
With the attribution output, you have a candidate agent and time step to investigate. Examine the log at the identified time step to see the specific error, such as a hallucinated fact, a miscommunication, or a missing parameter. Use this insight to fix the agent’s behavior, adjust the prompt, or modify the communication protocol. For example, if agent_2 consistently fails at step 5 when receiving numeric data, you might add input validation. Validate your fix by re-running the task and confirming the failure is resolved. Automate this feedback loop to improve system reliability over time.
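For the numeric-data example, input validation could be as simple as the sketch below. The message shape (a dictionary with a `value` field) and the function name `parse_numeric_payload` are assumptions for illustration.

```python
import math


def parse_numeric_payload(message):
    """Return the message's 'value' as a finite float, or raise
    ValueError so the failure surfaces where the bad input arrives."""
    raw = message.get("value")
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError("expected a number, got a boolean")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"expected a number, got {raw!r}") from None
    if not math.isfinite(value):
        raise ValueError(f"expected a finite number, got {raw!r}")
    return value


value = parse_numeric_payload({"sender": "agent_1", "value": "42"})

try:
    parse_numeric_payload({"sender": "agent_1", "value": "1,200"})
except ValueError as exc:
    rejection = str(exc)  # log this and ask the sender to resend
```

Failing loudly at the receiving agent turns a silent downstream error into a log entry at the step where the bad input arrived, which also makes later attribution easier.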
Tips for Success
- Log granularity matters: The more detailed your logs, the easier attribution becomes. Include agent internal states (e.g., confidence scores, intermediate outputs) alongside messages.
- Start with simple tasks: Test your attribution pipeline on a toy multi-agent system (e.g., two agents solving a math problem) before scaling to complex projects.
- Combine with manual inspection: Even with automation, spot-check a few attributions to build trust in the model. The Who&When dataset includes multiple failure types, so use them to evaluate robustness.
- Iterate on the model: If attribution accuracy is low, fine-tune the model on your own failure logs using transfer learning from the benchmark.
- Share your findings: Contribute new failure cases to the community (e.g., via Hugging Face) to improve multi-agent debugging tools for everyone.