How to Automatically Diagnose Failures in LLM Multi-Agent Systems: A Step-by-Step Guide Using the Who&When Framework

What You Need

Before you begin, ensure you have the following:

Source: syncedreview.com

Introduction

LLM-driven multi-agent systems are powerful but fragile. A single miscommunication or error by one agent can cascade into a complete task failure, leaving developers to dig through long logs. Debugging has largely been a manual, time-consuming hunt. Researchers from Penn State, Duke, Google DeepMind, and other institutions have proposed a way to automate it: automated failure attribution, studied through the Who&When benchmark. This step-by-step guide walks you through applying their framework to identify which agent caused a failure and at which step it happened. By the end, you should be able to narrow down the root cause of a system failure much faster than by reading logs by hand.

Step 1: Define Your Multi-Agent System and Task

Start by clearly specifying your system’s architecture and the task it performs. For example, you might have three agents: a planner, an executor, and a verifier working together to answer a user query. Document the roles, communication protocols, and expected outputs. This step is crucial because automated attribution methods rely on understanding agent interactions. Without a clear definition, you cannot map failures to specific agents. List the agents and their responsibilities, and note any shared memory or message-passing mechanisms.
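One way to keep this documentation checkable is to write it down as data. The sketch below is a minimal illustration, not part of the Who&When codebase: the class names AgentSpec and SystemSpec, their fields, and the "user" endpoint are all assumptions made for this example. It records each agent's role and message routes, then flags routes that point at agents you never defined.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentSpec:
    """One agent's role and message routes (all field names are illustrative)."""
    agent_id: str
    role: str
    receives_from: tuple[str, ...] = ()
    sends_to: tuple[str, ...] = ()


@dataclass
class SystemSpec:
    """The task plus the agents that cooperate on it."""
    task: str
    agents: list[AgentSpec] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems, such as routes to undefined agents."""
        known = {a.agent_id for a in self.agents} | {"user"}
        problems = []
        for agent in self.agents:
            for peer in agent.receives_from + agent.sends_to:
                if peer not in known:
                    problems.append(
                        f"{agent.agent_id} references unknown agent {peer!r}"
                    )
        return problems


# The planner / executor / verifier example from the text.
spec = SystemSpec(
    task="Answer a user query",
    agents=[
        AgentSpec("planner", "break the query into steps", ("user",), ("executor",)),
        AgentSpec("executor", "carry out each step", ("planner",), ("verifier",)),
        AgentSpec("verifier", "check the result", ("executor",), ("user",)),
    ],
)

print(spec.validate())
```

An empty list from validate() means every route points at a declared agent; anything else is worth fixing before you start collecting logs.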

Step 2: Collect and Prepare Interaction Logs

Run your system on a set of tasks and capture all inter-agent communication. Typical logs include timestamps, sender, receiver, message content, and internal state changes. The Who&When dataset provides sample logs for benchmark tasks, but for your own system, make sure logs are saved in a structured format (e.g., JSON or CSV). Each log entry should include the agent ID, the time step, the action taken, and any errors or warnings. Clean the logs by removing irrelevant entries (e.g., system heartbeat messages) and, if needed, normalizing timestamps to a single time unit and time zone.
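The cleaning described above can be sketched in a few lines. This is an illustrative example under stated assumptions: the field names (agent_id, timestamp, action, content, error), the "heartbeat" action label, and the function clean_logs are made up for this sketch and are not the Who&When log schema. Timestamps are assumed to be ISO 8601 strings that carry a UTC offset.

```python
import json
from datetime import datetime


def clean_logs(raw_entries, drop_actions=("heartbeat",)):
    """Drop irrelevant entries, convert timestamps to epoch seconds, number the steps."""
    cleaned = []
    for entry in raw_entries:
        if entry["action"] in drop_actions:
            continue  # e.g. system heartbeat messages
        # Assumes an ISO 8601 timestamp with an explicit UTC offset.
        ts = datetime.fromisoformat(entry["timestamp"])
        cleaned.append({**entry, "timestamp": ts.timestamp()})
    cleaned.sort(key=lambda e: e["timestamp"])
    for step, entry in enumerate(cleaned):
        entry["step"] = step
    return cleaned


raw = [
    {"agent_id": "planner", "timestamp": "2025-01-01T12:00:00+00:00",
     "action": "plan", "content": "1. search 2. summarize", "error": None},
    {"agent_id": "system", "timestamp": "2025-01-01T12:00:01+00:00",
     "action": "heartbeat", "content": "", "error": None},
    {"agent_id": "executor", "timestamp": "2025-01-01T13:00:02+01:00",
     "action": "execute", "content": "search failed", "error": "HTTP 500"},
]

cleaned = clean_logs(raw)

# One JSON object per line (JSON Lines) is convenient for later processing.
jsonl = "\n".join(json.dumps(e) for e in cleaned)
print(jsonl)
```

Converting every timestamp to epoch seconds puts entries logged in different time zones on one axis, so the step numbers reflect the real order of events.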

Step 3: Identify Task Failures

Define what constitutes a failure for your task. Common criteria include: final answer incorrect, task incomplete, timeout, or violation of constraints. For each run, label it as success or failure. The Who&When dataset includes pre-annotated failures, but for custom systems, you’ll need to manually check a subset to create ground truth. This step is essential for training or evaluating attribution models. Create a list of failure instances with a pointer to the log segment where the failure became evident.
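Failure criteria are easier to apply consistently when they live in one function. The sketch below is only an example: the RunResult fields, the 60-second timeout, and the case-insensitive answer comparison are assumptions, and real tasks will usually need a task-specific correctness check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Outcome of one run (field names are illustrative)."""
    run_id: str
    final_answer: Optional[str]
    expected_answer: str
    elapsed_seconds: float


def failure_reason(run: RunResult, timeout_seconds: float = 60.0) -> Optional[str]:
    """Return a short failure label, or None if the run succeeded."""
    if run.elapsed_seconds > timeout_seconds:
        return "timeout"
    if run.final_answer is None:
        return "incomplete"
    if run.final_answer.strip().lower() != run.expected_answer.strip().lower():
        return "incorrect_answer"
    return None


runs = [
    RunResult("run-1", "Paris", "paris", 12.0),
    RunResult("run-2", "Lyon", "Paris", 9.5),
    RunResult("run-3", None, "Paris", 20.0),
    RunResult("run-4", "Paris", "Paris", 75.0),
]

# Keep only the failed runs, keyed by run ID, for the attribution step.
failures = {r.run_id: failure_reason(r) for r in runs if failure_reason(r) is not None}
print(failures)
```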

Step 4: Download and Set Up the Who&When Benchmark

Access the open-source resources:

Clone the repository and install dependencies. The dataset contains logs from multiple multi-agent tasks (e.g., question answering, code generation) with ground-truth failure labels specifying the responsible agent and the time step. Use the provided notebooks to explore the data structure. This benchmark will serve as both a training set and a test bed for your attribution methods.

Step 5: Choose or Develop an Attribution Method

The researchers introduced several automated attribution approaches evaluated on Who&When. You can either adopt their best-performing model or develop your own. The methods include:

As a starting point, implement the Attribution by Causal Tracing approach from the paper, which uses a two-step process: first identify the time window that contains anomalies, then isolate the agent whose actions deviate most within that window. The codebase includes scripts to train these models on Who&When. Configure hyperparameters such as window size and noise thresholds, train on the benchmark, and validate on a held-out set.
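To make the two-step idea concrete, here is a toy heuristic. It is not the paper's implementation and not code from the Who&When repository. It assumes each log step already carries a per-step anomaly score between 0 and 1, which is a hypothetical input you would have to produce yourself (for example from error flags or an LLM judge). The function name attribute_failure and its parameters are likewise illustrative.

```python
from collections import defaultdict


def attribute_failure(steps, window_size=3, noise_threshold=0.5):
    """Toy two-stage attribution.

    steps: list of (agent_id, anomaly_score) tuples in time order.
    Returns (agent_id, step_index), or None if nothing exceeds the threshold.
    """
    # Scores below the noise threshold are treated as zero.
    scores = [s if s >= noise_threshold else 0.0 for _, s in steps]
    if not any(scores):
        return None

    # Stage 1: find the window with the largest total anomaly score.
    n = len(steps)
    w = min(window_size, n)
    best_start = max(range(n - w + 1), key=lambda i: sum(scores[i:i + w]))
    window = range(best_start, best_start + w)

    # Stage 2: inside that window, pick the agent with the highest total score.
    per_agent = defaultdict(float)
    for i in window:
        per_agent[steps[i][0]] += scores[i]
    agent = max(per_agent, key=per_agent.get)

    # Report that agent's first anomalous step in the window.
    step = next(i for i in window if steps[i][0] == agent and scores[i] > 0)
    return agent, step


trace = [
    ("planner", 0.1),
    ("executor", 0.2),
    ("executor", 0.9),
    ("verifier", 0.6),
    ("executor", 0.8),
    ("planner", 0.0),
]

print(attribute_failure(trace))
```

On this trace the highest-scoring window covers steps 2 to 4, and the executor accounts for most of the anomaly mass there, so the heuristic reports the executor at step 2. A larger window_size trades localization precision for robustness to noisy single steps.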

Step 6: Apply the Attribution Model to Your System’s Logs

Once your model is trained (or ready in a few-shot setting), run it on the failure instances you collected in Step 3. The model will output for each failure: the responsible agent (e.g., “agent_1”) and the critical time step (e.g., “step 7”). Compare these predictions with your manual ground truth to assess accuracy. For logs not in the benchmark, you may need to adapt the input format. The code provides a simple API: attribution_model.attribute(log_data) returning a dictionary of failures. Review the results and make note of false positives or missed attributions.
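Comparing predictions with your manual ground truth comes down to two numbers: how often the agent is right, and how often both the agent and the step are right. The sketch below assumes predictions and labels are plain dictionaries mapping a run ID to an (agent_id, step) pair; that layout and the function name attribution_accuracy are assumptions for this example, not the output format of the Who&When code. A run with no prediction counts as a miss.

```python
def attribution_accuracy(predictions, ground_truth):
    """Score predictions against labels.

    Both arguments map run_id -> (agent_id, step).
    Runs missing from predictions count as wrong.
    """
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    agent_hits = 0
    exact_hits = 0
    for run_id, truth in ground_truth.items():
        pred = predictions.get(run_id)
        if pred is None:
            continue  # missed attribution
        if pred[0] == truth[0]:
            agent_hits += 1
        if tuple(pred) == tuple(truth):
            exact_hits += 1
    total = len(ground_truth)
    return {
        "agent_accuracy": agent_hits / total,
        "step_accuracy": exact_hits / total,  # agent and step both correct
    }


ground_truth = {
    "run-1": ("agent_1", 7),
    "run-2": ("agent_2", 5),
    "run-3": ("agent_2", 3),
    "run-4": ("agent_3", 9),
}
predictions = {
    "run-1": ("agent_1", 7),   # agent and step correct
    "run-2": ("agent_2", 4),   # agent correct, step off by one
    "run-3": ("agent_2", 3),   # agent and step correct
    # run-4 has no prediction and counts as a miss
}

print(attribution_accuracy(predictions, ground_truth))
```

Tracking the two accuracies separately shows whether the model struggles to find the agent at all or finds the right agent but the wrong moment.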

Step 7: Interpret Results and Debug the System

With the attribution output, you have a specific agent and time step to investigate. Examine the log at the identified time step to see the specific error, such as a hallucinated fact, a miscommunication, or a missing parameter. Use this insight to fix the agent’s behavior, adjust the prompt, or modify the communication protocol. For example, if agent_2 consistently fails at step 5 when receiving numeric data, you might add input validation. Validate your fix by re-running the task and confirming that the failure no longer occurs. Automate this feedback loop to improve system reliability over time.
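Spotting a pattern like "agent_2 consistently fails at step 5" is a counting exercise once attributions are collected across runs. The following sketch assumes each attribution is an (agent_id, step) tuple; the helper name recurring_failure_points and the min_count cutoff are illustrative choices.

```python
from collections import Counter


def recurring_failure_points(attributions, min_count=2):
    """Return ((agent_id, step), count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(attributions)
    return [(point, n) for point, n in counts.most_common() if n >= min_count]


attributions = [
    ("agent_2", 5),
    ("agent_1", 3),
    ("agent_2", 5),
    ("agent_2", 5),
    ("agent_1", 7),
]

print(recurring_failure_points(attributions))
```

Running the same count before and after a fix gives a simple check that the recurring failure point has actually disappeared.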

Tips for Success
