Source: Microsoft Research Blog | Date: March 2026

AgentRx: Systematic Debugging for AI Agents

⭐⭐⭐⭐⭐ 5/5 — Essential reading for AI agent developers

🎯 Core Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.

The Challenge

Modern AI agents are:

Long-horizon: Dozens of actions over extended periods
Probabilistic: Same input → different outputs, making reproduction difficult
Multi-agent: Failures can be "passed" between agents, masking root cause

Introducing AgentRx

AgentRx (Agent Diagnosis) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to "guess" the error, AgentRx uses a structured, multi-stage pipeline:

Pipeline Stages:

Trajectory normalization: Convert heterogeneous logs into a common intermediate representation
Constraint synthesis: Generate executable constraints from tool schemas and domain policies
Guarded evaluation: Evaluate constraints step by step, producing auditable validation logs
LLM-based judging: Use an LLM judge to identify the Critical Failure Step
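
To make the four stages concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not AgentRx's actual code: the `Step` and `Constraint` types, the field aliases, the example tool names, and the schema format (tool name mapped to required argument names) are all assumptions made for illustration. The final stage is replaced by a simple "earliest violation" stand-in, since the real system uses an LLM judge.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Step:
    """One normalized trajectory step (hypothetical intermediate representation)."""
    index: int
    agent: str
    tool: Optional[str]
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class Constraint:
    """An executable check plus a guard deciding whether it applies to a step."""
    name: str
    guard: Callable[[Step], bool]
    check: Callable[[Step], bool]


def normalize(raw_log: list[dict[str, Any]]) -> list[Step]:
    """Stage 1: map heterogeneous log records onto the common Step shape."""
    steps = []
    for i, rec in enumerate(raw_log):
        # Assumed aliases: different frameworks name these fields differently.
        tool = rec.get("tool") or rec.get("function")
        args = rec.get("args") or rec.get("arguments") or {}
        steps.append(Step(index=i, agent=rec.get("agent", "unknown"), tool=tool, args=args))
    return steps


def constraints_from_schemas(schemas: dict[str, list[str]]) -> list[Constraint]:
    """Stage 2: derive one required-arguments constraint per tool schema."""
    out = []
    for tool, required in schemas.items():
        out.append(Constraint(
            name=f"{tool}:required_args",
            # Default arguments bind the loop variables at definition time.
            guard=lambda s, tool=tool: s.tool == tool,
            check=lambda s, required=required: all(k in s.args for k in required),
        ))
    return out


def evaluate(steps: list[Step], constraints: list[Constraint]) -> list[dict[str, Any]]:
    """Stage 3: guarded, step-by-step evaluation that produces a validation log."""
    log = []
    for step in steps:
        for c in constraints:
            if not c.guard(step):
                continue  # constraint does not apply to this step
            log.append({"step": step.index, "constraint": c.name, "passed": c.check(step)})
    return log


def first_violation(log: list[dict[str, Any]]) -> Optional[int]:
    """Stand-in for stage 4: earliest failing step (the real stage uses an LLM judge)."""
    failing = [entry["step"] for entry in log if not entry["passed"]]
    return min(failing) if failing else None


# Hypothetical two-step trajectory from two differently shaped log sources.
raw = [
    {"agent": "planner", "tool": "search_flights", "args": {"origin": "SEA", "dest": "JFK"}},
    {"agent": "booker", "function": "book_flight", "arguments": {"flight_id": "F1"}},
]
schemas = {"search_flights": ["origin", "dest"], "book_flight": ["flight_id", "passenger"]}
validation_log = evaluate(normalize(raw), constraints_from_schemas(schemas))
```

In this toy run the second step omits the `passenger` argument, so the validation log records a failed `book_flight:required_args` check at step 1. A log like this is the kind of auditable evidence a judge could reason over instead of guessing from the raw trace.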

Failure Taxonomy (9 Categories)

Category	Description
Plan Adherence Failure	Ignored required steps / did extra unplanned actions
Invention of New Information	Altered facts not grounded in trace/tool output (hallucination)
Invalid Invocation	Tool call malformed / missing args / schema-invalid
Misinterpretation of Tool Output	Read tool output incorrectly; acted on wrong assumptions
Intent–Plan Misalignment	Misread user goal/constraints and planned wrongly
Under-specified User Intent	Could not proceed because required info wasn't available
Intent Not Supported	No available tool can do what's being asked
Guardrails Triggered	Execution blocked by safety/access restrictions
System Failure	Connectivity/tool endpoint failures
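
A fixed taxonomy like this maps naturally onto an enumeration, which keeps labels consistent when annotating or aggregating failures. The sketch below is illustrative only: the `FailureCategory` enum and the `parse_category` helper are hypothetical names, not part of AgentRx.

```python
from enum import Enum


class FailureCategory(Enum):
    """The nine categories from the table above, keyed by their display names."""
    PLAN_ADHERENCE_FAILURE = "Plan Adherence Failure"
    INVENTION_OF_NEW_INFORMATION = "Invention of New Information"
    INVALID_INVOCATION = "Invalid Invocation"
    MISINTERPRETATION_OF_TOOL_OUTPUT = "Misinterpretation of Tool Output"
    INTENT_PLAN_MISALIGNMENT = "Intent–Plan Misalignment"
    UNDER_SPECIFIED_USER_INTENT = "Under-specified User Intent"
    INTENT_NOT_SUPPORTED = "Intent Not Supported"
    GUARDRAILS_TRIGGERED = "Guardrails Triggered"
    SYSTEM_FAILURE = "System Failure"


def parse_category(label: str) -> FailureCategory:
    """Map a free-text label to a category, tolerating case and dash variants."""
    normalized = label.strip().lower().replace("–", "-")
    for category in FailureCategory:
        if category.value.lower().replace("–", "-") == normalized:
            return category
    raise ValueError(f"unknown failure category: {label!r}")
```

For example, `parse_category("intent-plan misalignment")` resolves to `FailureCategory.INTENT_PLAN_MISALIGNMENT`, while an unrecognized label raises `ValueError` rather than silently creating a tenth category.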

Key Results

+23.6% absolute improvement in failure localization accuracy
+22.9% improvement in root-cause attribution
115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One
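
To clarify what an "absolute improvement" in localization accuracy means, the sketch below computes accuracy as the fraction of trajectories where the predicted failure step matches the annotated one, then takes the difference in percentage points. The exact-match scoring rule is an assumption, and the step indices are toy data, not results from the article.

```python
def localization_accuracy(predicted: list[int], annotated: list[int]) -> float:
    """Fraction of trajectories whose predicted failure step equals the annotation."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("at least one trajectory is required")
    hits = sum(p == a for p, a in zip(predicted, annotated))
    return hits / len(annotated)


# Toy data: one annotated failure-step index per failed trajectory.
annotated = [3, 7, 1, 4]
baseline = [3, 2, 5, 0]   # hypothetical baseline predictions
improved = [3, 7, 1, 0]   # hypothetical predictions from a better method

# Absolute improvement is a difference in percentage points, not a ratio.
gain_points = (
    localization_accuracy(improved, annotated)
    - localization_accuracy(baseline, annotated)
) * 100
```

Here the baseline localizes 1 of 4 failures and the improved method 3 of 4, giving an absolute gain of 50 percentage points. The same gain expressed relatively would be a tripling, which is why the distinction matters when reading reported numbers.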

Why It Matters

AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering. By providing the "why" behind a failure through an auditable log, it's a prerequisite for real-world agent deployment.
