Anthropic Research
⭐⭐⭐⭐⭐
Measuring AI Agent Autonomy in Practice
Executive Summary
Anthropic analyzed millions of human-agent interactions across Claude Code and public API to understand: How much autonomy do people grant agents? The findings reveal a significant deployment overhang—models are capable of more autonomy than they exercise in practice.
Key Findings
Claude Code is working autonomously for longer.
The longest-running sessions nearly doubled: from under 25 minutes to over 45 minutes (99.9th percentile). This increase is smooth across model releases—suggesting it's not purely about model capability.
The longest-running sessions nearly doubled: from under 25 minutes to over 45 minutes (99.9th percentile). This increase is smooth across model releases—suggesting it's not purely about model capability.
User Experience Patterns
- New users: ~20% of sessions use full auto-approve
- Experienced users (750+ sessions): >40% use full auto-approve
- But experienced users also interrupt more often—intervening only when needed
- Claude Code pauses for clarification more often than humans interrupt it—2x more on complex tasks
Risk Profile
- Most agent actions are low-risk and reversible
- Software engineering accounts for nearly 50% of agentic activity
- Emerging usage in healthcare, finance, and cybersecurity
- Agents are used in risky domains, but not yet at scale
Methodology
- Definition: An agent is "an AI system equipped with tools that allow it to take actions"
- Studied both Claude Code (depth) and public API (breadth)
- Privacy-preserving infrastructure (CLIO)
Critical Insight: Deployment Overhang
Central Conclusion: The latitude granted to models in practice lags behind what they can handle.
METR estimates Claude Opus 4.5 can complete tasks with 50% success rate that would take a human nearly 5 hours. But the 99.9th percentile turn duration in practice is ~42 minutes—far below capability.
METR estimates Claude Opus 4.5 can complete tasks with 50% success rate that would take a human nearly 5 hours. But the 99.9th percentile turn duration in practice is ~42 minutes—far below capability.
Internal Results
- From August to December: Success rate doubled on most challenging tasks
- Average human interventions per session: 5.4 → 3.3
- Users achieving better outcomes while needing to intervene less often
Recommendations
- Model developers: Need new forms of post-deployment monitoring
- Product developers: New human-AI interaction paradigms
- Policymakers: Understand how autonomy and risk are managed together