Offensive Cybersecurity AI Time Horizons
TL;DR: AI offensive cybersecurity capabilities are improving rapidly, with a doubling time of just 5.7 months among models released since 2024. The most capable models now succeed on 50% of tasks that take human experts more than 3 hours.
Key Findings
- Doubling Time: 9.8 months (2019-present), steepening to 5.7 months for models released since 2024
- Current Capability: GPT-5.3 Codex and Opus 4.6 achieve 50% success on tasks taking human experts 3.1-3.2 hours
- Token Budget Impact: Re-running GPT-5.3 Codex failures at 10M tokens raises P50 from 3.1h to 10.5h
- Open-Weight Lag: GLM-5 lags closed-source frontier by 5.7 months
- Real-World Harm: The 2026 International AI Safety Report identifies cybersecurity as the domain with the strongest evidence of real-world AI harm
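To make the doubling-time figures above concrete, here is a minimal arithmetic sketch. It assumes the time horizon grows as a clean exponential with a fixed doubling time, which is a simplification of the fitted trend. The function names are hypothetical and chosen only for illustration.

```python
import math


def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Multiplicative growth in time horizon after `months_elapsed` months,
    assuming a constant doubling time (an idealized exponential trend)."""
    return 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months needed to grow from `current_hours` to `target_hours`
    under the same constant-doubling-time assumption."""
    return doubling_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    # A lag of one doubling time corresponds to a factor of two in horizon.
    print(growth_factor(5.7, 5.7))
    # Two doublings (a 4x increase) take twice the doubling time.
    print(months_to_reach(1.0, 4.0, 5.7))
```

Under this assumption, a lag equal to one doubling time (5.7 months at the recent 5.7-month rate) corresponds to roughly a factor of two in time horizon. Any forward extrapolation with these helpers is illustrative only, since the source reports fitted historical trends and not forecasts.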
Critical Incidents
- Late 2025: Anthropic disclosed the first documented large-scale AI-orchestrated cyber espionage campaign, in which a threat actor used Claude to decompose complex attack chains and automate 80-90% of operations
- Early 2026: Opus 4.6 discovered 500+ previously unknown high-severity vulnerabilities in open-source libraries that had been fuzzed for millions of CPU-hours
- January 2026: AISLE discovered all 12 CVEs in a coordinated OpenSSL release, including bugs dating to 1998
Methodology
This analysis uses METR's time-horizon methodology, which measures AI capability growth in human-equivalent task time. Each task is labeled with the time a skilled human would take to complete it. A model's time horizon at a given success rate is the human-time difficulty at which its fitted success curve crosses that threshold.
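The threshold-crossing idea can be sketched as follows. This is a minimal illustration and not the study's actual pipeline: it assumes the success curve is a logistic function of log2 task length, fits it with plain gradient ascent, and uses made-up task data. All names (`fit_success_curve`, `time_horizon`) and the sample data are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_success_curve(human_minutes, successes, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the mean log-likelihood. Returns (a, b)."""
    xs = [math.log2(m) for m in human_minutes]
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, successes):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon(a: float, b: float, threshold: float = 0.5) -> float:
    """Human task length (minutes) at which the fitted curve crosses
    `threshold`, i.e. solve a + b * log2(t) = logit(threshold) for t."""
    logit = math.log(threshold / (1.0 - threshold))
    return 2.0 ** ((logit - a) / b)


# Hypothetical data: human task lengths in minutes and whether the model
# succeeded (1) or failed (0). These are NOT results from the study.
minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
outcomes = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

a, b = fit_success_curve(minutes, outcomes)
p50 = time_horizon(a, b, threshold=0.5)

if __name__ == "__main__":
    print(f"slope b = {b:.3f}, P50 horizon = {p50:.1f} minutes")
```

A negative slope `b` means success probability falls as tasks get longer, and the P50 horizon is the task length where the fitted probability equals 0.5. Passing a different `threshold` (for example 0.8) gives the horizon at a stricter success rate.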
Benchmarks Used
- CyBashBench: Short-horizon terminal commands (5s - 5.6m)
- NL2Bash: Natural language to bash (9.5s - 18.3m)
- InterCode-CTF: PicoCTF challenges (4.5m - 41.3m)
- NYU-CTF: CSAW competition challenges (35.4m - 6.8h)
- CyBench: Professional global CTF (37.4m - 6.3h)
- CVEBench: Real-world CVE reproduction (141.6m - 4.6h)
- CyberGym: Memory-safety PoC generation (170.5m - 6.5h)
Implications
This research suggests that frontier AI offensive-cyber capability may diffuse into open-weight form on relatively short timelines. Ecological validity is limited to bounded, verifiable offensive subtasks and does not extend to the full scope of real-world operations.
Warning: These results are lower bounds on early-2026 frontier capability due to evaluation budget limitations.