In mid-September 2025, a Chinese state-sponsored hacking group walked up to Claude Code and started asking it to break into companies and government agencies. It worked. Across roughly 30 targets, the group, which Anthropic calls GTG-1002, used AI to perform 80 to 90 percent of the operational work, from reconnaissance to credential harvesting to data exfiltration. Human operators made perhaps four to six critical decisions across the entire campaign. Everything else was the model.
Anthropic disclosed the campaign in November 2025 and published a detailed technical report. At peak, the AI was firing off thousands of requests, often multiple per second, an attack speed physically impossible for a human operator to match. Claude did not work perfectly. It hallucinated credentials, claimed to have extracted secrets that turned out to be publicly available, and occasionally needed human intervention to stay on track. Those failures are why the campaign still required human operators rather than running fully autonomously. But the direction is clear.
A new paper from Lyptus Research, a UK government-affiliated cybersecurity research group, provides the first systematic measurement of how fast AI offensive cyber capability is improving. The answer: capability has doubled roughly every 9.8 months across frontier models since 2019, and since 2024 the pace has accelerated to a doubling time of 5.7 months.
The researchers applied a methodology adapted from METR, a lab that measures AI capability growth in terms of human-equivalent task time. They built a dataset of 291 tasks with completion transcripts and time estimates from 10 offensive cybersecurity professionals, then evaluated models across seven benchmarks covering command generation, capture-the-flag challenges, CVE exploitation, and memory-safety proof-of-concept generation.
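To make the doubling-time figure concrete: the METR-style fit treats the time horizon (the human-expert task length at which a model reaches 50 percent success) as growing exponentially with release date, so a straight-line fit in log space yields the doubling rate. Here is a minimal sketch in Python, using hypothetical release dates and horizons chosen only to illustrate the arithmetic; they land near the 9.8-month headline figure but are not the paper's data.

```python
# Minimal sketch of a METR-style doubling-time estimate.
# The (date, horizon) pairs below are hypothetical, not the paper's data.
import numpy as np

# (release date in years, 50%-success time horizon in human-expert hours)
models = [
    (2019.5, 0.02),
    (2021.0, 0.08),
    (2022.5, 0.3),
    (2024.0, 1.0),
    (2025.5, 3.1),
]

dates = np.array([d for d, _ in models])
log2_horizon = np.log2([h for _, h in models])

# Exponential growth is linear in log space:
# log2(horizon) = slope * date + intercept, so slope = doublings per year.
slope, intercept = np.polyfit(dates, log2_horizon, 1)

# The horizon doubles every 1/slope years; convert to months.
doubling_time_months = 12.0 / slope
print(f"doubling time ~ {doubling_time_months:.1f} months")
```

On this framing, the post-2024 acceleration to a 5.7-month doubling time simply corresponds to a steeper slope over the most recent models.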
The latest frontier models exceed the fitted trendlines. GPT-5.3 Codex and Opus 4.6 both achieve 50 percent success on tasks that take human experts roughly 3.1 to 3.2 hours. Without specialized scaffolding, Opus 4.6 also discovered more than 500 previously unknown high-severity vulnerabilities in open-source libraries that had already been fuzzed for millions of CPU-hours.
The Lyptus researchers believe their numbers are conservative. The fixed 2-million-token evaluation budget likely understates what current models can do: when they re-ran GPT-5.3 Codex's failures with a 10-million-token budget, the median human-expert time of the tasks it could complete stretched from 3.1 hours to 10.5 hours. At higher budgets still, they believe, the dataset would already be saturated.
The open-weight gap matters too. GLM-5, the most capable open-weight model available, trails the closed-source frontier by 5.7 months. Any given level of offensive cyber capability can therefore be expected to appear in open-weight form within months of first reaching the closed-source frontier.
The 2026 International AI Safety Report identifies cybersecurity as the domain where evidence of real-world harm from AI is now strongest. The GTG-1002 campaign is the most vivid illustration of that assessment.
The second signal in this story is a field experiment that got less attention. Researchers at INSEAD and Harvard ran a randomized controlled trial across 515 high-growth startups. Treated firms received approximately $25,000 in in-kind AI support: API credits, frontier model access, and onboarding assistance from OpenAI and Manus. Control firms did not.
The results were not subtle. After 12 months, treated firms generated 1.9 times the revenue of the control group, completed 12 percent more tasks, and were 18 percent more likely to acquire paying customers. They discovered 44 percent more AI use cases internally, 2.7 additional use cases on average. Capital demand fell by approximately $220,000, or 39.5 percent, relative to controls. No corresponding increase in labor demand was observed.
The mechanism is not simply AI doing tasks faster. Treated firms reorganized production around AI, discovering new configurations that compound over time. One additional AI use case prompted by the treatment led to 0.85 more completed tasks and approximately 26 percent higher revenue, not as a direct effect of that use case but as a signal of how deeply the firm had restructured its operations.
The 39.5 percent reduction in external capital demand is the figure that should draw attention from investors and policymakers. AI adoption appears to reduce startups' dependence on outside funding even as it accelerates revenue growth. The implications for venture capital models and startup economics are not yet well understood.
Anthropic's report also notes that the company gave government authorities private advance warning before the public disclosure. It has not published the criteria or methodology behind those warnings. If frontier labs can simultaneously publish capability benchmarks and issue private risk warnings to authorities without disclosing the thresholds that trigger them, the public record on AI risk is systematically incomplete. Policymakers and security teams outside that private channel have no basis to evaluate whether the warnings are calibrated correctly.
The Lyptus researchers note explicitly that their quantitative results are conditioned on seven benchmarks whose ecological validity is limited to bounded and verifiable offensive subtasks rather than the full scope of real-world operations. The benchmarks are saturating. The real world is not.
The question is not whether the failures that currently prevent full autonomy get fixed. The question is what happens after.