AI Models Shatter Benchmarks for Autonomous Cyber Capabilities

“Frontier AI’s autonomous cyber and software capability is advancing quickly: the length of cyber tasks that frontier models can complete autonomously has doubled on the order of months, not years,” the United Kingdom’s AI Security Institute (AISI) wrote.

AISI cyber ranges and the capability jump

In structured simulations that replicate multi-stage attacks against small, undefended enterprise networks, two recent frontier models posted results the AISI had not seen before. A newer checkpoint of Anthropic’s Claude Mythos Preview became the first model to complete both of the institute’s ranges. It solved “The Last Ones,” a 32-step simulated corporate network attack, in 6 of 10 attempts and completed “Cooling Tower,” a range no model had previously solved, in 3 of 10 attempts. OpenAI’s GPT-5.5 solved “The Last Ones” in 3 of 10 attempts.
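Ten attempts per range is a small sample, so the success rates above carry wide uncertainty. As a rough illustration, and not a calculation the AISI reported, the sketch below computes an approximate 95% Wilson score interval for each result. The function name and the choice of the Wilson method are assumptions made for this example.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Success counts as reported in the article (10 attempts each).
results = [
    ("Claude Mythos Preview, The Last Ones", 6),
    ("Claude Mythos Preview, Cooling Tower", 3),
    ("GPT-5.5, The Last Ones", 3),
]

for label, successes in results:
    low, high = wilson_interval(successes, 10)
    print(f"{label}: {successes}/10, interval {low:.2f} to {high:.2f}")
```

Under these assumptions, 6 of 10 is consistent with a true success rate of roughly 0.31 to 0.83, and 3 of 10 with roughly 0.11 to 0.60, so the gap between the two models on “The Last Ones” should be read with caution.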

Those results, the AISI wrote, substantially exceeded the doubling trend the institute had tracked since late 2024, outperforming every trend line it had measured. The institute uses an “80% reliability cyber time horizon” as a proxy for AI autonomy: it gauges the tasks a model can complete by how long those tasks take a human expert. Earlier this year, the institute had estimated that the metric was doubling roughly every five months.
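One common way to turn per-task results into a time horizon is to fit a logistic curve of success against the logarithm of human task time, then read off the task length at which predicted success equals the target reliability. The sketch below assumes that approach. The article does not describe the AISI’s exact procedure, and the coefficients here are hypothetical values chosen only for illustration.

```python
import math

def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed logistic model: success falls as log2(human task time) rises."""
    return 1 / (1 + math.exp(-(intercept - slope * math.log2(minutes))))

def time_horizon(reliability: float, intercept: float, slope: float) -> float:
    """Task length (human-expert minutes) where modelled success equals `reliability`."""
    logit = math.log(reliability / (1 - reliability))
    return 2 ** ((intercept - logit) / slope)

# Hypothetical fitted coefficients, NOT taken from AISI data.
INTERCEPT, SLOPE = 5.0, 0.8

h80 = time_horizon(0.80, INTERCEPT, SLOPE)
h50 = time_horizon(0.50, INTERCEPT, SLOPE)
print(f"80% horizon: {h80:.1f} min, 50% horizon: {h50:.1f} min")
```

Whatever the coefficients, the 80% horizon is shorter than the 50% horizon for the same model, because demanding higher reliability restricts the model to shorter tasks.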

Palo Alto Networks: volume, products, and patched vulnerabilities

Palo Alto Networks reached similar conclusions through independent testing. As an Anthropic Project Glasswing launch partner, the company began testing Claude Mythos in April. It also evaluated Claude Opus 4.7 and, through OpenAI’s Trusted Access for Cyber program, OpenAI’s GPT-5.5-Cyber.

In its testing across more than 130 products, Palo Alto Networks said the latest models were “extraordinarily capable at finding vulnerabilities and changing them into critical exploit paths in near-real-time.” The company released security advisories covering 26 CVEs representing 75 issues identified through AI model scanning — compared with a typical monthly volume of fewer than five CVEs. Palo Alto Networks reported that all important vulnerabilities in its SaaS products had been patched, and that patches were available for all customer-operated products.

Doubling time, corroboration from METR, and methodological caveats

The AISI noted important limits in its data: the estimates are based on a relatively small number of models, and the hardest tasks in the test suite have the least human comparison data. Still, the institute said the overall trend was robust — removing any single model from the analysis changed the estimated doubling time by less than a month in either direction.
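The robustness check described above resembles a leave-one-out analysis: refit the trend with each model removed in turn and see how far the doubling time moves. The sketch below shows the idea using an ordinary least-squares fit of log2(time horizon) against release date. The data points are hypothetical, not AISI measurements, and the fitting method is an assumption.

```python
import math

def doubling_time_months(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of log2(horizon) against months, inverted to months per doubling."""
    xs = [month for month, _ in points]
    ys = [math.log2(horizon) for _, horizon in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 1 / slope

# Hypothetical (months since first model, time horizon in minutes) pairs.
models = [(0, 10), (4, 18), (8, 31), (12, 52), (16, 95), (20, 160)]

full = doubling_time_months(models)
# Refit with each model dropped in turn.
leave_one_out = [
    doubling_time_months(models[:i] + models[i + 1:]) for i in range(len(models))
]
max_shift = max(abs(d - full) for d in leave_one_out)
print(f"full fit: {full:.2f} months, largest leave-one-out shift: {max_shift:.2f} months")
```

A small `max_shift` means no single model drives the estimate, which is the property the institute reported for its own data.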

Separate research from METR, a nonprofit that tracks how quickly AI handles software tasks, arrived at a nearly identical figure: a doubling time of approximately four months since late 2024. The AISI cautioned that “no single benchmark result should be read as a precise measure of AI capability,” while also stressing that “the direction of change and rapid growth have been consistent across the models, methodological choices and independent data we examined.”
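A one-month difference in doubling time compounds quickly. The short calculation below compares the two figures quoted in this article, the AISI’s earlier five-month estimate and METR’s four-month estimate, assuming each rate holds constant. That is a simplification for illustration, not a forecast from either organization.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Multiplicative growth after `months_elapsed` at a fixed doubling time."""
    return 2 ** (months_elapsed / doubling_months)

for doubling in (4, 5):
    one_year = growth_factor(12, doubling)
    two_years = growth_factor(24, doubling)
    print(f"doubling every {doubling} months: x{one_year:.1f} in a year, x{two_years:.1f} in two")
```

At a four-month doubling time, the task length grows 8-fold in a year and 64-fold in two; at five months, the figures are about 5.3-fold and 27.9-fold.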

Palo Alto Networks’ immediate priorities for enterprises

Find and fix vulnerabilities in code and applications before attackers do.
Shrink the attack surface and use AI to spot security misconfigurations.
Deploy detection and response tools across all systems, using machine learning to catch threats in real time.
Build security operations capable of responding in minutes, because AI-powered attacks may soon unfold that quickly.

What this means for technologists, policymakers, and enterprises

Technologists and security teams: Expect models like Claude Mythos Preview and GPT-5.5 to identify exploit chains and turn vulnerabilities into critical paths faster than prior generations; the jump in benchmark performance the AISI recorded suggests test suites and defensive tooling will need to accelerate in lockstep.
Policymakers and regulators: The AISI’s finding that autonomous cyber capability is advancing “on the order of months, not years” presents a moving target for pre-deployment evaluation and oversight; the institute is already developing more demanding evaluations, including new ranges and active cyber defenses, to better reflect real-world conditions.
Enterprises and procurement leaders: Palo Alto Networks’ experience (26 CVEs representing 75 issues, found by AI model scanning across more than 130 products, with important SaaS vulnerabilities already patched and patches available for customer-operated products) underscores the operational imperative to prioritize vulnerability remediation and to treat AI-driven scanning results as high-priority input into patch management and incident response.

The AISI and Palo Alto Networks agree on the headline: recent frontier models have materially accelerated autonomous cyber capability. Whether the jump observed in Claude Mythos Preview and GPT-5.5 is an isolated inflection or the start of a sustained, faster trajectory remains unclear; the institute and vendors are already recalibrating tests, advisories and defenses. The immediate outcome is concrete: vulnerabilities surfacing faster, a surge in CVEs compared with a typical month, and active steps to patch and raise operational tempo.

Read the original reporting at CyberScoop: https://cyberscoop.com/ai-autonomous-cyber-capability-benchmarks-broken-gpt5-claude-mythos/