ExploitBench is “the first independent benchmark that measures what AI models can actually do with a vulnerability, not just identify it but exploit it step by step,” David Brumley told Infosecurity Europe 2026.
ExploitBench: a graded test for real-world exploitation
Bugcrowd, working with experts at Carnegie Mellon University and “top Chrome vulnerability researchers,” launched ExploitBench in May 2026 to measure whether frontier large language models can do more than point out flaws — whether they can chain discovery into usable exploits. Presented at Infosecurity Europe 2026, the benchmark departs from simple crash/no-crash checks and instead scores staged progress through five capability tiers that culminate in arbitrary code execution against a vulnerable V8 build.
Mythos vs GPT‑5.5: the head-to-head numbers
In the first public runs discussed at the conference, Anthropic’s Claude Mythos substantially outperformed OpenAI’s GPT‑5.5. Mythos posted an average ExploitBench score of 9.90 out of 16 and reached the highest tier on 21 of 41 tested vulnerabilities. GPT‑5.5 averaged 5.51 and reached the top tier in just two cases. Bugcrowd’s David Brumley said Mythos “is able to exploit a one-day vulnerability in Chrome about 50% of the time,” adding that such a one-day exploit could attract a reward of up to $10,000 from Google if no previous exploit existed.
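As a quick check on how the reported figures relate to one another, the arithmetic can be sketched in a few lines of Python. The numbers are the ones quoted above; the variable and function names are illustrative assumptions and are not part of ExploitBench itself.

```python
# Figures as reported in the article; all names below are illustrative.
TOTAL_VULNS = 41   # vulnerabilities tested
MAX_SCORE = 16     # maximum ExploitBench score

reported = {
    "Claude Mythos": {"avg_score": 9.90, "top_tier_hits": 21},
    "GPT-5.5": {"avg_score": 5.51, "top_tier_hits": 2},
}


def summarize(stats):
    """Turn raw counts into fractions of the possible maximum."""
    return {
        "top_tier_rate": stats["top_tier_hits"] / TOTAL_VULNS,
        "score_fraction": stats["avg_score"] / MAX_SCORE,
    }


summary = {model: summarize(stats) for model, stats in reported.items()}

for model, s in summary.items():
    print(
        f"{model}: top tier on {s['top_tier_rate']:.1%} of targets, "
        f"average score at {s['score_fraction']:.1%} of the maximum"
    )
```

On these figures, Mythos reached the top tier on roughly 51% of the 41 targets, in line with Brumley’s “about 50%” remark, while GPT‑5.5 did so on roughly 5%.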
Why staged outcomes and V8 matter
ExploitBench’s staged scoring is designed to capture planning and multi-step execution, attributes Bugcrowd argues distinguish genuine exploitation capability from superficial signals. The target chosen for these tests, a vulnerable build of V8, matters because V8 is the JavaScript/WebAssembly engine that powers Google Chrome, Microsoft Edge, Node.js and Cloudflare Workers. Brumley argued these choices create a demanding, high‑value benchmark: Chrome is “made of hundreds of thousands of lines of codes, it’s been audited for years,” and finding exploitable chains there is especially meaningful, he said. He also cautioned that the results should not be generalized to all application classes.
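To illustrate why a staged record carries more information than a crash/no-crash flag, here is a minimal sketch in Python. It assumes each run is logged with the highest tier it reached, from 0 (no progress) to 5 (arbitrary code execution). The article does not name the five tiers or explain how they map onto the 16-point score, so the `RunResult` type, the identifiers and the sample data are hypothetical and do not represent ExploitBench’s actual rubric.

```python
from collections import Counter
from dataclasses import dataclass

NUM_TIERS = 5  # the article reports five capability tiers; their names are not given


@dataclass(frozen=True)
class RunResult:
    vuln_id: str       # hypothetical identifier for a tested vulnerability
    highest_tier: int  # 0 = no progress, NUM_TIERS = arbitrary code execution


def tier_histogram(results):
    """Count how many runs stopped at each tier, including tiers with zero runs."""
    counts = Counter(r.highest_tier for r in results)
    return {tier: counts.get(tier, 0) for tier in range(NUM_TIERS + 1)}


def top_tier_rate(results):
    """Fraction of runs that reached the final tier."""
    if not results:
        return 0.0
    return sum(r.highest_tier == NUM_TIERS for r in results) / len(results)


# Made-up sample data, for illustration only.
sample = [
    RunResult("vuln-a", 5),
    RunResult("vuln-b", 3),
    RunResult("vuln-c", 0),
    RunResult("vuln-d", 5),
]

hist = tier_histogram(sample)
rate = top_tier_rate(sample)
print(hist)
print(rate)
```

A binary check would score `vuln-b` and `vuln-c` identically as failures, whereas the histogram shows that one run made partial progress and the other made none.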
Bugcrowd’s reinforcement learning environments and the remediation imperative
Bugcrowd released ExploitBench alongside reinforcement learning (RL) environments intended both to measure and to improve model capability. “We put out ExploitBench to motivate the state of where models are at on actual exploitation tasks,” Brumley said, while CEO Dave Gerry described the benchmark and training environments as complementary: one measures capability and the other drives targeted RL training with industry model partners.
Gerry warned that automation and AI are already being integrated into attacker workflows, raising the pace at which discovered flaws can be turned into active exploits. He urged defenders to develop AI‑driven remediation at scale, saying fixes must move “from ticket queues into near‑real‑time workflows” and that “finding more bugs faster only amplifies the noise unless you can automatically prioritize and act on the ones that actually enable exploits.” Brumley echoed the urgency: defenders need contextual intelligence that prioritizes and remediates the vulnerabilities that matter most, and he signaled forthcoming tools and announcements to help provide that intelligence.
What this means for defenders, vendors, and threat actors
- Defenders and security teams: Bugcrowd’s leaders urged teams to rebuild remediation pipelines so they run in near real time, and to use models that not only find flaws but also recommend fixes at scale and, where safe, initiate them.
- Vendors and platform maintainers: The V8 results highlight that high‑value, heavily audited codebases can still yield exploitable chains; Bugcrowd framed ExploitBench as a way to surface and prioritize vulnerabilities that enable real exploits.
- Threat actors and researchers: The benchmark suggests models that plan and replan can lower the technical barrier for chaining discoveries into exploits. Michael Price of VulnCheck pointed to documented improvements in planning ability — citing a UK AI Security Institute report on Mythos — but tempered expectations: “They’re getting better, but they still are not actually like that great,” and he predicted incremental improvements “like every month or every quarter” that could compound over years.
These initial findings put concrete numbers to what had been a theoretical fear: models can increasingly plan and execute multi‑stage exploit sequences, though the evidence so far is narrow in scope. Bugcrowd’s leaders emphasized both caution and urgency. The benchmark covers a specific, sophisticated target, and the company has paired measurement with RL training and promises of tools to help defenders. For organizations and model providers alike, the immediate task is not simply to watch the scores climb but to build remediation and prioritization workflows that can keep pace.
https://www.infosecurity-magazine.com/news/mythos-gpt-chrome-exploits/




