Open Source Models Challenge Dominance in Automated Bug Finding

Mythos found 271 Firefox flaws — but none a human couldn’t spot.

Ari Herbert-Voss at Black Hat Asia

Speaking at the Black Hat Asia conference in Singapore, Ari Herbert‑Voss, CEO of AI security startup RunSybil and OpenAI’s first security hire, argued that the headline performance of Anthropic’s Mythos does not make it indispensable for automated vulnerability discovery. Herbert‑Voss told the conference that Mythos is strong both at detecting “shallow” bugs, which are well‑described and easy to validate, and at surfacing more complex vulnerabilities. He briefly attributed that effectiveness to what he called “supralinear scaling.”

Open‑source “scaffolding” versus Anthropic’s Mythos

Herbert‑Voss made a practical case that defenders and attackers alike can reach comparable results with open‑source models by building what he called “scaffolding,” a harness that runs multiple models together. That architecture, he said, improves defense in depth: different open models tend to catch different classes of flaws, which provides a hedge against any single model’s blind spots. He also noted a simple market reality: Mythos is expensive to build and run, and Anthropic has kept access tightly restricted, meaning Mythos “may never be publicly available.” For many organizations, he argued, open‑source alternatives are therefore not just viable but necessary.
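As a rough illustration of the scaffolding idea, the Python sketch below runs several scanners over the same source and records which of them reported each finding. Everything in it is hypothetical: the `Finding` record, the `run_scaffold` function, and the two toy pattern‑matching scanners merely stand in for calls to real open models, whose names and interfaces the talk did not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of one reported flaw (hashable so it can be merged)."""
    file: str
    line: int
    kind: str


# A scanner takes (file name, source text) and returns findings.
# In a real harness each scanner would wrap a call to a different model.
Scanner = Callable[[str, str], List[Finding]]


def scanner_a(file: str, source: str) -> List[Finding]:
    """Toy stand-in for one model: only knows about strcpy."""
    return [
        Finding(file, number, "unsafe-strcpy")
        for number, text in enumerate(source.splitlines(), 1)
        if "strcpy(" in text
    ]


def scanner_b(file: str, source: str) -> List[Finding]:
    """Toy stand-in for a second model: knows strcpy and system."""
    found = []
    for number, text in enumerate(source.splitlines(), 1):
        if "strcpy(" in text:
            found.append(Finding(file, number, "unsafe-strcpy"))
        if "system(" in text:
            found.append(Finding(file, number, "command-injection"))
    return found


def run_scaffold(
    scanners: Dict[str, Scanner], file: str, source: str
) -> Dict[Finding, Set[str]]:
    """Run every scanner and map each distinct finding to the scanners that reported it."""
    merged: Dict[Finding, Set[str]] = {}
    for name, scan in scanners.items():
        for finding in scan(file, source):
            merged.setdefault(finding, set()).add(name)
    return merged


DEMO_SOURCE = "\n".join([
    "int main(int argc, char **argv) {",
    "    strcpy(buf, argv[1]);",
    "    system(argv[2]);",
    "}",
])

merged = run_scaffold(
    {"model_a": scanner_a, "model_b": scanner_b}, "demo.c", DEMO_SOURCE
)
for finding, reporters in merged.items():
    print(finding, "reported by", sorted(reporters))
```

Findings reported by only one scanner are where model diversity pays off, while those reported by several may deserve priority during triage.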

Supralinear scaling, NDAs, and Mythos’s capabilities

Herbert‑Voss described “supralinear scaling” as a pattern in which doubling the data, compute, and training time can produce more than double the capability, a multiplier effect he said helps explain Mythos’s reported performance. He hinted that such multipliers might grow even larger but declined to elaborate, citing a non‑disclosure agreement. The talk also acknowledged concrete outputs attributed to Mythos in public reporting: for example, the claim that Mythos found 271 Firefox flaws, a tally that drew scrutiny because, as the same reporting noted, it included “none a human couldn’t spot.”
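One way to make the scaling claim concrete is to assume that capability follows a simple power law in resources. This is an illustrative assumption, not a model Herbert‑Voss presented. Here C is capability, R is a combined measure of data, compute, and training time, and k and α are hypothetical constants:

```latex
C(R) = k R^{\alpha},
\qquad
\frac{C(2R)}{C(R)} = \frac{k\,(2R)^{\alpha}}{k\,R^{\alpha}} = 2^{\alpha}
```

Under this assumption, doubling the resources more than doubles capability exactly when α > 1. With α = 1.2, for instance, the multiplier would be 2^1.2 ≈ 2.3.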

Fuzzing, AI noise, and the persistence of human work

Herbert‑Voss warned that automation does not eliminate human effort but changes its shape. He observed that fuzzing, a long‑standing testing technique that injects random or near‑random data into software, already generates so many warnings that triage becomes a heavy burden for humans. He said AI bug‑hunters create the same problem and that he expects it to continue. Human expertise remains necessary both to orchestrate collections of open models so that together they approximate Mythos‑grade performance and to assess the resulting bug reports. In short: automation can multiply findings, but it does not remove the need for skilled people to sort signal from noise.
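The triage burden can be seen even in a toy setting. The Python sketch below is a minimal illustration, not a production fuzzer: `parse_record` is a hypothetical target with two deliberately planted bugs, and `fuzz` feeds it random byte strings and buckets the resulting crashes by exception type, a deliberately crude stand‑in for real crash deduplication.

```python
import random
from collections import Counter
from typing import Callable


def parse_record(data: bytes) -> int:
    """Hypothetical target: first byte is a length, the rest is the payload."""
    if len(data) < 2:
        raise ValueError("record too short")  # graceful rejection, not a bug
    length = data[0]
    payload = data[1:]
    if length == 0:
        return 100 // length  # planted bug 1: division by zero
    return payload[length - 1]  # planted bug 2: no bounds check on length


def fuzz(target: Callable[[bytes], int], runs: int, seed: int = 0) -> Counter:
    """Throw random inputs at the target and count crashes per exception type."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    crashes: Counter = Counter()
    for _ in range(runs):
        size = rng.randrange(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            continue  # the target rejected the input cleanly
        except Exception as exc:  # anything else counts as a crash
            crashes[type(exc).__name__] += 1
    return crashes


crashes = fuzz(parse_record, runs=20000)
print("raw crashes:", sum(crashes.values()))
print("distinct buckets:", dict(crashes))
```

Thousands of raw crashes collapse into just two underlying bugs here. In real software the buckets are far less clean, which is why someone still has to decide which reports matter.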

What this means for technologists, procurement leaders, and attackers and defenders

Technologists and security teams: Expect to build orchestration pipelines — Herbert‑Voss recommended “scaffolding” multiple open models — and to invest human hours in validating AI‑generated findings, because the tooling will produce large volumes of data that require triage.
Procurement leaders and enterprise defenders: Cost and access are central. Mythos’s high build and run costs and Anthropic’s restricted access create a practical incentive to select open‑source alternatives that can be deployed and scaled internally or assembled from multiple services.
Attackers and defenders alike: Herbert‑Voss argued both sides can achieve comparable results with open models; defenders should plan for the likelihood that adversaries will leverage similar capabilities and that the defense in depth afforded by diverse models will be important.

Herbert‑Voss closed on an economic argument: the incentive to use AI is real because “someone’s got to use services that pay for all those GPUs and datacenters,” and that cost pressure will act as a forcing function driving infosec teams toward AI. He believes the shift will ultimately strengthen proactive and defensive work even as it preserves substantial human workloads. He also pointed to specific examples of persistent operational risk, from ransomware attacks on vendors to complex hybrid‑cloud attack surfaces, underscoring why organizations are hunting for scalable, affordable tooling.

The claim at Black Hat Asia is straightforward: Mythos demonstrates what a highly resourced model can do, but parity is achievable through careful engineering of open models and human oversight. Herbert‑Voss left the audience with a final technical and strategic question: if supralinear scaling yields outsized capability for closed, expensive models, can combinations of open models plus orchestration truly replicate that multiplier at scale? He suggested it is possible, but the answer will depend on work that must be done in the lab and on the operations floor.
