Anthropic's Mythos Preview Bolsters Vulnerability Discovery

"This is a lot closer to `just go and find something` than anything I’ve seen so far," said one of XBOW's testers after early trials of Anthropic’s Mythos Preview.

Early access and the XBOW testing gauntlet

XBOW received early access to Mythos Preview and ran it through a disciplined program: a 10-person cross-functional team, the same internal benchmarking framework used against Opus 4.7 and GPT-5.5, and a broader set of scenarios that included frozen vulnerable open-source applications, interactive workflows, and native-code and reverse-engineering tasks. The firm evaluated Mythos Preview both inside Claude Code and as a raw model via API to separate the effects of orchestration, tooling, and live-site access.

Mythos Preview’s strength in source-code audits

XBOW concluded that Mythos Preview is "extremely powerful for source code audits." In its web exploit benchmark, where a pass requires a validated proof-of-concept exploit within a budget of up to 80 scripted actions, Mythos Preview sharply reduced false negatives compared with Opus 4.6: a 42% reduction overall and a 55% reduction when both models were given the site’s source code. The evaluators emphasized that Mythos Preview is notably better at reading and reasoning about code than at merely writing it, and that token-for-token the model "hones in on the vulnerability with absolutely unprecedented precision."
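To make the headline percentages concrete: a relative reduction in false negatives compares how many known-vulnerable targets each model failed to exploit. The sketch below uses hypothetical miss counts, since the report as summarized here gives only the percentages; the counts were chosen so the result matches the reported 42%.

```python
def relative_reduction(baseline_misses: int, new_misses: int) -> float:
    """Fractional drop in false negatives relative to the baseline model."""
    if baseline_misses <= 0:
        raise ValueError("baseline_misses must be positive")
    return (baseline_misses - new_misses) / baseline_misses


# Hypothetical counts for illustration only, not taken from XBOW's report.
baseline_misses = 50  # vulnerable targets the older model failed to exploit
new_misses = 29       # vulnerable targets the newer model failed to exploit

reduction = relative_reduction(baseline_misses, new_misses)
print(f"{reduction:.0%}")  # prints 42%
```

Note that a 42% reduction in misses is not the same as a 42-point gain in pass rate; the size of the pass-rate change depends on how many targets the baseline was already solving.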

Live-site validation and XBOW’s orchestration

Despite Mythos Preview’s source-code prowess, XBOW’s results stress that live-site validation remains crucial. Many exploitable issues arise from configuration, dependencies, deployment choices, or how components are combined—factors not obvious from source alone. XBOW’s benchmark structure and service model are designed to give frontier models safe, structured access to live application behavior so that leads from code audits can be probed and proven. XBOW found that removing live-site access hurt Mythos Preview’s performance more than removing source-code access, and that the best outcomes came from combining both kinds of access: analyze the code, probe the running site, then craft and validate an exploit.
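The analyze, probe, validate sequence can be pictured as a simple filter over audit leads. The sketch below is a hypothetical illustration, not XBOW's harness: the `Lead` and `Finding` types, the `triage` function, and the stubbed `probe` and `validate` callbacks are all assumed names, and the callbacks stand in for whatever controlled tooling actually touches the running application.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Lead:
    """A candidate vulnerability produced by a source-code audit."""
    location: str
    hypothesis: str


@dataclass
class Finding:
    """A lead that was confirmed against the running application."""
    lead: Lead
    evidence: str


def triage(
    leads: Iterable[Lead],
    probe: Callable[[Lead], bool],
    validate: Callable[[Lead], Optional[str]],
) -> List[Finding]:
    """Keep only leads that are reachable on the live site and then validated."""
    findings = []
    for lead in leads:
        if not probe(lead):        # not reachable in this deployment
            continue
        evidence = validate(lead)  # None means the lead could not be proven
        if evidence is not None:
            findings.append(Finding(lead, evidence))
    return findings


# Stubbed example: two leads from a code audit, one route disabled in deployment.
leads = [
    Lead("/search", "unsanitized query parameter"),
    Lead("/admin/export", "missing authorization check"),
]
deployed_routes = {"/search"}  # hypothetical stand-in for live-site probing

confirmed = triage(
    leads,
    probe=lambda lead: lead.location in deployed_routes,
    validate=lambda lead: "stubbed evidence",
)
print([f.lead.location for f in confirmed])  # ['/search']
```

The point of the structure is that a code-level lead never becomes a reported finding on its own; it has to survive contact with the deployed system first.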

Judgment, command safety, and practical limits

XBOW reported mixed results on judgment tasks. Mythos Preview often rejected false positives better than its predecessors but could be overly literal, favoring the letter of a rule over its spirit, and as a result it sometimes missed true positives when the evidence did not strictly meet its criteria. On a command-safety benchmark (deciding whether a script is safe to run), Haiku 4.5 achieved 90.1% accuracy after prompt optimization, Opus 4.6 posted 81.2%, and Mythos Preview reached 77.8%. XBOW’s assessment is that Mythos Preview "needs precise prompts, explicit threat models, and validation infrastructure" to translate strong reasoning into reliable operational security outcomes.
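The tradeoff described here, fewer false positives at the price of missed true positives, shows up clearly once a judge's decisions are tallied in a confusion matrix. The counts below are hypothetical and are not XBOW's data; they only illustrate how a stricter judge can improve accuracy and false-positive rate while recall drops.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Summarize a judge's decisions from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),                # share of real issues accepted
        "false_positive_rate": fp / (fp + tn),   # share of non-issues accepted
    }


# Hypothetical judges scored on the same 50 real and 50 bogus findings.
lenient = rates(tp=45, fp=20, tn=30, fn=5)
strict = rates(tp=35, fp=5, tn=45, fn=15)

for name, result in (("lenient", lenient), ("strict", strict)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

In this made-up example the strict judge has the better accuracy (0.80 versus 0.75) and far fewer false positives, yet it misses three times as many real issues, which is why a single accuracy figure does not settle whether a judge is fit for a given workflow.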

Native-code discovery, reverse engineering, and browser workflows

Beyond web apps, XBOW found Mythos Preview highly capable at native-code vulnerability discovery and reverse engineering. In Chromium-related testing it reportedly found more real bugs with fewer false positives than prior baselines; in V8 sandbox work it identified true positives where earlier approaches had produced many unexploitable findings. The model also reasoned through unusual firmware and embedded-systems contexts. For browser-driven workflows, Mythos Preview’s visual acuity roughly matched Sonnet 4.6 and dramatically outperformed Opus 4.6, making it practically effective for selecting the right UI actions even if not pixel-perfect on coordinates.

Cost tradeoffs for procurement and security teams

XBOW flags a clear operational constraint: Mythos Preview is large and expensive. At the time of writing, Anthropic had not made it available via public APIs, and the company said it would be about 5× as expensive as an Opus model token-for-token. XBOW’s cost-normalized comparisons suggest Mythos Preview is efficient for high-accuracy needs but not always the best value; sometimes it is preferable to run a different model, such as GPT-5.5, for longer or over multiple iterations. XBOW therefore maintains a cadre of models rather than relying on a single option.
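A rough way to reason about this tradeoff is to ask what the same budget buys from a cheaper model. The sketch below takes the roughly 5× price ratio from the article and assumes everything else: the per-run success rates are hypothetical, every run is assumed to use the same number of tokens, and repeated runs are treated as independent, which is optimistic.

```python
def success_after_runs(p_single: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** runs


PRICE_RATIO = 5      # from the article: about 5x an Opus model per token

p_expensive = 0.60   # hypothetical per-run success rate, pricier model
p_cheap = 0.30       # hypothetical per-run success rate, cheaper model

one_expensive_run = success_after_runs(p_expensive, 1)
five_cheap_runs = success_after_runs(p_cheap, PRICE_RATIO)

print(f"one expensive run: {one_expensive_run:.2f}")
print(f"{PRICE_RATIO} cheap runs:      {five_cheap_runs:.2f}")
```

Under these made-up numbers the five cheaper runs come out ahead (about 0.83 versus 0.60), but the conclusion flips easily. Real attempts are correlated, since a model tends to fail repeatedly on the same hard targets, so the benefit of reruns is overstated here, and a larger gap in per-run success would favor the more expensive model.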

What this means for pentesters, developers, and open-source maintainers

Pentesters: Use frontier models as a brain inside a controlled harness—source-code leads are valuable, but live-site probing and exploit validation remain the decisive step.
Developers and security teams: Mythos Preview can surface strong candidate vulnerabilities from code audits, but those findings need validation to determine exploitability and real-world relevance.
Open-source maintainers: XBOW reported discovering and disclosing "quite a few new vulnerabilities" after using Mythos Preview on open-source projects during week one, underscoring the need for rapid triage and patching workflows.

XBOW’s testing paints Mythos Preview as a major technical advance—a powerful analytic brain that still requires a skilled body and orchestration to turn leads into safe, validated security action. At the time of writing, the remaining practical questions include how Anthropic will price and distribute Mythos Preview, and how defenders will balance per‑token cost against the option of running other models for longer.

Original XBOW test report on BleepingComputer