Anthropic Bolsters AI Models with Enhanced Security Guardrails

Redeployment timeline and where the models will appear

Anthropic announced on June 30 that it would lift a pause enacted 19 days earlier and redeploy its frontier models, Claude Mythos 5 and Claude Fable 5, starting July 1. The pause had followed a US government export-control directive tied to a security report; Anthropic said the export-control decision was lifted the same day it announced the redeployment.

Fable 5 will be available globally across the Claude Platform, including Claude.ai, Claude Code and Claude Cowork. For premium subscribers on Pro, Max, Team and select Enterprise plans, Fable 5 will account for up to 50% of weekly usage limits through July 7; after that date the model will be available via usage credits. Anthropic also confirmed that the general-access model will be rolled out on AWS, Google Cloud and Microsoft Foundry.

What Anthropic changed in Fable 5: the safety classifier and a known jailbreak

Anthropic says the immediate trigger for the export-control action was an Amazon report describing a jailbreak technique that got Fable 5 to identify software vulnerabilities and, in one case, provide an exploit, effectively bypassing the model's built-in safeguards. Anthropic characterized the finding as not having “expose[d] any unique Mythos-level cyber capabilities.”

In response, Anthropic released a new version of Fable 5 that includes “an improved safety classifier that targets and blocks the behavior described in the report.” The company described a classifier as a small automated AI system that detects potentially harmful user prompts during an interaction with an LLM and blocks the model from responding to those requests.

Anthropic reported that the new classifier blocks the Amazon-identified jailbreak “in over 99% of cases.” It conceded that it may, “in a very small fraction of cases,” still provide information after a potentially harmful user request but said that output would not be “detailed enough to help a cyber attacker.” When a request to Fable 5 is blocked, Anthropic said users will be notified that the interaction has been redirected to Opus 4.8.
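The flow described above (score the prompt, block on a hit, notify the user and hand off to another model) can be pictured with a small sketch. Everything in it is an illustrative assumption rather than Anthropic's implementation: the model identifiers, the `route_request` function, the threshold and the keyword-based `toy_score` stand-in are all hypothetical, and a production classifier would be a trained model, not a string match.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical identifiers, used only for illustration.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"


@dataclass
class Routing:
    model: str              # which model would answer the request
    notice: Optional[str]   # message shown to the user when redirected


def route_request(
    prompt: str,
    risk_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Routing:
    """Send a prompt to the primary model unless the scorer flags it."""
    if risk_score(prompt) >= threshold:
        # Flagged: the primary model does not respond; the user is told
        # that the interaction was redirected.
        return Routing(FALLBACK_MODEL, "This interaction has been redirected to Opus 4.8.")
    return Routing(PRIMARY_MODEL, None)


def toy_score(prompt: str) -> float:
    """Stand-in scorer (assumption): flags any prompt containing 'exploit'."""
    return 1.0 if "exploit" in prompt.lower() else 0.0


if __name__ == "__main__":
    for text in ("Fix this off-by-one bug", "Write an exploit for this server"):
        print(text, "->", route_request(text, toy_score))
```

A scorer this crude would also flag a harmless prompt such as "explain how this exploit was patched," which is a small-scale picture of the false-positive problem any such gate has to manage.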

The company also acknowledged the trade-off: the classifier “comes at the cost of flagging benign requests more often during routine coding and debugging tasks,” and Anthropic said it will refine safeguards to better distinguish legitimate use from misuse and reduce false positives.

Mythos 5, CAISI testing, and the Glasswing program

Before the June 30 lifting of export controls for worldwide distribution, the US government approved Mythos 5 for redeployment to a set of US organizations described as those that “operate and defend critical infrastructure.” Anthropic said it continues to coordinate with the government to expand access to a broader set of domestic and international partners under what the company calls the Glasswing program.

Anthropic also said it has collaborated with the US government on pre-deployment testing and evaluation. Researchers from the US Department of Commerce’s Center for AI Standards and Innovation (CAISI) tested the new safeguards and described them as “extraordinarily strong,” according to the company.

Industry coordination and a new HackerOne channel

Anthropic reported working with Amazon, Microsoft, Google and other Glasswing partners to draft a consensus framework for assessing the severity of AI jailbreaks, including an effort to define what constitutes a “universal jailbreak” and how developers should respond. The company has also launched a HackerOne program to allow security researchers to submit potential cyber jailbreaks they discover in Fable 5 for review.

What this means for technologists, policymakers, and affected enterprises

Technologists and security teams: Expect a higher false-positive rate in routine coding and debugging workflows while Anthropic refines the classifier. Teams that depend on LLM-assisted code review or vulnerability discovery will need to account for sessions being redirected to Opus 4.8 and should monitor how the classifier behaves on their own workloads.
Policymakers and regulators: The CAISI characterization of the safeguards as “extraordinarily strong” and the staged redeployment — Mythos 5 first to US critical-infrastructure organizations, then broader release — illustrate a coordinated, evidence-driven approach to pre-deployment testing and access control that policymakers can point to as a model for managing frontier models.
Affected enterprises and procurement leaders: Anthropic’s usage limits for premium tiers through July 7, and the later shift to usage credits, will change cost and capacity assumptions for teams that planned to rely on Fable 5. Enterprises using cloud-hosted deployments on AWS, Google Cloud and Microsoft Foundry should also factor the new classifier’s behavior into developer experience and incident response playbooks.

Anthropic’s public account lays out a deliberate pause and reintroduction: a known jailbreak was identified, a classifier was added and tested, and access was staged for critical-infrastructure organizations before the broader global rollout resumed. The immediate questions left on the table are operational: how often legitimate workflows will be flagged, how quickly classifier refinements will reduce false positives, and how the consensus framework drafted with major cloud partners will be put into practice. For now, the company’s approach, which pairs a technical mitigation with industry and government testing and a public bug-submission channel, is what it is offering to critics and customers alike.
