UK Hackathons Expose 400+ Vulnerabilities in AI-Powered Code Scans

“Rather than mandate a single approach, we gave teams model access and let them build their own tooling, noticing what worked each week and building on the best approaches,” the Government Cyber Coordination Centre (GC3) said.

GC3 hackathons: weekly, in-person exercises across nine departments

The GC3 — an initiative run jointly by the National Cyber Security Centre (NCSC) and the Department for Science, Innovation and Technology (DSIT) — ran a series of weekly, in-person hackathons that used frontier AI models to scan public code repositories across nine government departments. The exercise intentionally avoided a single mandated technique, allowing teams to create bespoke pipelines and tooling and iterate on what proved effective during each session. The entire project consumed roughly £13,000 in model tokens (reported as $17,467).

Concrete results: 407 findings and critical flaws identified and remediated

Participants identified 407 findings, including critical flaws described in the report as authentication bypass, data exposure and remote code execution. The report states that some of those findings were already known and mitigated through compensating controls, while others were previously unknown zero-day vulnerabilities. All critical and high-risk weaknesses that were assessed as exploitable have been remediated, and the report records no evidence of exploitation.
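For a rough sense of scale, the reported token spend can be divided by the number of findings. The short Python sketch below does only that arithmetic on the figures quoted above. The per-finding cost is a back-of-the-envelope number, not one the report states, and it ignores staff time and the fact that some findings were already known.

```python
# Figures quoted in the article; the GBP amount is described as "roughly" 13,000.
TOKEN_COST_GBP = 13_000
TOKEN_COST_USD = 17_467
FINDINGS = 407

# Token spend per finding, in each currency.
cost_per_finding_gbp = TOKEN_COST_GBP / FINDINGS
cost_per_finding_usd = TOKEN_COST_USD / FINDINGS

# Exchange rate implied by the two reported totals (approximate by construction).
implied_usd_per_gbp = TOKEN_COST_USD / TOKEN_COST_GBP

print(f"About £{cost_per_finding_gbp:.2f} in tokens per finding")
print(f"About ${cost_per_finding_usd:.2f} in tokens per finding")
print(f"Implied rate: {implied_usd_per_gbp:.2f} USD per GBP")
```

Because the sterling total is rounded, the per-finding figure in pounds is best read as an order of magnitude rather than a precise unit cost.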

Multiple technical approaches: Claude Skills, traditional scanners and agentic pipelines

Teams took divergent technical routes. One team created five new domain-specific Claude Skills to produce a “reusable, scoped and consistent approach” applicable across selected open-source repositories and operators. Another team combined established tools — Gitleaks, Trivy, Semgrep and Hadolint — to generate initial findings, then applied models to map those results against OWASP and CWE frameworks, compose findings into attack paths, and confirm viability through a triage stage. A separate group built a six-stage agentic pipeline in which each stage read and challenged the last.
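The scanner-first approach described above can be pictured as a normalisation step followed by a mapping step. The Python sketch below is illustrative only: the `Finding` shape, the rule names and the `RULE_TO_CWE` table are assumptions made for this example, the records are treated as already parsed rather than read from any tool's real output format, and a static lookup stands in for the model call the team actually used to map results to CWE.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule-to-CWE table; the rule names are invented for illustration.
# In the hackathon this mapping was done by a model, not a static lookup.
RULE_TO_CWE = {
    "hardcoded-secret": "CWE-798",        # Use of Hard-coded Credentials
    "sql-injection": "CWE-89",            # SQL Injection
    "container-runs-as-root": "CWE-250",  # Execution with Unnecessary Privileges
}


@dataclass(frozen=True)
class Finding:
    tool: str
    rule: str
    path: str
    line: int
    cwe: Optional[str] = None


def normalise(tool: str, raw: dict) -> Finding:
    """Turn one already-parsed scanner record into a common Finding shape."""
    rule = raw["rule"]
    return Finding(tool, rule, raw["path"], int(raw["line"]), RULE_TO_CWE.get(rule))


def group_by_file(findings):
    """Group findings per file so a later stage can look for attack paths."""
    grouped = {}
    for finding in findings:
        grouped.setdefault(finding.path, []).append(finding)
    return grouped


# Made-up sample records, one tuple per (tool, parsed record).
raw_records = [
    ("gitleaks", {"rule": "hardcoded-secret", "path": "app/settings.py", "line": 12}),
    ("semgrep", {"rule": "sql-injection", "path": "app/db.py", "line": 40}),
    ("hadolint", {"rule": "container-runs-as-root", "path": "Dockerfile", "line": 3}),
    ("semgrep", {"rule": "unmapped-rule", "path": "app/db.py", "line": 77}),
]

findings = [normalise(tool, raw) for tool, raw in raw_records]
# Anything the table cannot classify is left for a model or a human to map.
needs_model_review = [f for f in findings if f.cwe is None]
by_file = group_by_file(findings)
```

Putting every tool's output into one shape first is what makes the later steps possible: findings that share a file or service can be grouped, and anything without a framework mapping is routed onward rather than silently dropped.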

Lessons learned: frontier models as components, the centrality of triage, and next priorities

Frontier models performed best when used as “tightly scoped components inside a structured pipeline,” the GC3 reported, with traditional vulnerability management split into discrete, task-specific harnesses.
The team found that, with the right architecture and task design, many near-frontier and frontier models are similarly capable at scanning code — but that “human expertise is still the difference,” necessary to break tasks down and identify the wider context.
Triage proved vital because agents generate candidate findings faster than humans can validate them; careful upfront scoping and “structured internal filtering” improved focus and reduced costs.
The GC3 identified the next priority as integrating prioritization, review and patch-generation without “overwhelming human-centred processes.”
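
The triage lesson can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GC3's implementation: the `Candidate` fields, the confidence threshold and the `review_capacity` cap are all hypothetical, chosen to show how scoping, filtering and de-duplication can shrink a stream of candidate findings to what human reviewers can actually validate.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Candidate:
    repo: str
    path: str
    cwe: str
    severity: str
    confidence: float  # hypothetical model-reported confidence, 0.0 to 1.0


def triage(candidates, in_scope_repos, min_confidence=0.6, review_capacity=2):
    """Filter and rank candidates; return (review_queue, backlog)."""
    seen = set()
    kept = []
    for c in candidates:
        if c.repo not in in_scope_repos:   # upfront scoping
            continue
        if c.confidence < min_confidence:  # internal filtering
            continue
        key = (c.repo, c.path, c.cwe)
        if key in seen:                    # drop duplicates of the same issue
            continue
        seen.add(key)
        kept.append(c)
    # Most severe first, then most confident.
    kept.sort(key=lambda c: (SEVERITY_RANK[c.severity], c.confidence), reverse=True)
    return kept[:review_capacity], kept[review_capacity:]


# Made-up candidates for illustration.
candidates = [
    Candidate("payments", "auth.py", "CWE-287", "critical", 0.90),
    Candidate("payments", "auth.py", "CWE-287", "critical", 0.80),  # duplicate
    Candidate("payments", "api.py", "CWE-200", "high", 0.70),
    Candidate("intranet", "login.py", "CWE-287", "critical", 0.95),  # out of scope
    Candidate("payments", "util.py", "CWE-798", "medium", 0.40),     # low confidence
    Candidate("payments", "jobs.py", "CWE-94", "high", 0.65),
]

review_queue, backlog = triage(candidates, in_scope_repos={"payments"})
```

Capping the review queue at a fixed size is one simple way to keep automated discovery from outrunning the people who validate it; the remainder waits in a backlog instead of being discarded.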

What this means for technologists, policymakers and defenders

Technologists and security teams will see a clear case for combining traditional scanners with model-driven pipelines: teams in the hackathons used existing tools to surface leads and models to triage, link findings to standards such as OWASP and CWE, and assemble multi-step attack paths. Defenders inside government departments will likely continue to “prioritise validation and remediation through existing frameworks,” as the report documents, because model findings still required human validation before remediation. Policymakers and procurement leaders will need to monitor access to specific frontier models: the report notes it is unclear what impact a new US government export ban on Anthropic’s Mythos and Fable models, which the ban makes unavailable to non-American users, will have on continued hackathon-style initiatives.

The GC3’s sessions yielded evidence that these models can surface cross-service vulnerabilities and connect business logic to technical detail in ways that traditional scanners do not. The practical challenge the report leaves front and center is operational: how to fold automated discovery, human review, and safe, scalable patch generation into existing workflows without overwhelming the people who must validate and apply fixes.

For readers who want to consult the primary source, the GC3 report was published on June 21. The original story is here: https://www.infosecurity-magazine.com/news/uk-government-400-vulnerabilities/
