“While the industry is rightfully excited about the potential of Mythos-class tools, unguided algorithms are inherently prone to returning even more false positives and costly false negatives than the automated scanners we have today,” said Andrew Obadiaru, CISO of Cobalt.
Cobalt’s survey: trust in fully automated AI scanning collapses
The Cobalt State of Pentesting Report 2026, based on two comparative surveys in 2025 and 2026 of around 450 cybersecurity professionals, documents a dramatic collapse in confidence in fully automated AI vulnerability testing. The percentage of organizations relying entirely on AI automation for testing dropped from 29% to 9% over the period. Nearly half (47%) of respondents now prefer a hybrid testing model in which human expertise supports AI tools.
That shift is striking for its speed: the share preferring hybrid testing surged 22 percentage points in one year, while the percentage of organizations using automation for low-risk environments rose by 22 points to 47%.
Missed critical findings and the ‘validation gap’
Cobalt reports that over three-quarters (78%) of respondents said fully automated scanning tools missed critical vulnerabilities. The company frames those missed detections as part of a broader “validation gap” between what unguided algorithms can discover and what elite human testers can validate and remediate.
Andrew Obadiaru further warned that “LLM vulnerabilities are deeply context-dependent and invisible to tools that lack an architectural understanding of the application,” arguing that automation should be “deployed exactly where it excels, but elite human expertise remains foundational to uncovering and remediating the most complex business logic risks.”
AI/LLM issues are more severe and slower to resolve
The report finds AI-focused findings are both more frequently high-risk and harder to remediate. Nearly one in three findings from an AI pentest is rated high risk, 2.7 times the average for conventional software, according to the report. At the time of analysis, fewer than two-fifths (38%) of LLM vulnerabilities had been fixed, while 62% remained open, the lowest resolution rate of any asset class covered.
Mean time to resolve (MTTR) for AI/LLM security issues increased from 19 days to 36 days over the survey period, which Cobalt said shows teams are tracking “significantly harder vulnerabilities” than before.
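For readers who want to track the same two metrics internally, the following is a minimal sketch of how MTTR and resolution rate could be computed from a list of findings. The record layout, the sample dates, and the function names `mttr_days` and `resolution_rate` are hypothetical illustrations; the report does not describe Cobalt's own calculation method, and this sketch assumes MTTR is averaged over resolved findings only.

```python
from datetime import date
from statistics import mean

# Hypothetical finding records: (date opened, date resolved or None if still open).
# These dates are made up for illustration and are not taken from the report.
findings = [
    (date(2026, 1, 5), date(2026, 2, 10)),
    (date(2026, 1, 12), date(2026, 2, 3)),
    (date(2026, 2, 1), None),
    (date(2026, 2, 14), date(2026, 4, 5)),
    (date(2026, 3, 1), None),
]


def mttr_days(records):
    """Mean number of days from open to resolution, over resolved findings only."""
    durations = [
        (resolved - opened).days
        for opened, resolved in records
        if resolved is not None
    ]
    # With no resolved findings there is nothing to average.
    return mean(durations) if durations else None


def resolution_rate(records):
    """Fraction of all findings that have been resolved."""
    if not records:
        return 0.0
    resolved_count = sum(1 for _, resolved in records if resolved is not None)
    return resolved_count / len(records)


print(f"MTTR: {mttr_days(findings)} days")
print(f"Resolution rate: {resolution_rate(findings):.0%}")
```

Open findings are excluded from the average here, so a large backlog of unresolved issues can make MTTR look better than the real situation; reading it alongside the resolution rate, as the report does, avoids that blind spot. For scale, the reported rise from 19 to 36 days is an increase of roughly 89%.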
Top AI incident vectors: shadow AI, poisoning, and output handling
Among organizations experiencing AI-related incidents, shadow AI was the most commonly reported vector (44%). Data or model poisoning and improper output handling were each reported by 41% of respondents, followed by supply chain vulnerabilities (35%) and prompt injection (34%). The figures sum to more than 100% because respondents could report more than one vector.
These distributions underline the range of operational and adversarial risks organizations confront as they adopt AI systems: from unsanctioned tools in business units to tampering and misuse of model outputs.
What this means for security teams, procurement leaders, and red teams
- Security teams: 60% of security professionals said they need stronger LLM testing capabilities, but only 42% plan to increase human-led red team operations — a gap that suggests many teams see capability shortfalls without committing immediately to the human effort required to close them.
- Procurement and enterprise leaders: Organizations are moving to hybrid models and confining automation to lower-risk environments (automation for low-risk environments rose to 47%), signaling a more cautious procurement posture when buying AI scanning tools.
- Red teams and testers: The report’s finding that AI pentest results are more likely to be high-risk and take longer to remediate highlights a demand signal for experts who combine architectural understanding with adversarial testing skills.
The record in Cobalt’s report is clear: confidence in unguided automation has been eroded by a torrent of missed critical findings and stubborn LLM weaknesses. Organizations are shifting toward hybrid workflows and limiting automation to lower-risk areas, even as many acknowledge a need for stronger LLM testing capabilities but stop short of expanding human-led red teaming. The data leaves a pointed question for security leaders: if automation alone is failing to find the riskiest problems and many teams will not yet scale human testing, who will close the validation gap?




