Researchers at Carnegie Mellon University have introduced ExploitBench, a benchmark suggesting that AI can no longer be seen merely as an advanced conversationalist. Its results indicate that AI agents can now autonomously identify and exploit critical vulnerabilities in V8, the JavaScript engine behind Google Chrome, Microsoft Edge, and Cloudflare Workers.
According to the study, Anthropic's Claude Mythos Preview demonstrates skills on par with a seasoned cybersecurity expert. It successfully executed Remote Code Execution (RCE)—the most severe form of system compromise—in 21 out of 41 tested cases.
The performance gap between the market leaders is wide. OpenAI’s GPT-5.5 showed modest results, managing only two successful exploits and an average score of 5.51 out of 16, while Claude Mythos averaged 9.55 out of 16 in fully autonomous mode. As study co-author Seung-Hyun Lee noted, Mythos reproduced the CVE-2024-0519 vulnerability, which had baffled human experts for over a year. The model developed attack vectors previously considered too complex to implement, which suggests that AI’s offensive potential is outpacing our defensive protocols.
For now, the only real deterrent is the price tag. Running the test suite on Claude Mythos cost $36,428, ten times the cost of running it on GPT-5.5 via Codex. The UK AI Safety Institute describes Mythos as a superior but prohibitively expensive "digital weapon." For businesses, this means the threat of autonomous hacking is currently constrained only by API costs and compute availability.
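To put the figures above in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted in this article. The GPT-5.5 cost is not stated directly and is inferred here from the "ten times" comparison, so treat it and everything derived from it as an approximation. The function and variable names are illustrative.

```python
# Figures quoted in the article.
MYTHOS_SUITE_COST_USD = 36_428
MYTHOS_RCE_SUCCESSES = 21
TOTAL_CASES = 41
MAX_SCORE = 16
MYTHOS_AVG_SCORE = 9.55
GPT_AVG_SCORE = 5.51
GPT_SUCCESSES = 2

# Assumption: the article only says Mythos cost "ten times" as much,
# so the GPT-5.5 suite cost is an estimate, not a reported number.
GPT_SUITE_COST_USD_ESTIMATE = MYTHOS_SUITE_COST_USD / 10


def success_rate(successes: int, total: int) -> float:
    """Fraction of test cases that ended in a successful exploit."""
    return successes / total


def cost_per_success(suite_cost: float, successes: int) -> float:
    """Average spend per successful exploit across the whole suite."""
    return suite_cost / successes


mythos_rate = success_rate(MYTHOS_RCE_SUCCESSES, TOTAL_CASES)
mythos_cost_per_rce = cost_per_success(MYTHOS_SUITE_COST_USD, MYTHOS_RCE_SUCCESSES)
gpt_cost_per_exploit = cost_per_success(GPT_SUITE_COST_USD_ESTIMATE, GPT_SUCCESSES)

print(f"Mythos RCE success rate: {mythos_rate:.1%}")
print(f"Mythos cost per successful RCE: ${mythos_cost_per_rce:,.2f}")
print(f"GPT-5.5 cost per successful exploit (estimated): ${gpt_cost_per_exploit:,.2f}")
print(f"Mythos average score: {MYTHOS_AVG_SCORE / MAX_SCORE:.1%} of maximum")
print(f"GPT-5.5 average score: {GPT_AVG_SCORE / MAX_SCORE:.1%} of maximum")
```

Under these assumptions, Mythos succeeds in roughly half of the cases at well under $2,000 per working exploit. The per-exploit cost for GPT-5.5 comes out in a similar range only because its estimated total bill is spread over just two successes.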
We are entering an era where AI agents weaponize known security flaws faster and more creatively than entire security departments. Traditional patch management is no longer enough; any infrastructure integrated with agents must be treated as a live target for automated attacks. The high cost of these exploits is your only advantage, and it is temporary. Use this window to implement deterministic security layers and harden your systems before the cost of compute drops and turns AI-driven cybercrime into a mass-market commodity.