Why AI Benchmarks Lie: The Failure of Mathematical Logic

The illusion of artificial intelligence’s omnipotence in the exact sciences has hit a major roadblock. A consortium of 64 mathematicians, led by Carnegie Mellon University, EleutherAI, and Seoul National University, has introduced SOOHAK—a new benchmark that exposes a troubling reality: frontier models are catastrophically incapable of admitting defeat.

The primary stumbling block is the so-called 'Refusal set', a collection of problems that cannot be solved at all because they contain logical contradictions or omit necessary data. Instead of flagging these flaws, neural networks enthusiastically hallucinate, churning out polished but entirely nonsensical proofs.

According to the report, even Gemini 1.5 Pro, which managed a respectable 30% on research-level problems, failed miserably when faced with incorrect conditions. No existing system surpassed the 50% threshold for identifying errors within the questions themselves. The situation for the Qwen series appears even more dire; analysis by The Decoder suggests their performance in the Refusal category didn't even reach 3%. This data supports expert skepticism regarding AI's 'Olympiad' achievements. Successes at the IMO gold-medal level are increasingly seen as the result of training on standard patterns, a facade that crumbles when confronted with genuine logical chaos.
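
To make the Refusal figures concrete, here is a minimal Python sketch of how such a score could be tallied. It assumes the simplest possible rule: the share of ill-posed problems that a model flags rather than answers. The report's actual grading procedure is not described in this article, and `GradedItem`, `refusal_accuracy`, and the sample data are hypothetical names and values chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class GradedItem:
    # Hypothetical record for one graded benchmark problem.
    ill_posed: bool      # True if the problem is contradictory or missing data
    model_refused: bool  # True if the model flagged the problem instead of answering


def refusal_accuracy(items):
    """Share of ill-posed problems that the model correctly flagged (assumed rule)."""
    ill_posed = [it for it in items if it.ill_posed]
    if not ill_posed:
        raise ValueError("no ill-posed problems in the sample")
    return sum(it.model_refused for it in ill_posed) / len(ill_posed)


# Made-up sample: four ill-posed problems, one well-posed problem.
sample = [
    GradedItem(ill_posed=True, model_refused=False),
    GradedItem(ill_posed=True, model_refused=True),
    GradedItem(ill_posed=True, model_refused=False),
    GradedItem(ill_posed=True, model_refused=False),
    GradedItem(ill_posed=False, model_refused=False),
]

print(f"{refusal_accuracy(sample):.0%}")  # prints 25%
```

Under this assumed rule, a model that answers every ill-posed problem scores 0%, no matter how polished its proofs look.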

For business and engineering teams, this is a red flag. Models are being trained to provide an answer at any cost, completely ignoring methodological integrity. As the authors of SOOHAK note, increasing computational power only polishes the surface without addressing the fundamental flaw: AI does not understand the limits of its own competence. In critical sectors like aerospace engineering or biochemistry, this 'confident incompetence' is a ticking time bomb. The issue is compounded by the fact that open-source models like Kimi or GPT-OSS-based variants show even weaker results on unpublished materials, highlighting a shortage of high-quality data in niche disciplines.

Blindly trusting probabilistic models in high-precision industries today can be viewed as managerial negligence. Until benchmarks like SOOHAK force developers to implement deterministic verification systems, 'precision hallucinations' will continue to block the meaningful integration of AI into science and manufacturing. Without expert oversight and rigorous logical filters, any neural network remains nothing more than an expensive, eloquent generator of random formulas.

Source: The Decoder
