In February 2026 researchers from IBM Research and UC Berkeley released ITBench, a benchmark that directly addresses the concerns of SRE, security and FinOps teams. Using MAST (Multi‑Agent System Failure Taxonomy) they examined 310 execution traces of three popular large language models – Gemini‑3‑Flash, Kimi‑2 and GPT‑OSS‑120B. Rather than vague messages such as “something went wrong,” MAST produces structured failure signatures that pinpoint the responsible component.
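The idea of a structured failure signature can be made concrete with a small sketch. Note that the field names below are illustrative assumptions, not MAST's actual schema; only the taxonomy codes (e.g. FM-3.3) come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureSignature:
    """Illustrative MAST-style failure record.

    Field names are assumptions for the sake of the example,
    not the benchmark's real data model."""
    mode: str    # taxonomy code, e.g. "FM-3.3"
    label: str   # human-readable name, e.g. "Incorrect Verification"
    agent: str   # component/agent the failure is attributed to
    step: int    # position in the execution trace where it occurred

# A record like this is what replaces "something went wrong":
sig = FailureSignature(mode="FM-3.3", label="Incorrect Verification",
                       agent="verifier", step=17)
```

Because each record names a mode, an agent, and a trace step, failures can be aggregated and compared across runs instead of being read one log at a time.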

Gemini‑3‑Flash fails an average of 2.6 times per trace, almost always with error FM‑3.3 (Incorrect Verification): the agent declares success without actually checking system state. GPT‑OSS‑120B displays a classic cascade pattern: roughly 5.3 errors per run, triggered when a single faulty logic branch contaminates the context and sets off a chain of hallucinations. Kimi‑2, even without a PR mask, shows a 46 % rise in premature terminations and a 43 % increase in uncertainty around termination conditions, clearly indicating missing external termination controllers and loop detectors.

For businesses this translates into four practical actions. First, results must be validated externally; the model should never self‑evaluate without an independent tool that provides hard proof of outcome. Second, implement termination mechanisms and repeat‑call detectors by embedding finite‑state machines that eliminate FM‑1.5 (Termination Issues). Third, resolve ambiguities at the first branching point in the agent’s graph by clarifying inputs immediately, thereby reducing exposure to FM‑2.2. Fourth, enable a “clarify‑or‑read‑only” mode for smaller models to prevent errors caused by incomplete task comprehension.
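The first two actions above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the step limit, repeat threshold, and the names `TerminationController` and `externally_verified` are all assumptions introduced here.

```python
from collections import Counter

class TerminationController:
    """Guards an agent loop against FM-1.5-style failures:
    runaway loops and the same tool call repeated over and over."""

    def __init__(self, max_steps=20, max_repeats=3):
        self.max_steps = max_steps      # hard upper bound on loop iterations
        self.max_repeats = max_repeats  # identical-call threshold
        self.steps = 0
        self.call_counts = Counter()

    def allow(self, tool_name, args):
        """Return True if the agent may execute this call, False to force a stop."""
        self.steps += 1
        if self.steps > self.max_steps:
            return False                # runaway loop: cut it off
        key = (tool_name, repr(sorted(args.items())))
        self.call_counts[key] += 1
        if self.call_counts[key] > self.max_repeats:
            return False                # identical call repeated: likely stuck
        return True

def externally_verified(claimed_ok, check):
    """Action 1: never trust the agent's self-report alone.

    `check` is an independent probe (e.g. an HTTP health check or a
    read back of actual system state) that must confirm the outcome."""
    return bool(claimed_ok) and bool(check())
```

In practice `allow` would wrap every tool invocation in the agent loop, and `externally_verified` would gate any "task complete" claim, so that success is only recorded when an independent check agrees.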

Why does this matter now? Monitoring failure KPIs such as FM‑3.3, FM‑1.5 and premature‑termination rates can shave up to 20 % off debugging costs, while regular audits of MAST signatures boost automation reliability across SRE, security and FinOps. Turning agent decisions from a source of risk into a measurable competitive advantage is achievable today.
