Arm’s KleidiAI has long promised to speed up models in popular edge frameworks without code changes, and that promise is now a reality. Integrated into XNNPACK, MediaPipe, MNN, ONNX Runtime, and even llama.cpp, KleidiAI delivers measurable performance gains the moment an application runs on a framework build that includes it. Developers no longer spend weeks on custom kernel tuning: acceleration works out of the box, models start faster, latency drops, and memory usage improves.
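To see what "no code changes" means in practice, consider an ordinary ONNX Runtime inference script. Nothing below mentions KleidiAI; on an Arm build of ONNX Runtime compiled with KleidiAI support, the default CPU execution provider dispatches to its micro-kernels internally. This is an illustrative sketch: `model.onnx` and its input are placeholders, not a specific model.

```python
import numpy as np
import onnxruntime as ort

# Plain ONNX Runtime usage with no KleidiAI-specific API calls.
# On Arm builds that bundle KleidiAI, the CPU execution provider
# picks the accelerated kernels automatically.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

inp = session.get_inputs()[0]                      # placeholder model I/O
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})
print(outputs[0].shape)
```

The same script runs unmodified on x86 or on an Arm core without the relevant extensions; the kernel selection happens entirely inside the runtime.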
The next step is the ExecuTorch 0.7 beta, where KleidiAI is enabled by default. That means any Android device built on the newest Arm CPUs, along with a massive pool of older phones, automatically receives the same optimizations. For companies this simplifies integration: instead of maintaining hand-tuned kernels for each CPU architecture, you simply update the SDK and get the benefit immediately.
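As a concrete illustration, here is a minimal export flow in the style of ExecuTorch's documented XNNPACK lowering, which is the path through which KleidiAI kernels are reached on Arm CPUs. The `TinyMLP` model is a stand-in, and exact module paths may differ between ExecuTorch releases.

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# A hypothetical toy model standing in for a real network.
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 64)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyMLP().eval()
example_inputs = (torch.randn(1, 128),)

# Export and lower to the XNNPACK backend. On Arm CPUs with the
# dot-product extension, XNNPACK routes matmuls to KleidiAI
# micro-kernels with no model-side changes.
program = to_edge_transform_and_lower(
    torch.export.export(model, example_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("tiny_mlp.pte", "wb") as f:
    f.write(program.buffer)
```

Note that the export code carries no device- or architecture-specific logic; the same .pte file runs on new and old Arm hardware, and the runtime decides which kernels to use.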
The most compelling aspect is the ability to move generative workloads from cloud GPU farms to local processors. The SDOT (signed dot product) instruction, part of the dot-product extension that entered the architecture with Armv8.2-A and has shipped in mobile cores since 2017, accelerates the int8 matrix multiplications that form the foundation of any large language model, including models quantized to int8 or lower precision. Arm estimates that roughly three billion devices already include this instruction, ranging from five-year-old smartphones to single-board computers like the Raspberry Pi 5.
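To make the speedup concrete, the arithmetic that a single vector SDOT instruction performs can be sketched in plain Python: sixteen int8 multiplies accumulated into four int32 lanes, work that would otherwise take many scalar instructions. The NumPy sketch below only models the instruction's semantics; it is not the instruction itself.

```python
import numpy as np

# SDOT semantics for one 128-bit vector: four independent 4-way
# int8 dot products, each accumulated into a 32-bit lane.
a = np.random.randint(-128, 128, size=16, dtype=np.int8)
b = np.random.randint(-128, 128, size=16, dtype=np.int8)
acc = np.zeros(4, dtype=np.int32)

for lane in range(4):
    chunk = slice(4 * lane, 4 * lane + 4)
    acc[lane] += np.dot(a[chunk].astype(np.int32), b[chunk].astype(np.int32))

print(acc)  # 16 multiply-accumulates folded into one instruction's worth of work
```

A matrix multiply is just many such dot products, which is why one instruction that folds sixteen multiply-accumulates pays off so directly for quantized LLM inference.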
For midsize firms this opens a viable alternative to expensive cloud solutions. Instead of paying for every GPU hour in a public cloud, companies can run part of their inference locally, cutting costs and shortening product response times. Reducing reliance on network connectivity also makes services more reliable in regions with poor internet.
Why this matters: CEOs can now offer AI‑enabled features without massive cloud spend, bringing products to market faster and reaching users on older devices. The scale is billions of potential endpoints, and the competitive edge comes from lower costs and faster responses through on‑device acceleration.