Grokking Transformers: How Weight Decay Drives Generalization

For years, training transformers on modular arithmetic tasks felt like peering into a black box: models would suddenly leap from blind memorization to logical generalization. This phenomenon, known as 'grokking,' has finally been stripped of its mystery. New research by Lucky Verma proves that Weight Decay (WD) is far more than a secondary regularizer used to smooth out weights—it is a fundamental control lever. It defines the thin line between mindless rote learning, developmental progress, and total system collapse.

Testing models ranging from 0.82 million to 85 million parameters, Verma identified a critical threshold at λc = 0.0158. Fall below this value, and your model will remain forever trapped in memorization mode with zero chance of understanding data structures. Rise above it, and you effectively force the architecture into a developmental phase where the probability of grokking hits 100%. More importantly, the timing of this 'epiphany' is now mathematically predictable, following a power law with an empirical exponent of ν = 0.757. Instead of guesswork, engineers now have a formula to calculate resource requirements.

Monitoring these transitions has historically been prohibitively expensive due to the cost of calculating the loss landscape. The study offers an elegant, budget-friendly solution: two online diagnostic methods based on attention activation analysis. By tracking the cosine similarity of attention heads and the standard deviation of entropy, teams can monitor learning dynamics in real-time. Within a λ range of 0.1 to 2.0, the time to reach grokking drops by an order of magnitude—from 1,090 down to just 83 epochs. However, excessive rigors are counterproductive: at λ = 10, the system collapses as attention patterns become identical, turning the model into a useless data array. Notably, this mechanism is universal; the Mamba architecture shows a critical threshold of λc = 0.0144, nearly identical to transformers.

For the business world, this signals the end of intuitive tuning and the start of rigorous engineering. By using lightweight metrics, developers can determine early on whether a specialized model is learning to extract logic or simply burning GPU hours memorizing the training set. In an industry where training costs are scaling exponentially, the ability to 'force' understanding through WD calibration becomes a critical factor for project survival. You either manage the phase transition, or your compute budget becomes a sunk cost.

Source: arXiv cs.AI →

Rate this material

★ ★ ★ ★ ★

Machine LearningNeural NetworksFine-tuningCost ReductionMamba

Beyond Luck: Controlling the 'Grokking' Point in Transformer Training