How SAEs Decode Literary Style in Llama and Gemma

Modern neural networks are shedding their 'black box' reputation, at least when it comes to literary style and emotional nuance. Researchers João Paulo Cavalcante Preza and Sávio Salvarino Teles de Oliveira from the Federal University of Goiás (UFG) report that what readers perceive as 'textual magic' can be traced to a set of discrete computational units. Using Sparse Autoencoders (SAEs), the team decomposed the residual streams of Llama 3.1 8B and Gemma 2 9B and isolated 'literary primitives': specific features responsible for metaphor, defamiliarization, and even the classic 'show, don't tell' principle.
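For readers who have not met SAEs before, the sketch below shows the basic idea: a residual-stream vector is encoded into a wide, mostly-zero vector of feature activations and then decoded back. It is a minimal illustration with a plain ReLU encoder, random toy weights, and made-up sizes; the SAEs used in the study may use a different activation function and are far larger. All names (`sae_encode`, `sae_decode`, `W_enc`, `W_dec`) are assumptions for illustration only.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Map a residual-stream vector to non-negative feature activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct the residual-stream vector from feature activations."""
    return f @ W_dec + b_dec

# Toy sizes (assumed): d_model = 4, n_features = 8.
rng = np.random.default_rng(0)
d_model, n_features = 4, 8
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.full(n_features, -1.0)  # negative bias pushes more features to zero
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)            # stand-in for one token's residual vector
f = sae_encode(x, W_enc, b_enc, b_dec)  # feature activations, one per feature
x_hat = sae_decode(f, W_dec, b_dec)     # approximate reconstruction of x

print("active features:", np.flatnonzero(f))
```

Each row of `W_dec` is the direction in the residual stream that one feature writes to, which is what makes a feature usable as a handle on the model.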

The study’s core finding is that stylistic techniques are encoded within these models not as vague statistical shadows, but as separable features that can serve as levers. This changes the workflow: instead of spending hours on prompt engineering to coax a model into 'writing like Hemingway,' it becomes possible to intervene directly in the model's internal activations, a technique known as steering. In effect, we are moving from polite requests to surgical control over model behavior, applied to activations at inference time while the weights stay unchanged.
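Steering can be sketched in a few lines: a scaled feature direction is added to the residual stream at every token position. The snippet below is a toy illustration on a NumPy array rather than a real model; the function name `steer`, the strength `alpha`, and the example direction are assumptions, and the study's exact intervention may differ.

```python
import numpy as np

def steer(resid, direction, alpha):
    """Add a scaled unit-length feature direction to every token position."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

# Toy residual stream: 3 tokens, d_model = 4 (assumed sizes).
resid = np.zeros((3, 4))
direction = np.array([3.0, 0.0, 4.0, 0.0])  # hypothetical feature direction

steered = steer(resid, direction, alpha=5.0)
print(steered[0])  # approximately [3, 0, 4, 0]
```

In a real model the same addition would typically be applied inside a forward hook on one layer, and `alpha` would need tuning: too small has no visible effect, too large degrades the text.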

During the architectural deconstruction, the researchers identified four classes of features. These include a curious 'Eleven I’s' cluster that defines first-person register and specific style modulators. The two models also express emotion differently. Llama 3.1 8B is straightforward, activating 'naming-gates' that explicitly call out a desired effect. Gemma 2 9B is more subtle, evoking emotion by describing imagery and surroundings. When tested against the Cowen-Keltner taxonomy of 27 emotions, Llama achieved 100% coverage through combinations of feature 'recipes.' Gemma covered 23 categories, stumbling notably on 'adoration.' This compositional nature supports a growing view: complex sentiment in a model's output can be built as a combination of basic features, not magic.
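One way to picture a feature 'recipe' is as a weighted sum of feature directions that yields a single steering vector. The sketch below assumes that reading; the feature indices, weights, and the identity decoder matrix are hypothetical and chosen only so the arithmetic is easy to follow. The article does not specify how the study actually combines features.

```python
import numpy as np

def recipe_vector(recipe, W_dec):
    """Sum weighted decoder directions into one combined vector."""
    vec = np.zeros(W_dec.shape[1])
    for feature_id, weight in recipe.items():
        vec += weight * W_dec[feature_id]
    return vec

# Toy decoder: 4 features, d_model = 4, each feature on its own axis.
W_dec = np.eye(4)

# Hypothetical recipe: boost feature 0, suppress feature 2.
recipe = {0: 2.0, 2: -1.0}

vec = recipe_vector(recipe, W_dec)
print(vec)  # [ 2.  0. -1.  0.]
```

Negative weights suppress a feature rather than amplify it, so a recipe can both add and remove ingredients.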

The UFG methodology is robust, featuring a three-stage validation process that included dictionary projection via logit lens and feature purity checks by a panel of five LLM judges. The researchers also uncovered a 'developer's shadow': a specific feature heavily weighted during Reinforcement Learning from Human Feedback (RLHF). This feature is responsible for the bland 'helpful assistant' persona that, when overloaded, starts generating forced emotional content. While finding these features now takes just 15 minutes on a single GPU, whether the findings scale to much larger models such as Llama 3.1 405B remains to be seen. Nevertheless, the foundation is set: interpretability is beginning to give businesses control tools that could replace the 'shamanism' of prompt engineering.
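The logit-lens step can be illustrated as follows: a feature's direction is multiplied by the model's unembedding matrix, and the tokens with the highest resulting scores hint at what the feature promotes. This is a simplified sketch with a tiny invented vocabulary and matrix; it omits the final normalization layer that a real model would apply, and the names `logit_lens_top_tokens` and `W_U` are assumptions.

```python
import numpy as np

def logit_lens_top_tokens(direction, W_U, vocab, k=3):
    """Project a feature direction onto the vocabulary and return the top-k tokens."""
    logits = direction @ W_U             # one score per vocabulary token
    top = np.argsort(logits)[::-1][:k]   # indices of the k highest scores
    return [vocab[i] for i in top]

# Toy setup (assumed): d_model = 2, vocabulary of 5 tokens.
vocab = ["sea", "like", "tax", "moon", "I"]
W_U = np.array([
    [2.0, 0.5, -1.0, 1.5, 0.0],
    [0.0, 1.0,  0.0, 0.0, 3.0],
])
direction = np.array([1.0, 0.0])  # hypothetical feature direction

top = logit_lens_top_tokens(direction, W_U, vocab, k=2)
print(top)  # ['sea', 'moon']
```

A feature whose top tokens form a coherent group is easier to label, which is why a projection like this works as a first sanity check before any judging step.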

Source: arXiv cs.AI
