Understanding LLM Activations with Sparse Autoencoders

Mechanistic Interpretability

Overview

This project explores the internal workings of Large Language Models (LLMs), specifically Google's Gemma 3 (1B-IT version), using Sparse Autoencoders (SAEs). The primary goal was to disentangle the learned representations within a specific layer's MLP, namely the down_proj layer, which projects information processed by the MLP back into the residual stream. By training an SAE on activations from this layer, I aimed to identify features that are more interpretable and monosemantic than what the original model neurons represent. This page outlines the experimental setup, training methodology, and key findings.

Experimental Setup & Training Details

The journey to interpretable features involved several key stages, from model and data selection to SAE training and evaluation.

1. Model and Layer Selection

I chose google/gemma-3-1b-it as the target model due to its strong performance and relatively manageable size for experimentation. The focus was on the activations from the model.layers[25].mlp.down_proj layer. This layer is crucial as it's the output projection of the MLP block, returning processed information to the residual stream. Its activation dimension is 1152.
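For illustration, here is a minimal sketch of how the model can be loaded and the target layer located with Hugging Face transformers; the dtype choice is my assumption, and the exact attribute path may vary across transformers versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Output projection of the MLP block in layer 25: a Linear layer whose
# 1152-dimensional output feeds back into the residual stream.
target_layer = model.model.layers[25].mlp.down_proj
```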

2. Dataset and Activation Collection

The wikitext-103-raw-v1 dataset was used to source text for generating activations, chosen for its diverse and high-quality content.

  • Activation Storage: I collected 3,000,000 activation vectors. A forward hook was registered on the target layer to capture its output.
  • Processing: Texts were tokenized with a maximum sequence length of 128 tokens. Padding was applied to ensure uniform input shapes.
  • Filtering: Crucially, only activations corresponding to actual tokens (not padding, as indicated by the attention mask) were stored. Additionally, activations corresponding to the bos_token (beginning-of-sequence) were excluded to focus on content-specific features. This alignment ensures that each stored activation vector corresponds to a specific token in the input.
  • Normalization: The collected activations were normalized. With UNIT_NORM enabled (as in my experiment), each activation vector was mean-centered and then scaled to unit L2 norm. This helps stabilize SAE training and ensures features aren't learned based on activation magnitude alone. The mean of the activations (act_mean) was saved for later use during inference/patching, to replicate training conditions. The full collection pipeline is sketched after this list.
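The collection loop might look roughly like the following sketch, reusing the `tokenizer`, `model`, and `target_layer` from above. Here `text_batches` is a hypothetical iterable over WikiText text batches, and details such as device placement and batch sizing differ in the actual code:

```python
import torch

captured = []

def hook(module, inputs, output):
    # Move to CPU float32 so the buffer does not pin GPU/bf16 memory.
    captured.append(output.detach().float().cpu())

handle = target_layer.register_forward_hook(hook)

acts, token_ids = [], []
with torch.no_grad():
    for texts in text_batches:  # hypothetical iterable over WikiText chunks
        enc = tokenizer(texts, return_tensors="pt", padding=True,
                        truncation=True, max_length=128)
        captured.clear()
        model(**enc)
        batch_acts = captured[0]                            # (B, T, 1152)
        mask = enc["attention_mask"].bool()                 # drop padding
        mask &= enc["input_ids"] != tokenizer.bos_token_id  # drop BOS
        acts.append(batch_acts[mask])
        token_ids.append(enc["input_ids"][mask])            # keep alignment

handle.remove()
acts, token_ids = torch.cat(acts), torch.cat(token_ids)

# UNIT_NORM: mean-center, then scale each vector to unit L2 norm.
# act_mean is saved so inference/patching can replicate training conditions.
act_mean = acts.mean(dim=0)
acts = acts - act_mean
acts = acts / acts.norm(dim=-1, keepdim=True)
```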

3. SAE Configuration

The Sparse Autoencoder was designed with the following parameters:

  • Expansion Factor: An SAE_EXPANSION_FACTOR of 8 was used, resulting in an SAE hidden dimension of 9216 (1152 * 8). A larger hidden dimension allows the SAE to learn a richer set of disentangled features, but it makes training more expensive.
  • L1 Coefficient: I chose a value of 3e-4 to enforce sparsity in the SAE's hidden layer. This penalty encourages most hidden neurons to be zero for any given input, making the active ones more interpretable. The value was chosen after several experiments and is specific to my setup. The SAE architecture is sketched after this list.
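A minimal SAE parameterization consistent with these numbers is sketched below; the actual implementation may include details not shown here, such as tied weights, bias handling, or decoder-norm constraints:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=1152, expansion_factor=8):
        super().__init__()
        d_hidden = d_in * expansion_factor  # 9216 for this setup
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def encode(self, x):
        # ReLU keeps the code non-negative, so the L1 penalty can drive
        # most entries to exactly zero.
        return torch.relu(self.encoder(x))

    def forward(self, x):
        z = self.encode(x)
        return self.decoder(z), z
```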

4. Training Regimen

The SAE was trained using the Adam optimizer with the following parameters:

  • Batch Size: 512.
  • Learning Rate: An initial value of 3e-3.
  • Epochs: 300.
  • Learning Rate Schedule: A linear warmup for the first 500 steps, followed by a cosine decay down to 0.5 of the base LR over the remaining steps. This schedule helps in stabilizing training at the beginning and fine-tuning towards the end.
  • Loss Function: The loss was a combination of Mean Squared Error (MSE) between the original and reconstructed activations, and an L1 penalty on the SAE's encoded (hidden) activations. total_loss = mse_loss + l1_coeff * l1_loss_on_encoded_activations
  • Monitoring: Training progress was tracked using TensorBoard, logging total loss, MSE loss, L1 loss, and importantly, the L0 norm (average number of non-zero features per input) to monitor sparsity. A sketch of the training loop follows this list.
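Putting the pieces together, here is a sketch of a training loop under the stated hyperparameters, assuming the `SparseAutoencoder` and `acts` defined above. Shuffling and TensorBoard logging are omitted, and the exact loss reductions are my assumptions:

```python
import math
import torch

L1_COEFF, BASE_LR, WARMUP_STEPS = 3e-4, 3e-3, 500
EPOCHS, BATCH_SIZE = 300, 512

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=BASE_LR)
total_steps = EPOCHS * math.ceil(len(acts) / BATCH_SIZE)

def lr_lambda(step):
    # Linear warmup over the first 500 steps, then a cosine decay that
    # bottoms out at 0.5x the base learning rate.
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 + 0.25 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for epoch in range(EPOCHS):
    for i in range(0, len(acts), BATCH_SIZE):
        x = acts[i : i + BATCH_SIZE]
        recon, z = sae(x)
        mse_loss = ((recon - x) ** 2).mean()
        l1_loss = z.abs().sum(dim=-1).mean()   # reduction is an assumption
        loss = mse_loss + L1_COEFF * l1_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
        # L0 (avg non-zero features per input) was also tracked:
        # l0 = (z > 0).float().sum(dim=-1).mean()
```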

5. Evaluation Strategy

After training, the SAE's ability to represent activations sparsely and meaningfully was evaluated:

  • Feature Extraction: The trained SAE was used to encode the full set of collected (and normalized) activations to get their sparse feature representations.
  • Max Activating Examples: For top features (identified by their maximum activation strength across the dataset), I extracted text snippets (contexts) that caused these features to fire most strongly. This is the core of interpreting what each SAE feature might represent. A window of 10 tokens around the activating token was used.
  • Feature Statistics:
    • Liveliness: Calculated the number of "dead" features (those that never activate above a small threshold like 1e-6).
    • Strength Distribution: Visualized feature strengths (e.g., max activation per feature) using 2D heatmaps and 3D bar plots to understand the overall distribution and identify salient features.
  • Interpretability Analysis (Qualitative): The primary evaluation was qualitative, examining the top k features (ranked by activation strength) to identify semantic meaning. For example, does a feature consistently activate on specific grammatical structures, parts of speech, or semantic concepts (e.g., "dates," "locations," "negation")? A sketch of this analysis pipeline follows this list.
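The liveliness statistic and max-activating examples can be computed roughly as below, reusing `sae`, `acts`, and the `token_ids` stream collected earlier. In practice the encoding pass is batched (3M x 9216 codes do not fit in memory at once), and the flat indexing here ignores sequence boundaries, which the real code handles:

```python
import torch

with torch.no_grad():
    features = sae.encode(acts)            # (n_tokens, 9216) sparse codes

# Liveliness: features that never exceed a small threshold are "dead".
max_per_feature = features.max(dim=0).values
n_dead = (max_per_feature < 1e-6).sum().item()
print(f"dead features: {n_dead} / {features.shape[1]}")

# Average L0: how many features fire per token on average.
print("avg L0:", (features > 1e-6).float().sum(dim=-1).mean().item())

# Max-activating examples for one feature: decode a window of tokens
# around the positions where it fires most strongly.
feature_idx = 1177
top_positions = features[:, feature_idx].topk(5).indices
for pos in top_positions.tolist():
    lo, hi = max(0, pos - 5), pos + 5      # ~10-token context window
    print(tokenizer.decode(token_ids[lo:hi]))
```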

Results and Analysis

The training process yielded an SAE capable of reconstructing the original MLP activations while maintaining a high degree of sparsity in its own hidden layer.

The full experimental report, including detailed training curves, feature strength visualizations, and examples of maximally activating text snippets for top features, provides a comprehensive look at the outcomes; it can be downloaded at the end of this page. For example, the following image, taken from the report, shows the maximum activation for each feature in the SAE's latent space.

  • Feature Sparsity: A significant number of SAE features were "dead" or rarely active. For instance (report, page 6), 6788 features out of 9216 (73.7%) were dead, with an average of 1130.9 features active per sample. This means that, on average, ~12% of features are active for any given input.
  • Semantic Clustering: Several SAE features appeared to capture distinct semantic or syntactic concepts. For example:
    • Feature 1177 (Report Page 7): Frequently activates on tokens that appear to be person names or entities within a list or descriptive context (e.g., "Khan as Gabbar Singh", "Francisco de Solis Folch de Cardona").
    • Feature 292 (Report Page 10): Activates on the tokens "1" and "0" when they appear as the tens digit within a date.
    • Feature 2799 (Report Page 11): Activates on the token "of".
    • Feature 74 (Report Page 12): Activates on the token "s" following decades (e.g. 1930s, 1980s).
    • Feature 4263 (Report Page 13): Activates on number tokens indicating the day number in a date (e.g. October 8, March 6).
    • Feature 2493 (Report Page 20): Activates on the token "s" when used in the Saxon genitive.
    • Feature 1881 (Report Page 21): Activates on ordinal suffixes (e.g. "rd", "st", "eth", "th").
    • Feature 799 (Report Page 23): Activates on number tokens representing the units digit of a year (e.g. the "1" in 1931, the "4" in 2004).
    • Feature 334 (Report Page 25): Activates on the token "first".
    • Feature 3384 (Report Page 29): Activates on the token "also".
    • Feature 7786 (Report Page 31): Activates on the token "May".
    • Feature 604 (Report Page 32): Activates on the token "as".
    • Feature 974 (Report Page 34): Activates on the token "that".
    • Feature 8468 (Report Page 35): Activates on the token "not".
    • Feature 5691 (Report Page 37): Activates on the token "the".
    • Feature 4843 (Report Page 45): Activates on the token "only".
    • Feature 6357 (Report Page 46): Activates on the token "'" when used to indicate the genitive (e.g. the "'s" in "John's personal life").
    • Feature 201 (Report Page 47): Activates on the token "to".
    • Feature 3089 (Report Page 50): Activates on the tokens "a" and "an".
    • Feature 1715 (Report Page 55): Activates on tokens representing distance measures, specifically "km", "miles", and "mi".
    • Feature 2535 (Report Page 60): Activates on the token "later".
    • Feature 7770 (Report Page 63): Activates on the tokens "in" and "inch".
    • Feature 6897 (Report Page 72): Activates on tokens representing surnames (e.g. Weston, Ross, Smith).
    • Feature 5058 (Report Page 74): Activates on tokens related to the world wars (e.g. "war", "I", "II").
    • Feature 7154 (Report Page 78): Activates on the token "," when it follows the word "However".
    • Feature 5668 (Report Page 82): Activates on tokens expressing large quantities (e.g. "lot", "much").
    • Feature 2508 (Report Page 83): Activates on tokens derived from "die" (e.g. "died", "die", "dies").
    • Feature 3853 (Report Page 88): Activates on the token ")" when it follows a unit of measure.
    • Feature 8458 (Report Page 90): Activates on tokens representing third-person possessive adjectives and pronouns.
    • Feature 6097 (Report Page 94): Activates on tokens representing shorter distance measures than those of Feature 1715 (e.g. "feet", "meter").
    • Feature 5099 (Report Page 98): Activates on the token "was".

The ability to pinpoint such specific activating contexts demonstrates the utility of SAEs in making the internal representations of LLMs more transparent. Further work could involve using these features for model steering or fine-grained analysis of model behavior.

The generated PDF report with detailed plots and feature analysis can be downloaded directly:

Download Full Experiment Report (PDF)

GitHub Repository

View the code on GitHub