A summary of our conversation on understanding and building sparse autoencoders (SAEs) for LLM interpretability.
Neural networks like GPT are powerful but opaque. We'd like to understand them by looking at individual neurons, but there's a problem: single neurons don't correspond to single concepts. One neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text all at once.
This is called superposition: the model crams in more concepts than it has neurons by using combinations of neurons to represent each one.
An SAE takes the model's internal activations (a dense vector) and transforms them into a sparse representation — one where most values are zero, and only a handful are active.
Example:
- Input: A 12,288-dimensional vector that's dense and incomprehensible
- Output: A 49,152-dimensional vector where only ~20-100 values are non-zero
Each non-zero position ideally corresponds to one interpretable concept (like "Golden Retriever" or "Golden Gate Bridge").
By forcing the representation to be sparse, the SAE is pushed to find meaningful, separable features rather than smearing concepts together.
Run lots of diverse text through your language model (like GPT) and save the internal activations at a specific location. For example, collect the 12,288-dimensional vectors between layers 26 and 27.
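For concreteness, here is a hedged sketch of collecting activations with TransformerLens (installed later in these notes); the model, layer, and hook point are illustrative choices, not prescriptions:

```python
import torch
from transformer_lens import HookedTransformer

# Illustrative setup: GPT-2 small, whose residual stream is 768-dimensional.
model = HookedTransformer.from_pretrained("gpt2")

texts = ["The quick brown fox jumps over the lazy dog."]  # use a large, diverse corpus in practice
activations = []
with torch.no_grad():
    for text in texts:
        _, cache = model.run_with_cache(text)
        # Residual stream after block 8; pick whichever location you want to study.
        acts = cache["blocks.8.hook_resid_post"]  # shape: (batch, seq_len, 768)
        activations.append(acts.reshape(-1, acts.shape[-1]))

activation_dataset = torch.cat(activations)  # (total_tokens, 768)
```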
The SAE is simple — just two matrices with a ReLU in between:
Input activation → Encoder matrix → ReLU → Decoder matrix → Reconstructed activation
For a model with 12,288 dimensions and 4x expansion:
- Encoder matrix: 12,288 × 49,152
- Intermediate representation: 49,152 dimensions (but sparse!)
- Decoder matrix: 49,152 × 12,288
The loss combines two things:
- Reconstruction loss (L2): How well does the output match the input? (squared error)
- Sparsity penalty (L1): Sum of absolute values in the intermediate representation
Total Loss = Reconstruction Error + (L1 coefficient × Sparsity Penalty)
The L1 coefficient is a knob you tune — higher means sparser but worse reconstruction.
Feed batches of collected activations through the SAE, compute loss, backpropagate, update weights. Repeat.
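A minimal sketch of that loop, assuming the `SparseAutoEncoder` class and `calculate_loss` function from the code below plus the `activation_dataset` tensor from the collection sketch above (the learning rate, L1 coefficient, and batch size mirror the GPT-2 Small recommendations later in these notes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

sae = SparseAutoEncoder(activation_dim=768, dict_size=4 * 768)
optimizer = torch.optim.Adam(sae.parameters(), lr=4e-4)
loader = DataLoader(TensorDataset(activation_dataset), batch_size=4096, shuffle=True)

for epoch in range(10):  # illustrative; in practice train until the loss plateaus
    for (batch,) in loader:
        loss = calculate_loss(sae, batch, l1_coefficient=8e-5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```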
After training, look at what inputs cause each feature to activate. If feature 317 consistently fires on text about dogs, you've found a "dog" feature. The corresponding decoder vector is that feature's direction in the model's internal space.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoEncoder(nn.Module):
    """Two matrices with a ReLU in between."""

    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.activation_dim = activation_dim
        self.dict_size = dict_size
        self.encoder = nn.Linear(activation_dim, dict_size, bias=True)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=True)

    def encode(self, x):
        # ReLU zeroes negative pre-activations, producing sparse features.
        return F.relu(self.encoder(x))

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        z = self.encode(x)
        x_reconstructed = self.decode(z)
        return x_reconstructed, z


def calculate_loss(autoencoder, activations, l1_coefficient):
    reconstructed, encoded = autoencoder(activations)
    # L2 reconstruction loss: squared error between input and output
    l2_loss = (reconstructed - activations).pow(2).sum(dim=-1).mean()
    # L1 sparsity penalty: sum of absolute feature activations
    l1_loss = l1_coefficient * encoded.abs().sum(dim=-1).mean()
    return l2_loss + l1_loss
```

This is what you have! The notebook walks through:
- Building toy models of superposition first
- Understanding feature geometry
- Building SAEs from scratch
- Training on real models
Links:
- Exercises: https://colab.research.google.com/drive/1fg1kCFsG0FCyaK4d5ejEsot4mOVhsIFH
- Solutions: https://colab.research.google.com/drive/1rPy82rL3iZzy2_Rd3F82RwFhlVnnroIh
`pip install sae-lens transformer-lens`
- Training tutorial: https://github.com/jbloomAus/SAELens/blob/main/tutorials/training_a_sparse_autoencoder.ipynb
- Recommended hyperparameters for GPT-2 Small: LR 4e-4, L1 coefficient 8e-5, batch size 4096
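SAELens also ships pre-trained SAEs. A hedged sketch of loading one (the release and SAE IDs below are examples, and the exact signature may differ across sae-lens versions):

```python
from sae_lens import SAE

# Release/sae_id values are examples; browse available releases in the SAELens docs.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",       # residual-stream SAEs for GPT-2 Small
    sae_id="blocks.8.hook_resid_pre",  # layer 8 residual stream
)
print(sae.cfg.d_in, sae.cfg.d_sae)     # input dimension and dictionary size
```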
`pip install git+https://github.com/openai/sparse_autoencoder.git`
Pre-trained SAEs for GPT-2 small with training code and feature visualizer.
- HuggingFace: jacobcd52/gpt2-small-sparse-autoencoders (trained on 1B tokens)
- Neuronpedia: https://neuronpedia.org/gpt2-small — browse features interactively
`python -m sparsify gpt2 --hookpoints "h.*.attn" "h.*.mlp.act"`
Uses TopK activation to directly control sparsity.
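For intuition, TopK replaces the ReLU-plus-L1 recipe: keep only the k largest pre-activations per input and zero out the rest, so L0 sparsity is fixed by construction instead of tuned via a penalty. A minimal sketch of the activation function itself (not the library's actual implementation):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest values in each row, zero everything else."""
    values, indices = pre_acts.topk(k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    sparse.scatter_(-1, indices, values)
    return sparse

z = topk_activation(torch.randn(4, 49152), k=32)
assert (z != 0).sum(dim=-1).max() <= 32  # at most k active features, no L1 knob needed
```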
Before building SAEs, you first build the toy model that creates superposition (a minimal sketch follows this list):
- Simple model: 5 features → 2 hidden dims → reconstruct 5 features
- Watch features arrange geometrically (pentagons!) as sparsity increases
- Explore correlated/anticorrelated features
- Dimensionality metric: measures what fraction of a dimension each feature gets
- Discover surprising geometric structures (tetrahedrons, pentagons)
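A minimal sketch of that toy model, following the ReLU-output setup from Anthropic's Toy Models of Superposition paper (the dimensions and sparsity scheme are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    """Compress 5 sparse features into 2 hidden dims, then reconstruct all 5."""

    def __init__(self, n_features: int = 5, n_hidden: int = 2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W                        # down-project: (batch, n_hidden)
        return F.relu(h @ self.W.T + self.b)  # reconstruct: (batch, n_features)

# Sparse synthetic data: each feature is present only with small probability.
feature_prob = 0.05
x = torch.rand(1024, 5) * (torch.rand(1024, 5) < feature_prob)
```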
- Define architecture: encoder matrix, decoder matrix, biases
- Forward pass: z = ReLU(W_enc(h - b_dec) + b_enc), then h' = W_dec·z + b_dec
- Loss function: L1 (sparsity) + L2 (reconstruction)
- Train on toy model's hidden activations
- Implement neuron resampling to fix dead neurons (sketched after this list)
- Visualize learned features recovering the original structure
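A hedged sketch of the resampling step: count how often each feature fires over a batch of activations and reinitialize the weights of features that never do. The random reinitialization here is a simplification; the notebook's version resamples toward inputs the SAE reconstructs poorly.

```python
import torch

@torch.no_grad()
def resample_dead_neurons(sae, activations):
    """Reinitialize encoder/decoder weights for features that never fire."""
    _, encoded = sae(activations)
    fire_counts = (encoded > 0).sum(dim=0)   # firings per feature over the batch
    dead = torch.where(fire_counts == 0)[0]  # indices of dead features
    for idx in dead:
        # Simplified random reinit (Anthropic's scheme targets high-loss inputs).
        sae.encoder.weight[idx] = torch.randn(sae.activation_dim) * 0.01
        sae.encoder.bias[idx] = 0.0
        sae.decoder.weight[:, idx] = torch.randn(sae.activation_dim) * 0.01
    return dead
```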
Load a pre-trained SAE on GPT-2 (GELU-1L) and:
- Measure feature sparsity
- Find highest-activating tokens for specific neurons
- Analyze what tokens features boost/suppress
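As a sketch, finding the highest-activating tokens for a single feature might look like the following; it reuses the `model` and `sae` objects from the earlier sketches for concreteness, and the feature index and text are illustrative:

```python
import torch

feature_idx = 317  # the hypothetical "dog" feature from earlier
text = "The golden retriever chased the ball across the park."

with torch.no_grad():
    _, cache = model.run_with_cache(text)
    acts = cache["blocks.8.hook_resid_pre"]             # (1, seq_len, d_model)
    feature_acts = sae.encode(acts)[0, :, feature_idx]  # one activation per token

str_tokens = model.to_str_tokens(text)
top = feature_acts.topk(5)
for value, pos in zip(top.values, top.indices):
    print(f"{str_tokens[pos.item()]!r}: {value.item():.3f}")
```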
| Term | Definition |
|---|---|
| Superposition | When a model represents more features than it has dimensions |
| Polysemanticity | When one neuron corresponds to multiple features |
| Privileged basis | When standard basis vectors (individual neurons) are meaningful |
| L0 | Average number of nonzero elements in SAE's encoded representation |
| Loss Recovered | How much of the model's original performance is preserved when its activations are replaced with SAE reconstructions |
| Dead neurons | SAE features that never activate — need resampling |
| Feature importance | How useful a feature is for achieving lower loss |
| Feature sparsity | How frequently a feature appears in input data |
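Two of these metrics fall out directly from an encoded batch; a sketch, reusing the trained `sae` and `activation_dataset` from the sketches above:

```python
import torch

with torch.no_grad():
    _, encoded = sae(activation_dataset)

# L0: average number of active features per input
l0 = (encoded > 0).float().sum(dim=-1).mean()
# Feature sparsity: fraction of inputs on which each feature fires
feature_sparsity = (encoded > 0).float().mean(dim=0)
print(f"L0 = {l0.item():.1f}")
```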
- Read first: Adam Karvonen's blog post (intuitive explanation)
- Work through: Callum McDougall's notebook (exercises version)
- Get hands-on: Train your own SAE with SAELens
- Explore: Browse features on Neuronpedia
- Toy Models of Superposition (Anthropic)
- Towards Monosemanticity (Anthropic)
- Scaling and Evaluating Sparse Autoencoders (OpenAI)
- Open the exercises Colab (not solutions)
- Work through Parts 1-3 (toy models of superposition)
- Build the SAE in Part 5
- Implement neuron resampling
- Run on real GPT-2 model in Part 6
- Find some fun interpretable neurons!
Good luck! 🧠