A summary of our conversation on understanding and building sparse autoencoders (SAEs) for LLM interpretability.
Neural networks like GPT are powerful but opaque. We'd like to understand them by looking at individual neurons, but there's a problem: single neurons don't correspond to single concepts. One neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text all at once.
This is called superposition: the model crams in more concepts than it has neurons by using combinations of neurons to represent each one.
An SAE takes the model's internal activations (a dense vector) and transforms them into a sparse representation — one where most values are zero, and only a handful are active.
Example:
- Input: A 12,288-dimensional vector that's dense and incomprehensible
- Output: A 49,152-dimensional vector where only ~20-100 values are non-zero
Each non-zero position ideally corresponds to one interpretable concept (like "Golden Retriever" or "Golden Gate Bridge").
By forcing the representation to be sparse, the SAE is pushed to find meaningful, separable features rather than smearing concepts together.
Run lots of diverse text through your language model (like GPT) and save the internal activations at a specific location. For example, collect the 12,288-dimensional vectors between layers 26 and 27.
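For concreteness, here is a hedged sketch of collecting activations with TransformerLens (installed later in these notes); the model, layer, and hook point are illustrative choices, not prescriptions:

```python
import torch
from transformer_lens import HookedTransformer

# Illustrative setup: GPT-2 small, whose residual stream is 768-dimensional.
model = HookedTransformer.from_pretrained("gpt2")

texts = ["The quick brown fox jumps over the lazy dog."]  # use a large, diverse corpus in practice
activations = []
with torch.no_grad():
    for text in texts:
        _, cache = model.run_with_cache(text)
        # Residual stream after block 8; pick whichever location you want to study.
        acts = cache["blocks.8.hook_resid_post"]  # shape: (batch, seq_len, 768)
        activations.append(acts.reshape(-1, acts.shape[-1]))

activation_dataset = torch.cat(activations)  # (total_tokens, 768)
```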
The SAE is simple — just two matrices with a ReLU in between:
Input activation → Encoder matrix → ReLU → Decoder matrix → Reconstructed activation
For a model with 12,288 dimensions and 4x expansion:
- Encoder matrix: 12,288 × 49,152
- Intermediate representation: 49,152 dimensions (but sparse!)
- Decoder matrix: 49,152 × 12,288
The loss combines two things:
- Reconstruction loss (L2): How well does the output match the input? (squared error)
- Sparsity penalty (L1): Sum of absolute values in the intermediate representation
Total Loss = Reconstruction Error + (L1 coefficient × Sparsity Penalty)
The L1 coefficient is a knob you tune — higher means sparser but worse reconstruction.
Feed batches of collected activations through the SAE, compute loss, backpropagate, update weights. Repeat.
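A minimal sketch of that loop, assuming the `SparseAutoEncoder` class and `calculate_loss` function from the code below plus the `activation_dataset` tensor from the collection sketch above (the learning rate, L1 coefficient, and batch size mirror the GPT-2 Small recommendations later in these notes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

sae = SparseAutoEncoder(activation_dim=768, dict_size=4 * 768)
optimizer = torch.optim.Adam(sae.parameters(), lr=4e-4)
loader = DataLoader(TensorDataset(activation_dataset), batch_size=4096, shuffle=True)

for epoch in range(10):  # illustrative; in practice train until the loss plateaus
    for (batch,) in loader:
        loss = calculate_loss(sae, batch, l1_coefficient=8e-5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```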
After training, look at what inputs cause each feature to activate. If feature 317 consistently fires on text about dogs, you've found a "dog" feature. The corresponding decoder vector is that feature's direction in the model's internal space.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoEncoder(nn.Module):
    """Two matrices with a ReLU in between."""

    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.activation_dim = activation_dim
        self.dict_size = dict_size
        self.encoder = nn.Linear(activation_dim, dict_size, bias=True)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=True)

    def encode(self, x):
        # ReLU zeroes negative pre-activations, producing sparse features.
        return F.relu(self.encoder(x))

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        z = self.encode(x)
        x_reconstructed = self.decode(z)
        return x_reconstructed, z


def calculate_loss(autoencoder, activations, l1_coefficient):
    reconstructed, encoded = autoencoder(activations)
    # L2 reconstruction loss: squared error between input and output
    l2_loss = (reconstructed - activations).pow(2).sum(dim=-1).mean()
    # L1 sparsity penalty: sum of absolute feature activations
    l1_loss = l1_coefficient * encoded.abs().sum(dim=-1).mean()
    return l2_loss + l1_loss
```

This is what you have! The notebook walks through:
- Building toy models of superposition first
- Understanding feature geometry
- Building SAEs from scratch
- Training on real models
Links:
- Exercises: https://colab.research.google.com/drive/1fg1kCFsG0FCyaK4d5ejEsot4mOVhsIFH
- Solutions: https://colab.research.google.com/drive/1rPy82rL3iZzy2_Rd3F82RwFhlVnnroIh
`pip install sae-lens transformer-lens`
- Training tutorial: https://github.com/jbloomAus/SAELens/blob/main/tutorials/training_a_sparse_autoencoder.ipynb
- Recommended hyperparameters for GPT-2 Small: LR 4e-4, L1 coefficient 8e-5, batch size 4096
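SAELens also ships pre-trained SAEs. A hedged sketch of loading one (the release and SAE IDs below are examples, and the exact signature may differ across sae-lens versions):

```python
from sae_lens import SAE

# Release/sae_id values are examples; browse available releases in the SAELens docs.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",       # residual-stream SAEs for GPT-2 Small
    sae_id="blocks.8.hook_resid_pre",  # layer 8 residual stream
)
print(sae.cfg.d_in, sae.cfg.d_sae)     # input dimension and dictionary size
```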
`pip install git+https://github.com/openai/sparse_autoencoder.git`
Pre-trained SAEs for GPT-2 small with training code and feature visualizer.
- HuggingFace: jacobcd52/gpt2-small-sparse-autoencoders (trained on 1B tokens)
- Neuronpedia: https://neuronpedia.org/gpt2-small — browse features interactively
`python -m sparsify gpt2 --hookpoints "h.*.attn" "h.*.mlp.act"`
Uses TopK activation to directly control sparsity.
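For intuition, TopK replaces the ReLU-plus-L1 recipe: keep only the k largest pre-activations per input and zero out the rest, so L0 sparsity is fixed by construction instead of tuned via a penalty. A minimal sketch of the activation function itself (not the library's actual implementation):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest values in each row, zero everything else."""
    values, indices = pre_acts.topk(k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    sparse.scatter_(-1, indices, values)
    return sparse

z = topk_activation(torch.randn(4, 49152), k=32)
assert (z != 0).sum(dim=-1).max() <= 32  # at most k active features, no L1 knob needed
```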
Before building SAEs, you first build the toy model that creates superposition (a minimal sketch follows this list):
- Simple model: 5 features → 2 hidden dims → reconstruct 5 features
- Watch features arrange geometrically (pentagons!) as sparsity increases
- Explore correlated/anticorrelated features
- Dimensionality metric: measures what fraction of a dimension each feature gets
- Discover surprising geometric structures (tetrahedrons, pentagons)
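A minimal sketch of that toy model, following the ReLU-output setup from Anthropic's Toy Models of Superposition paper (the dimensions and sparsity scheme are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    """Compress 5 sparse features into 2 hidden dims, then reconstruct all 5."""

    def __init__(self, n_features: int = 5, n_hidden: int = 2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W                        # down-project: (batch, n_hidden)
        return F.relu(h @ self.W.T + self.b)  # reconstruct: (batch, n_features)

# Sparse synthetic data: each feature is present only with small probability.
feature_prob = 0.05
x = torch.rand(1024, 5) * (torch.rand(1024, 5) < feature_prob)
```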
- Define architecture: encoder matrix, decoder matrix, biases
- Forward pass: z = ReLU(W_enc(h - b_dec) + b_enc), then h' = W_dec·z + b_dec
- Loss function: L1 (sparsity) + L2 (reconstruction)
- Train on toy model's hidden activations
- Implement neuron resampling to fix dead neurons (sketched after this list)
- Visualize learned features recovering the original structure
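A hedged sketch of the resampling step: count how often each feature fires over a batch of activations and reinitialize the weights of features that never do. The random reinitialization here is a simplification; the notebook's version resamples toward inputs the SAE reconstructs poorly.

```python
import torch

@torch.no_grad()
def resample_dead_neurons(sae, activations):
    """Reinitialize encoder/decoder weights for features that never fire."""
    _, encoded = sae(activations)
    fire_counts = (encoded > 0).sum(dim=0)   # firings per feature over the batch
    dead = torch.where(fire_counts == 0)[0]  # indices of dead features
    for idx in dead:
        # Simplified random reinit (Anthropic's scheme targets high-loss inputs).
        sae.encoder.weight[idx] = torch.randn(sae.activation_dim) * 0.01
        sae.encoder.bias[idx] = 0.0
        sae.decoder.weight[:, idx] = torch.randn(sae.activation_dim) * 0.01
    return dead
```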
Load a pre-trained SAE on GPT-2 (GELU-1L) and:
- Measure feature sparsity
- Find highest-activating tokens for specific neurons
- Analyze what tokens features boost/suppress
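As a sketch, finding the highest-activating tokens for a single feature might look like the following; it reuses the `model` and `sae` objects from the earlier sketches for concreteness, and the feature index and text are illustrative:

```python
import torch

feature_idx = 317  # the hypothetical "dog" feature from earlier
text = "The golden retriever chased the ball across the park."

with torch.no_grad():
    _, cache = model.run_with_cache(text)
    acts = cache["blocks.8.hook_resid_pre"]             # (1, seq_len, d_model)
    feature_acts = sae.encode(acts)[0, :, feature_idx]  # one activation per token

str_tokens = model.to_str_tokens(text)
top = feature_acts.topk(5)
for value, pos in zip(top.values, top.indices):
    print(f"{str_tokens[pos.item()]!r}: {value.item():.3f}")
```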
| Term | Definition |
|---|---|
| Superposition | When a model represents more features than it has dimensions |
| Polysemanticity | When one neuron corresponds to multiple features |
| Privileged basis | When standard basis vectors (individual neurons) are meaningful |
| L0 | Average number of nonzero elements in SAE's encoded representation |
| Loss Recovered | How much of the model's original performance is preserved when its activations are replaced with SAE reconstructions |
| Dead neurons | SAE features that never activate — need resampling |
| Feature importance | How useful a feature is for achieving lower loss |
| Feature sparsity | How frequently a feature appears in input data |
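Two of these metrics fall out directly from an encoded batch; a sketch, reusing the trained `sae` and `activation_dataset` from the sketches above:

```python
import torch

with torch.no_grad():
    _, encoded = sae(activation_dataset)

# L0: average number of active features per input
l0 = (encoded > 0).float().sum(dim=-1).mean()
# Feature sparsity: fraction of inputs on which each feature fires
feature_sparsity = (encoded > 0).float().mean(dim=0)
print(f"L0 = {l0.item():.1f}")
```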
- Read first: Adam Karvonen's blog post (intuitive explanation)
- Work through: Callum McDougall's notebook (exercises version)
- Get hands-on: Train your own SAE with SAELens
- Explore: Browse features on Neuronpedia
- Toy Models of Superposition (Anthropic)
- Towards Monosemanticity (Anthropic)
- Scaling and Evaluating Sparse Autoencoders (OpenAI)
- Open the exercises Colab (not solutions)
- Work through Parts 1-3 (toy models of superposition)
- Build the SAE in Part 5
- Implement neuron resampling
- Run on real GPT-2 model in Part 6
- Find some fun interpretable neurons!
Good luck! 🧠