Sparse Autoencoders (SAEs) Study Guide

A summary of our conversation on understanding and building SAEs for LLM interpretability.


What Problem Do SAEs Solve?

Neural networks like GPT are powerful but opaque. We'd like to understand them by looking at individual neurons, but there's a problem: single neurons don't correspond to single concepts. One neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text all at once.

This is called superposition — the model crams more concepts than it has neurons by using combinations of neurons to represent things.


What SAEs Do

An SAE takes the model's internal activations (a dense vector) and transforms them into a sparse representation — one where most values are zero, and only a handful are active.

Example:

  • Input: A 12,288-dimensional vector that's dense and incomprehensible
  • Output: A 49,152-dimensional vector where only ~20-100 values are non-zero

Each non-zero position ideally corresponds to one interpretable concept (like "Golden Retriever" or "Golden Gate Bridge").

Why Sparsity Helps

By forcing the representation to be sparse, the SAE is pushed to find meaningful, separable features rather than smearing concepts together.


Step-by-Step: How to Build a Sparse Autoencoder

Step 1: Collect Training Data

Run lots of diverse text through your language model (like GPT) and save the internal activations at a specific location. For example, collect the 12,288-dimensional vectors between layers 26 and 27.
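
To make this concrete, here is a minimal sketch using a forward hook on HuggingFace GPT-2 small (768-dimensional activations rather than the 12,288 in the example above); the block index and texts are placeholders you would swap for your own:

import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

captured = []

def save_activations(module, inputs, output):
    # output[0] holds the block's hidden states: (batch, seq_len, d_model)
    captured.append(output[0].detach().reshape(-1, output[0].shape[-1]))

# Hook the output of one transformer block (block 6 of GPT-2 small's 12)
handle = model.h[6].register_forward_hook(save_activations)

texts = ["The quick brown fox jumps over the lazy dog.",
         "Sparse autoencoders decompose activations into features."]
with torch.no_grad():
    for text in texts:
        model(**tokenizer(text, return_tensors="pt"))

handle.remove()
activations = torch.cat(captured)  # (num_tokens, 768)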

Step 2: Build the Architecture

The SAE is simple — just two matrices with a ReLU in between:

Input activation → Encoder matrix → ReLU → Decoder matrix → Reconstructed activation

For a model with 12,288 dimensions and 4x expansion:

  • Encoder matrix: 12,288 × 49,152
  • Intermediate representation: 49,152 dimensions (but sparse!)
  • Decoder matrix: 49,152 × 12,288
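
A quick sanity check on those sizes: the two matrices alone come to roughly 1.2 billion parameters, so the SAE is a sizeable model in its own right.

import torch.nn as nn

encoder = nn.Linear(12_288, 49_152, bias=True)
decoder = nn.Linear(49_152, 12_288, bias=True)
n_params = sum(p.numel() for p in [*encoder.parameters(), *decoder.parameters()])
print(f"{n_params:,}")  # ~1.21 billion parameters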

Step 3: Define the Loss Function

Combine two things:

  1. Reconstruction loss (L2): How well does the output match the input? (squared error)
  2. Sparsity penalty (L1): Sum of absolute values in the intermediate representation
Total Loss = Reconstruction Error + (L1 coefficient × Sparsity Penalty)

The L1 coefficient is a knob you tune — higher means sparser but worse reconstruction.

Step 4: Train with Standard Gradient Descent

Feed batches of collected activations through the SAE, compute loss, backpropagate, update weights. Repeat.

Step 5: Interpret the Features

After training, look at what inputs cause each feature to activate. If feature 317 consistently fires on text about dogs, you've found a "dog" feature. The corresponding decoder vector is that feature's direction in the model's internal space.
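
A sketch of that lookup, assuming you kept the collected activations, a matching list of token strings, and a trained SAE like the SparseAutoEncoder in the reference implementation below (feature 317 is just an example index):

feature_idx = 317
with torch.no_grad():
    _, encoded = sae(activations)                # (num_tokens, dict_size)
top_vals, top_positions = encoded[:, feature_idx].topk(10)
for val, i in zip(top_vals.tolist(), top_positions.tolist()):
    print(f"{val:8.3f}  {tokens[i]!r}")

# The feature's direction in the model's activation space
feature_direction = sae.decoder.weight[:, feature_idx]  # (activation_dim,)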


Reference Implementation

import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.activation_dim = activation_dim
        self.dict_size = dict_size
        
        # Encoder maps activation_dim -> dict_size; decoder maps back down
        self.encoder = nn.Linear(activation_dim, dict_size, bias=True)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=True)

    def encode(self, x):
        # ReLU keeps feature activations non-negative (and, together with
        # the L1 penalty, sparse)
        return torch.relu(self.encoder(x))
    
    def decode(self, z):
        return self.decoder(z)
    
    def forward(self, x):
        z = self.encode(x)
        x_reconstructed = self.decode(z)
        return x_reconstructed, z

def calculate_loss(autoencoder, activations, l1_coefficient):
    reconstructed, encoded = autoencoder(activations)
    
    # L2 reconstruction loss
    l2_loss = (reconstructed - activations).pow(2).sum(dim=-1).mean()
    
    # L1 sparsity loss
    l1_loss = l1_coefficient * encoded.abs().sum(dim=-1).mean()
    
    return l2_loss + l1_loss
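
Putting Steps 2-4 together, a minimal training loop for the classes above might look like this (the batch size, learning rate, and L1 coefficient are illustrative values you would tune):

def train_sae(activations, activation_dim, dict_size,
              l1_coefficient=1e-3, lr=1e-4, epochs=1, batch_size=4096):
    sae = SparseAutoEncoder(activation_dim, dict_size)
    optimizer = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(activations.shape[0])  # reshuffle each epoch
        for start in range(0, activations.shape[0], batch_size):
            batch = activations[perm[start:start + batch_size]]
            loss = calculate_loss(sae, batch, l1_coefficient)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return sae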

Best Tutorials & Resources

1. ARENA Exercises by Callum McDougall (Most Comprehensive)

This is what you have! The notebook walks through:

  • Building toy models of superposition first
  • Understanding feature geometry
  • Building SAEs from scratch
  • Training on real models

Links:

2. SAELens Library (Easiest to Get Running)

pip install sae-lens transformer-lens
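
Loading one of the community pre-trained GPT-2 SAEs looks roughly like this; treat the release/ID strings and the return signature as assumptions that may differ between sae-lens versions, and check the SAELens docs:

from sae_lens import SAE

sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",        # a set of residual-stream SAEs
    sae_id="blocks.8.hook_resid_pre",   # which layer / hook point
    device="cpu",
)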

3. OpenAI's Sparse Autoencoder Repo

pip install git+https://github.com/openai/sparse_autoencoder.git

Pre-trained SAEs for GPT-2 small with training code and feature visualizer.

4. Pre-trained GPT-2 SAEs to Explore

5. EleutherAI's Sparsify (Simplest CLI)

python -m sparsify gpt2 --hookpoints "h.*.attn" "h.*.mlp.act"

Uses TopK activation to directly control sparsity.
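
The TopK idea in isolation (a generic sketch, not Sparsify's actual code; k is the sparsity hyperparameter):

import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 32) -> torch.Tensor:
    # Keep only the k largest pre-activations per sample; zero out the rest.
    vals, idx = pre_acts.topk(k, dim=-1)
    return torch.zeros_like(pre_acts).scatter(-1, idx, torch.relu(vals))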


The Callum McDougall Notebook Structure

Parts 1-3: Understanding Superposition First

Before building SAEs, you build the toy model that creates superposition:

  • Simple model: 5 features → 2 hidden dims → reconstruct 5 features
  • Watch features arrange geometrically (pentagons!) as sparsity increases
  • Explore correlated/anticorrelated features

Part 4: Feature Geometry

  • Dimensionality metric: measures what fraction of a dimension each feature gets
  • Discover surprising geometric structures (tetrahedrons, pentagons)

Part 5: Building Your SAE (Core Section)

  1. Define architecture: encoder matrix, decoder matrix, biases
  2. Forward pass: z = ReLU(W_enc(h - b_dec) + b_enc), then h' = W_dec·z + b_dec (sketched in code after this list)
  3. Loss function: L1 (sparsity) + L2 (reconstruction)
  4. Train on toy model's hidden activations
  5. Implement neuron resampling to fix dead neurons
  6. Visualize learned features recovering the original structure
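
That forward pass, sketched in PyTorch (reusing the torch/nn imports from the reference implementation above; note the decoder bias is subtracted from the input before encoding, unlike the simpler reference class, and the initialization here is purely illustrative):

class ArenaStyleSAE(nn.Module):
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(activation_dim, dict_size) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(dict_size))
        self.W_dec = nn.Parameter(torch.randn(dict_size, activation_dim) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(activation_dim))

    def forward(self, h):
        z = torch.relu((h - self.b_dec) @ self.W_enc + self.b_enc)  # encode
        h_reconstructed = z @ self.W_dec + self.b_dec               # decode
        return h_reconstructed, z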

Part 6: SAEs on Real Models

Load a pre-trained SAE on a real model (the 1-layer GELU-1L transformer) and:

  • Measure feature sparsity
  • Find highest-activating tokens for specific neurons
  • Analyze what tokens features boost/suppress

Key Concepts to Remember

| Term | Definition |
| --- | --- |
| Superposition | When a model represents more features than it has dimensions |
| Polysemanticity | When one neuron corresponds to multiple features |
| Privileged basis | When standard basis vectors (individual neurons) are meaningful |
| L0 | Average number of nonzero elements in the SAE's encoded representation |
| Loss recovered | Fraction of the model's performance retained when its activations are replaced by SAE reconstructions (relative to zero-ablating them) |
| Dead neurons | SAE features that never activate; fixed by resampling |
| Feature importance | How useful a feature is for achieving lower loss |
| Feature sparsity | How frequently a feature is active across the input data |
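
Two of these metrics as quick one-liners, assuming z is an encoded batch of shape (num_tokens, dict_size):

l0 = (z > 0).float().sum(dim=-1).mean()          # avg active features per token
feature_sparsity = (z > 0).float().mean(dim=0)   # per-feature firing frequency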

Recommended Learning Path

  1. Read first: Adam Karvonen's blog post (intuitive explanation)
  2. Work through: Callum McDougall's notebook (exercises version)
  3. Get hands-on: Train your own SAE with SAELens
  4. Explore: Browse features on Neuronpedia

Papers to Read


Morning Checklist

  • Open the exercises Colab (not solutions)
  • Work through Parts 1-3 (toy models of superposition)
  • Build the SAE in Part 5
  • Implement neuron resampling
  • Run on real GPT-2 model in Part 6
  • Find some fun interpretable neurons!

Good luck! 🧠
