
@Kadajett
Last active March 3, 2026 13:38

1. What Muon Is

Muon (MomentUm Orthogonalized by Newton–Schulz) is a matrix-aware optimizer for neural networks that:

  • Keeps momentum
  • Orthogonalizes that momentum for 2D weight matrices
  • Uses the orthogonalized matrix as the update direction

It is designed specifically for matrix-structured parameters (like linear layer weights in transformers).


2. Core Idea in One Sentence

Instead of scaling each parameter independently, Muon reshapes the entire weight update matrix so its directions are balanced and orthogonal.


3. Why This Matters

In deep networks, especially transformers:

  • Linear layers are 2D matrices
  • Momentum updates tend to become low-rank
  • A few dominant directions drive learning
  • Rare directions get ignored

Muon forces the update matrix to be well-conditioned, ensuring:

  • No single direction dominates
  • Rare but important directions are amplified
  • Learning uses the full parameter space

4. The Mechanism

For a 2D weight matrix $W$:

Step 1 — Compute gradient

$$G = \frac{\partial L}{\partial W}$$

Step 2 — Update momentum

$$M_t = \beta M_{t-1} + (1-\beta)G$$

Step 3 — Normalize

$$M_t \leftarrow \frac{M_t}{\|M_t\|_F}$$

Step 4 — Orthogonalize (Newton–Schulz iterations)

Instead of computing an exact SVD, apply the iteration (starting from $X_0 = M_t$):

$$X_{k+1} = aX_k + bX_k(X_k^T X_k) + cX_k(X_k^T X_k)^2$$

Repeated ~5 times.

This pushes singular values toward 1.

Result: $O \approx$ the nearest orthogonal matrix to $M$.
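Because the iteration is an odd polynomial in $X$, it acts on each singular value independently: every singular value $\sigma$ of the normalized momentum is mapped through $p(\sigma) = a\sigma + b\sigma^3 + c\sigma^5$ at each step. A scalar sketch, using tuned coefficients commonly cited for Muon (an assumption here; other coefficient choices exist), shows small and large singular values both landing in a band around 1:

```python
# Per-singular-value view of Newton-Schulz: each singular value sigma
# of the normalized momentum is pushed toward 1 by iterating an odd
# quintic polynomial. Coefficients are tuned values commonly cited for
# Muon (an assumption of this sketch).
a, b, c = 3.4445, -4.7750, 2.0315

def ns_scalar(sigma, steps=5):
    """Apply the Newton-Schulz singular-value map `steps` times."""
    for _ in range(steps):
        sigma = a * sigma + b * sigma**3 + c * sigma**5
    return sigma

# Singular values of very different sizes all end up near 1:
for s in (0.05, 0.2, 0.5, 0.9):
    print(f"{s:.2f} -> {ns_scalar(s):.3f}")
```

Note that the iteration only *approximately* orthogonalizes: values oscillate in a band around 1 rather than converging exactly, which is the trade made for using a few cheap matmuls instead of an SVD.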

Step 5 — Update weights

$$W \leftarrow W - \eta O$$
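The five steps above can be sketched end to end in NumPy. The Newton–Schulz coefficients and the transpose trick for tall matrices follow commonly published Muon implementations, but the hyperparameters here (learning rate, beta, 5 iterations) are illustrative assumptions, not prescribed values:

```python
import numpy as np

def newton_schulz(M, steps=5, eps=1e-7):
    """Approximately orthogonalize M (Steps 3-4).

    Coefficients are tuned values commonly cited for Muon
    (an assumption of this sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)   # Step 3: Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):              # Step 4: ~5 quintic iterations
        S = X @ X.T
        # Equivalent to a*X + b*X(X^T X) + c*X(X^T X)^2
        X = a * X + (b * S + c * (S @ S)) @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, beta=0.95):
    """One Muon update for a 2D weight W given gradient G and momentum M."""
    M = beta * M + (1 - beta) * G       # Step 2: momentum
    O = newton_schulz(M)                # Steps 3-4: normalize + orthogonalize
    W = W - lr * O                      # Step 5: weight update
    return W, M
```

Usage is the same as any stateful optimizer step: carry the momentum buffer `M` across iterations and call `muon_step(W, grad, M)` once per batch for each 2D weight.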


5. What Makes Muon Different

It is:

  • Matrix-structured
  • Geometry-aware
  • Low-rank correcting
  • Momentum-based
  • GPU-friendly (matmuls only)
  • Memory efficient (one momentum buffer, versus Adam's two state buffers)

It is NOT:

  • Element-wise adaptive
  • Variance-scaling based
  • Diagonal preconditioned
  • SVD-based (the orthogonalization is only ever approximated, never computed exactly)

6. What Problem It Targets

A. Low-Rank Momentum Collapse

Momentum updates often collapse to a few dominant directions.
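This collapse can be measured with the stable rank, $\|M\|_F^2 / \sigma_{\max}^2$: a minimal sketch (the rank-1 gradient structure and noise scale are illustrative assumptions) shows that momentum built from gradients sharing one dominant direction stays effectively rank one even though the matrix is full size:

```python
import numpy as np

# Demonstrate low-rank momentum collapse: gradients that share a single
# dominant direction drive the momentum's stable rank
# (||M||_F^2 / sigma_max^2) toward 1.
rng = np.random.default_rng(0)
u = rng.normal(size=(64, 1))
v = rng.normal(size=(1, 64))

M = np.zeros((64, 64))
for _ in range(100):
    G = u @ v + 0.05 * rng.normal(size=(64, 64))  # dominant rank-1 part + noise
    M = 0.95 * M + 0.05 * G                        # standard momentum

s = np.linalg.svd(M, compute_uv=False)
stable_rank = (s**2).sum() / s[0]**2
print(stable_rank)  # close to 1: one direction carries almost all the energy
```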

B. Transformer Instability

Orthogonalization improves conditioning of attention and MLP blocks.

C. Memory Efficiency

No second-moment buffer required.


7. Where It Works Best

  • Transformers
  • Language models
  • MLP-heavy architectures
  • Mid-sized models (where geometry matters but compute budget is tight)

8. Where It Is Not Ideal

  • 1D parameters (biases, norms)
  • Extremely small matrices
  • Extremely high aspect ratio matrices (embeddings, output heads)

Those parameter groups usually fall back to an element-wise optimizer (AdamW in common setups).
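In practice this means partitioning parameters by shape before building the optimizer. A minimal routing sketch, where the shape thresholds and parameter names are illustrative assumptions rather than values from any particular implementation:

```python
# Route 2D "well-shaped" matrices to Muon and everything else
# (biases, norms, embeddings, output heads) to a fallback like AdamW.
# The thresholds below are illustrative assumptions.
def use_muon(shape, max_aspect=8, min_dim=16):
    if len(shape) != 2:                        # 1D params: biases, norm scales
        return False
    rows, cols = shape
    if min(rows, cols) < min_dim:              # extremely small matrices
        return False
    if max(rows, cols) / min(rows, cols) > max_aspect:
        return False                           # embeddings / output heads
    return True

# Hypothetical parameter shapes for a small transformer:
params = {
    "attn.q_proj": (512, 512),
    "mlp.bias": (2048,),
    "embed.weight": (50257, 512),
    "ln.scale": (512,),
}
routing = {name: ("muon" if use_muon(s) else "adamw")
           for name, s in params.items()}
```

Here only `attn.q_proj` is routed to Muon; the bias, norm scale, and high-aspect-ratio embedding all go to the fallback.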


9. Conceptual Framing

You can think of Muon as:

  • Projecting momentum onto the orthogonal manifold
  • A cheap approximation of “optimal direction balancing”
  • A way to keep updates from collapsing into narrow subspaces

It is closer to:

  • Spectral conditioning
  • Manifold optimization
  • Geometry-aware learning

than to:

  • Adaptive scalar learning rate methods

10. Computational Cost

Per 2D weight matrix:

  • ~5 Newton–Schulz iterations (a few small matrix multiplications each)
  • No eigen-decomposition
  • No SVD

Efficient on GPUs.


11. Mental Model

Adam asks:

How big should each coordinate step be?

Muon asks:

What is the best balanced matrix update direction?

Adam scales components.

Muon reshapes directions.
