Muon (MomentUm Orthogonalized by Newton–Schulz) is a matrix-aware optimizer for neural networks that:
- Keeps momentum
- Orthogonalizes that momentum for 2D weight matrices
- Uses the orthogonalized matrix as the update direction
It is designed specifically for matrix-structured parameters (like linear layer weights in transformers).
Instead of scaling each parameter independently, Muon reshapes the entire weight update matrix so its directions are balanced and orthogonal.
In deep networks, especially transformers:
- Linear layers are 2D matrices
- Momentum updates tend to become low-rank
- A few dominant directions drive learning
- Rare directions get ignored
Muon forces the update matrix to be well-conditioned, ensuring:
- No single direction dominates
- Rare but important directions are amplified
- Learning uses the full parameter space
For a 2D weight matrix ( W ), take the gradient
[ G = \frac{\partial L}{\partial W} ]
accumulate momentum
[ M_t = \beta M_{t-1} + (1-\beta)G ]
and normalize by the Frobenius norm so every singular value is at most 1:
[ M_t \leftarrow \frac{M_t}{\|M_t\|_F} ]
Instead of computing the exact orthogonalization via SVD, apply the Newton–Schulz iteration:
[ X_{k+1} = aX_k + bX_k(X_k^T X_k) + cX_k(X_k^T X_k)^2 ]
Repeated ~5 times.
This pushes singular values toward 1.
Result: [ O \approx UV^\top ], the nearest semi-orthogonal matrix to ( M ) (where ( M = U\Sigma V^\top ) is its SVD).
[ W \leftarrow W - \eta O ]
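The full update above can be sketched in NumPy. This is a minimal, illustrative version: `beta`, `eta`, and the small epsilon are hypothetical hyperparameters, and `a`, `b`, `c` are one published choice of quintic Newton–Schulz coefficients.

```python
import numpy as np

def muon_step(W, G, M, beta=0.95, eta=0.02, ns_steps=5,
              a=3.4445, b=-4.7750, c=2.0315):
    """One Muon update for a 2D weight matrix W with gradient G.

    M is the momentum buffer (same shape as W). The a, b, c values
    are one published choice of quintic Newton-Schulz coefficients;
    beta and eta are illustrative, not canonical.
    """
    # 1. Momentum accumulation.
    M = beta * M + (1 - beta) * G
    # 2. Frobenius normalization so all singular values lie in (0, 1].
    X = M / (np.linalg.norm(M) + 1e-7)
    # 3. Newton-Schulz iteration: pushes singular values toward 1
    #    using only matrix multiplications (no SVD, no eigendecomp).
    for _ in range(ns_steps):
        A = X.T @ X
        X = a * X + b * (X @ A) + c * (X @ A @ A)
    # 4. Apply the (approximately) orthogonalized direction.
    W = W - eta * X
    return W, M
```

After a step, the applied direction `(W_old - W_new) / eta` has singular values clustered near 1, whereas the raw momentum typically has a skewed spectrum.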
Muon is:
- Matrix-structured
- Geometry-aware
- Low-rank correcting
- Momentum-based
- GPU-friendly (matmuls only)
- Memory-efficient (1× extra state for momentum, versus Adam's 2×)

Muon is not:
- Element-wise adaptive
- Variance-scaling based
- Diagonally preconditioned
- Explicitly SVD-based (the orthogonalization is only approximated)
Why this helps:
- Momentum updates often collapse to a few dominant directions.
- Orthogonalization improves conditioning of attention and MLP blocks.
- No second-moment buffer is required.
Muon works well for:
- Transformers
- Language models
- MLP-heavy architectures
- Mid-sized models (where geometry matters but compute budget is tight)

It is typically not applied to:
- 1D parameters (biases, norms)
- Extremely small matrices
- Extremely high-aspect-ratio matrices (embeddings, output heads)

These parameters usually fall back to a separate optimizer such as AdamW.
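A routing rule along these lines can be sketched in plain Python. The parameter names, shapes, and the `"embed"`/`"head"` name heuristic are all invented for illustration:

```python
def partition_params(named_shapes):
    """Split parameters into Muon-eligible matrices and fallback ones.

    named_shapes: dict mapping parameter name -> shape tuple.
    2D hidden-layer matrices go to Muon; 1D parameters (biases,
    norms) and embedding/output-head matrices go to a fallback
    optimizer such as AdamW. The name check is a toy heuristic.
    """
    muon, fallback = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding_like = "embed" in name or "head" in name
        if is_matrix and not is_embedding_like:
            muon.append(name)
        else:
            fallback.append(name)
    return muon, fallback

params = {"attn.q_proj": (512, 512),
          "mlp.bias": (2048,),
          "tok_embed": (50257, 512)}
print(partition_params(params))
# → (['attn.q_proj'], ['mlp.bias', 'tok_embed'])
```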
You can think of Muon as:
- Projecting momentum onto the orthogonal manifold
- A cheap approximation of “optimal direction balancing”
- A way to keep updates from collapsing into narrow subspaces
It is closer to:
- Spectral conditioning
- Manifold optimization
- Geometry-aware learning
Than to:
- Adaptive scalar learning rate methods
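The "projection onto the orthogonal manifold" view can be checked numerically: run to convergence, Newton–Schulz lands on the polar factor ( UV^\top ) of ( M ). The sketch below uses the classical cubic iteration (not Muon's ~5 tuned quintic steps) so that it converges tightly:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 6))

# Frobenius-normalize so all singular values are <= 1.
X = M / np.linalg.norm(M)

# Classical cubic Newton-Schulz, run to convergence (unlike Muon's
# ~5 quintic steps): each step maps singular value s -> 1.5s - 0.5s^3,
# which converges to 1 while preserving the singular vectors.
for _ in range(25):
    X = 1.5 * X - 0.5 * X @ (X.T @ X)

# The polar factor of M is U @ Vt from its thin SVD.
U, _, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(X, U @ Vt, atol=1e-6))  # → True
```

Muon's quintic coefficients trade this exact convergence for speed: in 5 steps they push singular values into a loose band around 1, which is enough for an update direction.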
Per 2D weight matrix:
- ~5 Newton–Schulz iterations, each a few small matrix multiplications
- No eigen-decomposition
- No SVD
Efficient on GPUs.
Adam asks:
How big should each coordinate step be?
Muon asks:
What is the best balanced matrix update direction?
Adam scales components.
Muon reshapes directions.
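The contrast can be made concrete: for a nearly rank-1 update, element-wise scaling changes per-coordinate magnitudes but not which directions carry the step, while orthogonalization flattens the spectrum. A small sketch (using an exact SVD in place of Newton–Schulz for clarity):

```python
import numpy as np

rng = np.random.default_rng(2)
# A nearly rank-1 "momentum": one direction dominates, as happens
# when momentum collapses into a narrow subspace.
u = rng.standard_normal((8, 1))
v = rng.standard_normal((1, 8))
M = u @ v + 0.01 * rng.standard_normal((8, 8))

# Muon-style step: orthogonalize M (exact polar factor via SVD here,
# standing in for the Newton-Schulz approximation).
U, s, Vt = np.linalg.svd(M)
O = U @ Vt

print(s[0] / s[1])                           # dominant direction: ratio >> 1
print(np.linalg.svd(O, compute_uv=False))    # all ones: balanced directions
```

Per-coordinate rescaling of `M` (Adam's diagonal preconditioning) would leave the first ratio large; only the matrix-level orthogonalization equalizes it.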