Muon (MomentUm Orthogonalized by Newton–Schulz) is a matrix-aware optimizer for neural networks that:
- Keeps momentum
- Orthogonalizes that momentum for 2D weight matrices
- Uses the orthogonalized matrix as the update direction
It is designed specifically for matrix-structured parameters (like linear layer weights in transformers).
Instead of scaling each parameter independently, Muon reshapes the entire weight update matrix so its directions are balanced and orthogonal.
In deep networks, especially transformers:
- Linear layers are 2D matrices
- Momentum updates tend to become low-rank
- A few dominant directions drive learning
- Rare directions get ignored
Muon forces the update matrix to be well-conditioned, ensuring:
- No single direction dominates
- Rare but important directions are amplified
- Learning uses the full parameter space
For a 2D weight matrix ( W ), take the gradient
[ G = \frac{\partial L}{\partial W} ]
accumulate momentum
[ M_t = \beta M_{t-1} + (1-\beta)G ]
and normalize by the Frobenius norm so every singular value is at most 1:
[ M_t \leftarrow \frac{M_t}{\|M_t\|_F} ]
Instead of computing the exact orthogonalization via SVD, apply the Newton–Schulz iteration:
[ X_{k+1} = aX_k + bX_k(X_k^T X_k) + cX_k(X_k^T X_k)^2 ]
Repeated ~5 times.
This pushes singular values toward 1.
Result: [ O \approx UV^\top ], the nearest semi-orthogonal matrix to ( M ) (where ( M = U\Sigma V^\top ) is its SVD).
[ W \leftarrow W - \eta O ]
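The full update above can be sketched in NumPy. This is a minimal, illustrative version: `beta`, `eta`, and the small epsilon are hypothetical hyperparameters, and `a`, `b`, `c` are one published choice of quintic Newton–Schulz coefficients.

```python
import numpy as np

def muon_step(W, G, M, beta=0.95, eta=0.02, ns_steps=5,
              a=3.4445, b=-4.7750, c=2.0315):
    """One Muon update for a 2D weight matrix W with gradient G.

    M is the momentum buffer (same shape as W). The a, b, c values
    are one published choice of quintic Newton-Schulz coefficients;
    beta and eta are illustrative, not canonical.
    """
    # 1. Momentum accumulation.
    M = beta * M + (1 - beta) * G
    # 2. Frobenius normalization so all singular values lie in (0, 1].
    X = M / (np.linalg.norm(M) + 1e-7)
    # 3. Newton-Schulz iteration: pushes singular values toward 1
    #    using only matrix multiplications (no SVD, no eigendecomp).
    for _ in range(ns_steps):
        A = X.T @ X
        X = a * X + b * (X @ A) + c * (X @ A @ A)
    # 4. Apply the (approximately) orthogonalized direction.
    W = W - eta * X
    return W, M
```

After a step, the applied direction `(W_old - W_new) / eta` has singular values clustered near 1, whereas the raw momentum typically has a skewed spectrum.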
Muon is:
- Matrix-structured
- Geometry-aware
- Low-rank correcting
- Momentum-based
- GPU-friendly (matmuls only)
- Memory-efficient (1× extra state for momentum, versus Adam's 2×)

Muon is not:
- Element-wise adaptive
- Variance-scaling based
- Diagonally preconditioned
- Explicitly SVD-based (the orthogonalization is only approximated)
Why this helps:
- Momentum updates often collapse to a few dominant directions.
- Orthogonalization improves conditioning of attention and MLP blocks.
- No second-moment buffer is required.
Muon works well for:
- Transformers
- Language models
- MLP-heavy architectures
- Mid-sized models (where geometry matters but compute budget is tight)

It is typically not applied to:
- 1D parameters (biases, norms)
- Extremely small matrices
- Extremely high-aspect-ratio matrices (embeddings, output heads)

These parameters usually fall back to a separate optimizer such as AdamW.
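A routing rule along these lines can be sketched in plain Python. The parameter names, shapes, and the `"embed"`/`"head"` name heuristic are all invented for illustration:

```python
def partition_params(named_shapes):
    """Split parameters into Muon-eligible matrices and fallback ones.

    named_shapes: dict mapping parameter name -> shape tuple.
    2D hidden-layer matrices go to Muon; 1D parameters (biases,
    norms) and embedding/output-head matrices go to a fallback
    optimizer such as AdamW. The name check is a toy heuristic.
    """
    muon, fallback = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding_like = "embed" in name or "head" in name
        if is_matrix and not is_embedding_like:
            muon.append(name)
        else:
            fallback.append(name)
    return muon, fallback

params = {"attn.q_proj": (512, 512),
          "mlp.bias": (2048,),
          "tok_embed": (50257, 512)}
print(partition_params(params))
# → (['attn.q_proj'], ['mlp.bias', 'tok_embed'])
```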
You can think of Muon as:
- Projecting momentum onto the orthogonal manifold
- A cheap approximation of “optimal direction balancing”
- A way to keep updates from collapsing into narrow subspaces
It is closer to:
- Spectral conditioning
- Manifold optimization
- Geometry-aware learning
Than to:
- Adaptive scalar learning rate methods
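The "projection onto the orthogonal manifold" view can be checked numerically: run to convergence, Newton–Schulz lands on the polar factor ( UV^\top ) of ( M ). The sketch below uses the classical cubic iteration (not Muon's ~5 tuned quintic steps) so that it converges tightly:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 6))

# Frobenius-normalize so all singular values are <= 1.
X = M / np.linalg.norm(M)

# Classical cubic Newton-Schulz, run to convergence (unlike Muon's
# ~5 quintic steps): each step maps singular value s -> 1.5s - 0.5s^3,
# which converges to 1 while preserving the singular vectors.
for _ in range(25):
    X = 1.5 * X - 0.5 * X @ (X.T @ X)

# The polar factor of M is U @ Vt from its thin SVD.
U, _, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(X, U @ Vt, atol=1e-6))  # → True
```

Muon's quintic coefficients trade this exact convergence for speed: in 5 steps they push singular values into a loose band around 1, which is enough for an update direction.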
Per 2D weight matrix:
- ~5 Newton–Schulz iterations, each a few small matrix multiplications
- No eigen-decomposition
- No SVD
Efficient on GPUs.
Adam asks:
How big should each coordinate step be?
Muon asks:
What is the best balanced matrix update direction?
Adam scales components.
Muon reshapes directions.
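The contrast can be made concrete: for a nearly rank-1 update, element-wise scaling changes per-coordinate magnitudes but not which directions carry the step, while orthogonalization flattens the spectrum. A small sketch (using an exact SVD in place of Newton–Schulz for clarity):

```python
import numpy as np

rng = np.random.default_rng(2)
# A nearly rank-1 "momentum": one direction dominates, as happens
# when momentum collapses into a narrow subspace.
u = rng.standard_normal((8, 1))
v = rng.standard_normal((1, 8))
M = u @ v + 0.01 * rng.standard_normal((8, 8))

# Muon-style step: orthogonalize M (exact polar factor via SVD here,
# standing in for the Newton-Schulz approximation).
U, s, Vt = np.linalg.svd(M)
O = U @ Vt

print(s[0] / s[1])                           # dominant direction: ratio >> 1
print(np.linalg.svd(O, compute_uv=False))    # all ones: balanced directions
```

Per-coordinate rescaling of `M` (Adam's diagonal preconditioning) would leave the first ratio large; only the matrix-level orthogonalization equalizes it.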