Source: Redwood Center for Theoretical Neuroscience Seminar
Author: Ben Dongsung Huh (IBM Research)
Venue: Redwood Center for Theoretical Neuroscience, UC Berkeley
Date: February 11, 2026
Length: 1:15:07
Deep learning generalizes through high-dimensional interpolation — filling gaps between trillions of training points. Scaling laws show steady improvement as data, compute, and model size increase. But this paradigm fails when you need systematic generalization.
Models exploit statistical shortcuts instead of capturing underlying structure:
- Vision: A model learns that cows appear in grassy backgrounds. Change the background and it misclassifies the cow.
- Reasoning: LLMs perform arithmetic well on digit counts seen during training, then fail on longer numbers.
- General: Even at massive scale, models rely on heuristics. Interpolation cannot substitute for structure.
"The reality is that you can't really scale these interpolations to substitute for the structure."
Candidates for out-of-distribution generalization:
- Causal structures, symbolic logic, program synthesis — fundamentally discrete and combinatorial, so gradients flow poorly through them. Hard to integrate with deep learning.
- Symmetry structures — the principle that generalizes local observations into universal laws. This is the talk's focus.
Symmetry already works in deep learning: equivariant and convolutional neural networks constrain architecture to respect known symmetries, yielding immediate out-of-domain generalization and better sample efficiency. But they require you to already know which symmetry matters. They cannot discover new symmetries or the latent space where symmetries operate.
- Implicit bias (low-rank): Deep networks gravitate toward smooth, low-rank solutions through depth, update rules, and regularizers. This matches the manifold hypothesis — natural data lies on low-dimensional manifolds. The network discovers which dimensions matter without manual specification.
- Symmetry constraints (equivariance): Hand-built into architecture. Provides systematic generalization but cannot discover structure on its own.
The central question: Can we induce an inductive bias toward discovering symmetry structures, rather than building them in by hand?
Consider a two-layer linear network with product weight W̄ = W₁W₂. Gradient flow conserves a balance quantity: the time derivative of W₁ᵀW₁ − W₂W₂ᵀ vanishes along the trajectory. If weights start near zero, both layers remain balanced throughout training, sharing the same singular modes.
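The conservation law can be checked numerically. A minimal sketch (my own check, not from the talk) using one standard form of the balance quantity, Q = W₁ᵀW₁ − W₂W₂ᵀ, on a random least-squares target:

```python
import numpy as np

# Sketch: the balance quantity Q = W1^T W1 - W2 W2^T stays (approximately)
# constant under gradient descent on a two-layer linear network
# W_bar = W1 @ W2 fit to a random target Y. Exact conservation holds for
# gradient flow; discrete steps add only O(lr^2) Euler error per step.
rng = np.random.default_rng(0)
n = 8
Y = rng.standard_normal((n, n))
W1 = 0.01 * rng.standard_normal((n, n))   # small initialization
W2 = 0.01 * rng.standard_normal((n, n))
Q0 = W1.T @ W1 - W2 @ W2.T

lr = 1e-3
for _ in range(1000):
    G = W1 @ W2 - Y                        # gradient of 0.5*||W1 W2 - Y||^2 w.r.t. product
    W1, W2 = W1 - lr * G @ W2.T, W2 - lr * W1.T @ G

drift = np.abs((W1.T @ W1 - W2 @ W2.T) - Q0).max()
assert drift < 0.05                        # conserved up to discretization error
```

The first-order terms of the discrete update cancel in Q exactly, which is why the drift is tiny even over many steps.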
The product's singular values equal each layer's singular values raised to the power D (depth). This creates a separation of timescales: dominant modes learn fast; weak modes stay suppressed. Deeper networks amplify this effect, producing incremental rank learning — the network acquires a few dimensions at a time, yielding low-rank structure.
The Netflix Challenge made this concrete: recover a full user-movie preference matrix from sparse observations, assuming low rank. Classical approaches use the nuclear norm as a convex proxy for rank (Candès & Recht, 2009; Candès & Tao, 2010). Deep linear networks (matrix factorization) now solve matrix completion better than classical methods by leveraging the same implicit low-rank bias.
Starting from large weights with L2 penalty gives the same result as starting from small weights. The L2 penalty, expressed in product singular values, becomes a sum of σ̄^(2/D). At depth D=1, this is ordinary Frobenius norm. At depth D=2, nuclear norm — a rank proxy. As D increases, the exponent approaches zero, approximating rank ever more closely.
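The D=2 case is easy to verify directly. A sketch (my own check, not from the talk): for a balanced factorization W = W₁W₂, where each layer carries the square root of each singular value, the total L2 penalty equals twice the nuclear norm of the product:

```python
import numpy as np

# Sketch: at depth D=2, a balanced factorization W = W1 @ W2 satisfies
# ||W1||_F^2 + ||W2||_F^2 = 2 * ||W||_nuclear, so L2 weight decay on the
# layers acts as a nuclear-norm (rank-proxy) penalty on the product.
rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4
U, s, Vt = np.linalg.svd(W, full_matrices=False)

W1 = U @ np.diag(np.sqrt(s))      # balanced split: each layer gets sqrt(sigma)
W2 = np.diag(np.sqrt(s)) @ Vt
assert np.allclose(W1 @ W2, W)

l2 = np.sum(W1**2) + np.sum(W2**2)
assert np.isclose(l2, 2 * s.sum())  # 2 * nuclear norm of W
```

The same construction at depth D gives each layer singular values σ̄^(1/D), so the summed L2 penalty becomes the σ̄^(2/D) sum described above.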
A group is a set with a binary operation satisfying:
- Closure: the operation keeps you in the set
- Associativity: (a ∘ b) ∘ c = a ∘ (b ∘ c)
- Identity: a unique element that leaves others unchanged
- Inverse: every element has one
The operation encodes as a Cayley table (multiplication table) or equivalently as a structure tensor δ — a sparse {0,1} tensor where δ(a,b,c) = 1 if and only if a ∘ b = c.
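For concreteness, here is a minimal sketch (my own illustration) of the structure tensor for the cyclic group Zₙ under addition mod n:

```python
import numpy as np

# Sketch: structure tensor delta for the cyclic group Z_n (addition mod n).
# delta[a, b, c] = 1 iff a + b = c (mod n).
n = 5
delta = np.zeros((n, n, n), dtype=int)
for a in range(n):
    for b in range(n):
        delta[a, b, (a + b) % n] = 1

# Each (a, b) pair has exactly one product, and each slice delta[a] is a
# permutation matrix -- the Cayley table of a group is a Latin square.
assert delta.sum(axis=2).min() == delta.sum(axis=2).max() == 1
assert all(np.array_equal(delta[a] @ delta[a].T, np.eye(n)) for a in range(n))
```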
A group representation ρ maps group elements to matrices preserving the group operation: ρ(g₁ ∘ g₂) = ρ(g₁)ρ(g₂). Matrix multiplication is inherently associative, so representations satisfy associativity automatically. Unitary representations — where the matrices are unitary (full-rank, maximally invertible) — are the "nicest" kind, the antithesis of low-rank matrices.
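The homomorphism property is simple to check on a concrete unitary representation. A sketch (my illustrative choice, not from the talk) using the 1-dimensional representation ρ(k) = exp(2πik/n) of Zₙ:

```python
import numpy as np

# Sketch: verify rho(a) * rho(b) = rho(a + b mod n) for the 1-D unitary
# representation rho(k) = exp(2*pi*i*k/n) of the cyclic group Z_n.
n = 7
rho = lambda k: np.exp(2j * np.pi * k / n)

for a in range(n):
    for b in range(n):
        assert np.isclose(rho(a) * rho(b), rho((a + b) % n))
assert all(np.isclose(abs(rho(k)), 1.0) for k in range(n))  # unitary (|rho| = 1)
```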
All equivariant neural networks reduce to group convolution layers (Kondor and Trivedi), defined in terms of the structure tensor δ. Standard equivariant networks prescribe δ top-down from a known symmetry. Learning the symmetry means recovering δ from data — a bottom-up process.
This becomes a Cayley table completion task: given partial observations of the operation table, infer the rest.
The model consists of three order-3 tensors (A, B, C), each n×n×n, where n is the number of group elements. For indices a, b, c, it computes:
T(a,b,c) = (1/n) Tr(Aₐ · Bᵦ · Cᶜ)
where Aₐ, Bᵦ, Cᶜ are n×n matrix slices. This is a normalized trace of a triple matrix product: "multiplying three cubes to make another cube," hence the name HyperCube.
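The forward map is a one-line einsum. As a sanity check (my own, not from the talk), plugging in the regular representation of Zₙ, with Aₐ = Bₐ = Pᵃ and Cᶜ = (Pᶜ)ᵀ for the cyclic shift matrix P, reproduces the structure tensor δ exactly:

```python
import numpy as np

# Sketch of the HyperCube forward map T(a,b,c) = (1/n) Tr(A_a B_b C_c).
# With the regular representation of Z_n, Tr(P^a P^b P^{-c}) = n iff
# a + b = c (mod n), so T equals the structure tensor delta.
n = 5
P = np.roll(np.eye(n), 1, axis=1)                 # cyclic shift (regular rep generator)
A = np.stack([np.linalg.matrix_power(P, a) for a in range(n)])
B = A.copy()
C = np.stack([m.T for m in A])

T = np.einsum('aij,bjk,cki->abc', A, B, C) / n    # (1/n) Tr(A_a B_b C_c)

delta = np.zeros((n, n, n))
for a in range(n):
    for b in range(n):
        delta[a, b, (a + b) % n] = 1
assert np.allclose(T, delta)
```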
This tensor product is distinct from CP or Tucker decomposition and more closely related to group representation theory.
Instead of L2 regularization, HyperCube penalizes the squared Frobenius norms of pairwise factor products, summed over all three cyclic pairings:
H = Σ_{b,c} ‖Bᵦ · Cᶜ‖² + Σ_{c,a} ‖Cᶜ · Aₐ‖² + Σ_{a,b} ‖Aₐ · Bᵦ‖²
Each term equals the squared Jacobian norm with respect to the omitted factor (e.g., ‖∂T/∂Aₐ‖² = ‖BᵦCᶜ‖²). The goal: minimize H subject to T = δ (the structure tensor).
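The regularizer is also a few einsums. A sketch (my own check, reading H in the unnormalized form above): on the unitary regular-representation solution for Zₙ, every pairwise product such as BᵦCᶜ is itself a permutation matrix with squared Frobenius norm n, so H = 3n³:

```python
import numpy as np

# Sketch of the HyperCube regularizer H: squared Frobenius norms of all
# pairwise slice products, summed over the three cyclic pairings.
n = 5
P = np.roll(np.eye(n), 1, axis=1)
A = np.stack([np.linalg.matrix_power(P, a) for a in range(n)])
B, C = A.copy(), np.stack([m.T for m in A])

def hypercube_H(A, B, C):
    # sum_{b,c} ||B_b C_c||^2 + cyclic permutations
    bc = np.einsum('bij,cjk->bcik', B, C)
    ca = np.einsum('cij,ajk->caik', C, A)
    ab = np.einsum('aij,bjk->abik', A, B)
    return sum((m**2).sum() for m in (bc, ca, ab))

# On the unitary Z_n solution, each of the 3*n^2 products is a permutation
# matrix with ||.||_F^2 = n, giving H = 3 * n^3.
assert np.isclose(hypercube_H(A, B, C), 3 * n**3)
```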
With 60% of the Cayley table observed:
- No regularization: Training loss drops to zero instantly (overparameterized), but test loss stays high. The learned tensor bears no resemblance to δ. Factor singular values show no meaningful structure.
- L2 regularization: Captures some off-diagonal patterns but misses the exact structure. Factors develop low-rank structure (some singular values collapse to zero) — the wrong inductive bias for this task.
- HyperCube objective (H): Recovers δ exactly — near-perfect zeros and ones. Train and test loss both reach zero. Factor singular values all converge to unity. The model learns unitary representations.
After gauge transformation (choosing a basis where the identity element maps to the identity matrix), all factor slices synchronize: Aᵍ = Bᵍ = (Cᵍ)ᵀ. These become proper matrix representations satisfying homomorphism. A further change of basis reveals all irreducible representations (irreps) — for S₃, the trivial rep, sign rep, and 2×2 rep. The model discovers the group's regular representation.
The solution is unique (up to gauge transformation), despite quartic loss and trilinear constraints.
Using the same dataset as the OpenAI "Grokking" paper, across a range of binary operations:
- Transformers have a nonspecific bias: they generalize well on "simple" operations (modular addition) but poorly on complex ones (S₅). They find commutative operations easier (A+B is learned more readily than A−B). Convergence is extremely slow when few observations are available.
- HyperCube has a specific bias that covers all groups equally: A+B and A−B are identical in difficulty (they are group isotopes), and S₅ is no harder than modular addition. It converges ~100x faster than transformers on S₅, with near-zero variance across runs.
HyperCube is equivariant under isotopic transformations — independently relabeling the indices a, b, c — so any group isotope is equally easy to learn.
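The A+B/A−B equivalence is easy to see concretely. A sketch (my own illustration): relabeling the second input b → (n − b) mod n, a pure column permutation, turns the mod-n addition table into the subtraction table, so the two are isotopes:

```python
import numpy as np

# Sketch: the Cayley tables of a+b (mod n) and a-b (mod n) are isotopes.
# Permuting the columns by b -> -b (mod n) maps one table onto the other.
n = 7
add = np.fromfunction(lambda a, b: (a + b) % n, (n, n), dtype=int)
sub = np.fromfunction(lambda a, b: (a - b) % n, (n, n), dtype=int)

relabel = (n - np.arange(n)) % n          # column permutation b -> -b
assert np.array_equal(add[:, relabel], sub)
```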
Apply Cauchy-Schwarz to T(a,b,c) = (1/n)⟨Aₐ†, BᵦCᶜ⟩:
|T(a,b,c)|² ≤ (1/n²) ‖Aₐ‖² · ‖BᵦCᶜ‖²
This yields a lower bound on the Jacobian norm:
‖BᵦCᶜ‖² ≥ n² |T(a,b,c)|² / ‖Aₐ‖²
Equality holds if and only if the Jacobian and factors are collinearly aligned. Summing over all terms decomposes H into:
H = B (lower bound) + R (misalignment remainder)
where R ≥ 0 measures how far from collinear the factors are.
Within the collinear manifold (R = 0), factors satisfy AₐAₐ† = Cᶜ†Cᶜ (up to a constant). The left side depends only on index a, the right only on c, so both equal a constant matrix X — a Gram matrix with trace n.
Define κ as a dimensionless ratio of factor norms to T norms. Then P = κX is a projection matrix (P² = P), so its eigenvalues are 0 or 1. Its trace equals the rank:
rank = n · κ
A striking result: κ represents discrete rank exactly in continuous values — unlike nuclear norm or other standard proxies, which always approximate.
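An illustrative check (my own, with κ taken at its full-rank value of 1) of the balance condition and the κ-rank relation on the unitary Zₙ solution:

```python
import numpy as np

# Sketch: for the regular representation of Z_n, every Gram matrix
# A_a A_a^T equals the identity, so the constant matrix X = I has trace n,
# P = kappa * X with kappa = 1 is a projection, and rank = n * kappa = n.
n = 5
P_shift = np.roll(np.eye(n), 1, axis=1)
A = np.stack([np.linalg.matrix_power(P_shift, a) for a in range(n)])

X = A[0] @ A[0].T
assert all(np.allclose(Aa @ Aa.T, X) for Aa in A)   # same Gram matrix for all a
assert np.allclose(X @ X, X)                         # projection: P^2 = P
kappa = 1.0
assert np.isclose(np.trace(X), n * kappa)            # trace = rank = n * kappa
```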
"We had, for matrix factorization, nuclear norm and so on. But those are not exact. Whenever you try to embed a discrete quantity into the continuous domain, you have to do some approximation. This seems to be the exact way to represent rank in terms of continuous values."
The AM-GM inequality applied to the lower bound B shows that, under the constraint T = δ, H is a rank-maximizing objective — the opposite of deep learning's usual bias. H reaches its minimum when κ = 1 (full rank), forcing all factors to be unitary.
When factors are unitary and collinear, they must:
- Synchronize: A = B = Cᵀ (a single representation ρ)
- Satisfy homomorphism: ρ(g₁)ρ(g₂) = ρ(g₁ ∘ g₂)
- Be injective: distinct elements map to distinct matrices
- Enforce associativity: matrix multiplication is inherently associative, so the underlying operation must be too
Main theorem: The feasible collinear manifold (the set of factorizations achieving R = 0) is non-empty if and only if the structure tensor δ is isotopic to a group. For non-groups, every factorization incurs strictly positive misalignment.
The "Strong Collinearity Dominance" conjecture bounds B from below in terms of R:
B(Θ; δ) ≥ 3|δ| − c · R(Θ; δ), where c ∈ [0, 1)
In other words, any reduction in B from leaving the collinear manifold is more than offset by the increase in R. Experiments across all Latin squares for small n confirm this (with c ≈ 0.28): H increases approximately linearly with associativity violations. The HyperCube objective serves as a differentiable proxy for associativity.
HyperCube introduces an inductive bias toward maximizing rank and discovering unitary group representations — the opposite of deep learning's typical low-rank bias.
Key properties:
- Universal: works for all groups equally, favoring no particular symmetry
- Automatic: discovers symmetry from partial observations without hand-engineering
- Exact: recovers the regular representation (all irreducible representations)
- Unique: converges to the same solution (up to gauge transformation) regardless of initialization
- Fast: ~100x fewer optimization steps than transformers on equivalent tasks
Limitations:
- Scalability: The structure tensor scales as n³ in group elements — cubic growth limits applicability to large groups.
- Integration: No clear path yet to embedding HyperCube within large neural networks as a differentiable module.
- Continuous groups: Limited to finite groups. Continuous symmetries (rotations, translations) remain out of scope.
- Existence proof, not endpoint: The model demonstrates that inductive bias toward group discovery is achievable, but may lack the practicality for production deployment.
"I think this is a nice existential proof that there is a way to induce inductive bias towards discovering groups... HyperCube may not be the ultimate practical model to do so, but it allows us to look for other ways to achieve the same thing."
- Huh, D. (2024). "Discovering Abstract Symbolic Relations by Learning Unitary Group Representations." arXiv:2402.17002. The original HyperCube paper with model definition and experiments.
- Huh, D. & Jeong, H. (2025). "Implicit Unitarity Bias in Tensor Factorization: A Theoretical Framework for Symmetry Group Discovery." arXiv:2511.23152. The theoretical companion paper with proofs of the Cauchy-Schwarz decomposition, balance condition, κ-rank relationship, and collinearity-group equivalence.
- Saxe, A., McClelland, J., & Ganguli, S. (2014). "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." ICLR 2014. arXiv:1312.6120.
- Candès, E. & Recht, B. (2009). "Exact Matrix Completion via Convex Relaxation." Found. Comput. Math. arXiv:0805.4471.
- Candès, E. & Tao, T. (2010). "The Power of Convex Relaxation." J. ACM. arXiv:0903.1476.
- Kondor, R. & Trivedi, S. (2018). "On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups." ICML 2018. arXiv:1802.03690.
- Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." arXiv:2201.02177.