College of Science E-Newsletter No. 66

Issue 66, published 2026.07.01

When a Model Is “Singular,” It Learns Better

By Assistant Professor 楊鈞澔, Institute of Statistics and Data Science

A Puzzle at the Heart of Deep Learning

Over the past decade, deep neural networks have rewritten our sense of what machines can do, from recognizing images to generating fluent language. Yet that success left behind a puzzle that troubled classical statisticians: these networks routinely carry millions, even billions, of parameters—far more than the data points they are trained on. Textbook wisdom says such a model should overfit badly and perform terribly on new data. Reality is the opposite: these networks not only avoid collapse, they generalize remarkably well.

This contradiction is exactly what the title of a 2023 paper by Wei and collaborators captured: “Deep Learning Is Singular, and That’s Good.” Here “singular” is not a criticism but a precise mathematical concept. It points to a framework developed before the deep learning boom that nonetheless answers these very puzzles—Singular Learning Theory (SLT), founded by the mathematician Sumio Watanabe.

What Does “Singular” Mean?

Classical statistics calls a model regular if it satisfies two conditions. First, identifiability: distinct parameters correspond to distinct distributions—no two settings produce the same model. Second, the Fisher information matrix is positive definite everywhere, which guarantees that near the best parameter the error surface looks like a clean bowl-shaped paraboloid. The entire classical toolbox—the asymptotic normality of the maximum likelihood estimator, the Laplace approximation, and the BIC criterion for model selection—rests on this “bowl” assumption.

Almost every interesting modern model violates it. A neural network, for instance, carries built-in symmetries:

Permutation symmetry: reordering the hidden units in a layer leaves the output unchanged.
Scale symmetry: with a ReLU activation, scaling a unit’s incoming weights up and its outgoing weights down by the same factor changes nothing.
Dead units: a neuron that never activates can carry any outgoing weights at all without affecting the model.
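The three symmetries listed above can be checked numerically. The following is a minimal sketch, assuming a one-hidden-layer ReLU network with biases; the function name `mlp`, the layer sizes, and the random weights are illustrative choices, not taken from any particular paper. Note that for the scale symmetry the unit's bias has to be rescaled along with its incoming weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, W1, b1, W2):
    """One-hidden-layer ReLU network: x -> W2 @ relu(W1 @ x + b1)."""
    return W2 @ relu(W1 @ x + b1)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # incoming weights of 4 hidden units
b1 = rng.normal(size=4)        # hidden biases
W2 = rng.normal(size=(2, 4))   # outgoing weights
x = rng.normal(size=3)
y = mlp(x, W1, b1, W2)

# Permutation symmetry: reorder the hidden units consistently in both layers.
perm = np.array([2, 0, 3, 1])
y_perm = mlp(x, W1[perm], b1[perm], W2[:, perm])

# Scale symmetry: scale unit 0's incoming weights and bias by c > 0,
# and divide its outgoing weights by the same c.
c = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= c
b1s[0] *= c
W2s[:, 0] /= c
y_scale = mlp(x, W1s, b1s, W2s)

# Dead unit: unit 1 has pre-activation -1 for every input, so relu gives 0
# and its outgoing weights can be anything.
W1d, b1d = W1.copy(), b1.copy()
W1d[1] = 0.0
b1d[1] = -1.0
W2_a = W2.copy()
W2_b = W2.copy()
W2_b[:, 1] = 100.0 * rng.normal(size=2)
y_dead_a = mlp(x, W1d, b1d, W2_a)
y_dead_b = mlp(x, W1d, b1d, W2_b)

print(np.allclose(y, y_perm), np.allclose(y, y_scale), np.allclose(y_dead_a, y_dead_b))
```

In each case two different parameter vectors give the same output, which is exactly the failure of identifiability described in the text.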

These symmetries mean that many different parameter settings represent the same model. When that happens, the optimal parameter is no longer a lone point but an entire curved—sometimes self-intersecting—geometric set. There the bowl collapses and the Fisher information matrix degenerates. That is what “singular” means. Far from being a pathological exception, singularity is the norm for modern models.
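To see the Fisher information matrix degenerate in a concrete case, here is a small sketch using a toy regression model y = a·tanh(b·x) + noise with unit-variance Gaussian noise. The model, the input grid `xs`, and the function name `fisher_information` are assumptions made for illustration. For this noise model the Fisher information is the average outer product of the gradient of the regression function with respect to (a, b).

```python
import numpy as np

def fisher_information(a, b, xs):
    """Fisher information of y = a*tanh(b*x) + N(0, 1) noise, averaged over inputs xs.

    With unit-variance Gaussian noise, I(a, b) = mean over x of grad f grad f^T,
    where f(x) = a*tanh(b*x).
    """
    t = np.tanh(b * xs)
    # Columns: df/da = tanh(b x), df/db = a x (1 - tanh(b x)^2)
    grad = np.stack([t, a * xs * (1.0 - t**2)], axis=1)
    return grad.T @ grad / len(xs)

xs = np.linspace(-2.0, 2.0, 401)

I_regular = fisher_information(1.0, 1.0, xs)    # a != 0: both directions matter
I_singular = fisher_information(0.0, 1.0, xs)   # a == 0: b has no effect on the model

eig_regular = np.linalg.eigvalsh(I_regular)     # eigenvalues in ascending order
eig_singular = np.linalg.eigvalsh(I_singular)

print("eigenvalues at (a, b) = (1, 1):", eig_regular)
print("eigenvalues at (a, b) = (0, 1):", eig_singular)
```

At a = 0 every value of b gives the same zero function, so the set of optimal parameters is a whole line and one eigenvalue of the Fisher information vanishes. Away from a = 0 both eigenvalues are positive and the familiar bowl picture holds.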

The Crucial New Invariant: The RLCT

Since the classical tools break down, we need a new quantity to measure the complexity of this singular structure. That quantity is the Real Log Canonical Threshold (RLCT), usually written λ.

Its most intuitive reading is a “volume of indistinguishability.” Measure the volume of parameter settings whose KL divergence from the true distribution is below some tiny threshold ε, and call this volume V(ε). In a regular model V(ε) scales as ε raised to d/2, where d is the number of parameters; in a singular model it scales as ε raised to λ, up to a logarithmic factor. In the regular case λ equals exactly d/2; in general λ is at most d/2, and in singular models it is typically strictly smaller. It is always a rational number, but it need not be a multiple of 1/2.
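The volume scaling can be illustrated with a small Monte Carlo experiment. This is a sketch under simplifying assumptions: the two functions `K_regular` and `K_singular` are hand-picked stand-ins for a KL divergence in d = 2 parameters, and the prior is uniform on the square [-1, 1]². For the bowl w1² + w2² the exponent should be d/2 = 1. For w1²·w2², whose zero set is the pair of coordinate axes, λ = 1/2, and the logarithmic factor pulls the fitted slope somewhat below 1/2 over a finite range of ε.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(1_000_000, 2))   # uniform prior on [-1, 1]^2

K_regular = w[:, 0]**2 + w[:, 1]**2     # bowl: unique minimum at the origin
K_singular = (w[:, 0] * w[:, 1])**2     # zero set is both coordinate axes

eps = np.logspace(-3, -1, 9)

def volume_exponent(K, eps):
    """Slope of log V(eps) against log eps, where V(eps) is the
    fraction of prior samples with K < eps."""
    V = np.array([(K < e).mean() for e in eps])
    slope, _ = np.polyfit(np.log(eps), np.log(V), 1)
    return slope

exp_regular = volume_exponent(K_regular, eps)
exp_singular = volume_exponent(K_singular, eps)

print("fitted exponent, regular bowl:  ", exp_regular)
print("fitted exponent, singular cross:", exp_singular)
```

The singular function keeps far more volume near its minimum as ε shrinks, which is the geometric content of a smaller λ.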

Because λ takes over the role that d/2 plays in the regular case, a model that is singular at the true parameter behaves as if it had roughly 2λ effective parameters, often far fewer than the nominal count d. That is why such models generalize better than classical theory predicts. The deep learning puzzle then has a natural resolution: much of the enormous parameter count is redundant symmetry, and the degrees of freedom the model genuinely uses are measured by λ.

From Geometry to Computation

How does one actually compute λ? This is where the theory becomes most fascinating. The key tool is the resolution of singularities theorem, proved by Heisuke Hironaka in 1964—an achievement that earned him the Fields Medal. However badly the zero set of the KL function self-intersects, a sequence of coordinate changes (“blow-ups”) can locally reduce it to the simplest monomial form. Once everything is a monomial, the previously intractable integrals become computable term by term, and λ is determined by the exponent data of those monomials. A question about how a model learns ultimately gets answered by how a singularity resolves, linking statistics to deep ideas in algebraic geometry and the poles of zeta functions.
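Once a chart of the resolution is in monomial form, reading off λ is simple arithmetic. The sketch below assumes that in local coordinates u the KL function looks like a product of u_j^(2·k_j) times a positive factor, and the prior times the Jacobian of the blow-up looks like a product of |u_j|^(h_j) times a positive factor. Each coordinate with k_j > 0 then contributes a pole of the zeta function at z = -(h_j + 1)/(2·k_j); λ is the smallest of the values (h_j + 1)/(2·k_j), and the multiplicity m counts how many coordinates attain it. The function name `rlct_monomial` is hypothetical, and a full computation would take the minimum over all charts rather than a single one.

```python
from fractions import Fraction

def rlct_monomial(k, h=None):
    """RLCT and multiplicity for one monomial chart.

    k[j]: K(u) ~ prod_j u_j^(2*k[j])       (exponents of the KL function)
    h[j]: prior * Jacobian ~ prod_j |u_j|^h[j]
    Returns (lambda, m) as (Fraction, int).
    """
    if h is None:
        h = [0] * len(k)
    # Only coordinates that actually appear in K produce poles.
    candidates = [Fraction(hj + 1, 2 * kj) for kj, hj in zip(k, h) if kj > 0]
    lam = min(candidates)
    m = candidates.count(lam)
    return lam, m

# K = u1^2 * u2^2 with a flat prior: lambda = 1/2, multiplicity 2.
print(rlct_monomial([1, 1]))

# The bowl K = w1^2 + w2^2 after the blow-up w1 = u1, w2 = u1*u2:
# K = u1^2 * (1 + u2^2), Jacobian |u1|, so k = [1, 0], h = [1, 0].
# This recovers the regular value lambda = d/2 = 1.
print(rlct_monomial([1, 0], [1, 0]))

# K = u1^2 * u2^4: the higher-order coordinate dominates, lambda = 1/4.
print(rlct_monomial([1, 2]))
```

The second example shows a blow-up doing its job: the bowl is not a monomial in the original coordinates, but after one coordinate change the exponent data give back the classical d/2.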

Why Should We Care?

Watanabe’s theory culminates in two elegant theorems: a model’s free energy and its Bayesian generalization error are governed not by the classical d/2 but by λ. Wei and collaborators verified these predictions empirically—in singular models, Bayesian methods genuinely outperform maximum a posteriori estimation, while the Laplace approximation, which leans on the bowl assumption, fails outright.
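In symbols, the two theorems are usually stated along the following lines. The notation here is an assumption on my part rather than the article's: F_n is the Bayes free energy (the negative log marginal likelihood) for n samples, L_n is the empirical negative log-likelihood per sample, w_0 is an optimal parameter, m is the multiplicity that accompanies λ, and G_n is the Bayes generalization error. Only the leading terms are shown, for the case where the true distribution is realizable by the model.

```latex
\begin{align*}
F_n &= n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1), \\
\mathbb{E}[G_n] &= \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).
\end{align*}
```

In a regular model λ = d/2 and m = 1, so the first line reduces to the familiar BIC penalty (d/2) log n and the second to the classical d/(2n) rate. In a singular model the smaller λ gives both a smaller complexity penalty and a smaller expected generalization error.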

Open challenges remain. For real networks with hundreds of millions of parameters, estimating λ efficiently is a central difficulty; the recently proposed local learning coefficient is the first scalable estimator, but the road ahead is long. Whether stochastic gradient descent approximates Bayesian inference, and whether the power laws seen in large-language-model training are themselves a signature of singularity, are active frontiers.

For me, the appeal of singular learning theory lies in what it reveals: contemporary machine learning may look like a pure contest of engineering and compute, but beneath it lie deep mathematical structures still waiting to be understood. Understanding why something works matters just as much as making it work—and that is precisely where statisticians can contribute.