Scaling description of generalization with number of parameters in deep learning

Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane D'Ascoli, Giulio Biroli, Clément Hongler, Matthieu Wyart

Research output: Contribution to journal › Article › peer-review


Supervised deep learning involves the training of neural networks with a large number $N$ of parameters. For large enough $N$, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as $N$ grows past a certain threshold $N^*$. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with $N$. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations $\|f_N - \langle f_N \rangle\| \sim N^{-1/4}$ of the neural net output function $f_N$ around its expectation $\langle f_N \rangle$. These affect the generalization error for classification: under natural assumptions, it decays to a plateau value in a power-law fashion $\sim N^{-1/2}$. This description breaks down at a so-called jamming transition $N = N^*$. At this threshold, we argue that $\|f_N\|$ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at $N^*$. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond $N^*$, and averaging their outputs.
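
The scaling argument in the abstract can be illustrated with a small numerical sketch. This is not the authors' code or experimental setup: it is a hypothetical toy model that assumes each network's output equals a shared expected output plus independent Gaussian noise of amplitude N^(-1/4), and the function names `toy_outputs` and `fluctuation_variance` are invented for the illustration. Under that assumption, the variance of the fluctuations scales as N^(-1/2), and averaging the outputs of several independently initialized networks reduces it further.

```python
import numpy as np


def toy_outputs(n_params, n_networks, n_points, rng):
    """Hypothetical toy model (an assumption, not the paper's method):
    each network's output is a shared mean plus independent Gaussian
    noise whose amplitude scales as n_params ** -0.25."""
    mean_output = np.zeros(n_points)  # stand-in for the expected output
    noise = rng.standard_normal((n_networks, n_points))
    return mean_output + n_params ** -0.25 * noise


def fluctuation_variance(outputs):
    """Mean squared deviation of the outputs from the toy mean (zero)."""
    return float(np.mean(outputs ** 2))


rng = np.random.default_rng(0)
n_points = 20_000

# One network with N parameters versus one network with 4N parameters.
var_small = fluctuation_variance(toy_outputs(10_000, 1, n_points, rng))
var_large = fluctuation_variance(toy_outputs(40_000, 1, n_points, rng))

# Four networks with N parameters each, outputs averaged point by point.
ensemble = toy_outputs(10_000, 4, n_points, rng).mean(axis=0)
var_ensemble = fluctuation_variance(ensemble)

print(var_small / var_large)     # close to 2: variance scales as N ** -0.5
print(var_small / var_ensemble)  # close to 4: averaging 4 independent nets
```

In this toy, four averaged networks of size N show smaller fluctuations than a single network of size 4N at the same total parameter count. The sketch ignores both the plateau value of the generalization error and the jamming transition at N*, below which the abstract states that the scaling description breaks down, so it says nothing about how small the individual networks can usefully be.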

Original language: English (US)
Article number: 023401
Journal: Journal of Statistical Mechanics: Theory and Experiment
Issue number: 2
State: Published - Feb 2020


Keywords

  • learning theory
  • machine learning

ASJC Scopus subject areas

  • Statistical and Nonlinear Physics
  • Statistics and Probability
  • Statistics, Probability and Uncertainty

