TY - JOUR
T1 - Parameters as interacting particles: Long time convergence and asymptotic error scaling of neural networks
T2 - 32nd Conference on Neural Information Processing Systems, NeurIPS 2018
AU - Rotskoff, Grant M.
AU - Vanden-Eijnden, Eric
N1 - Funding Information:
We would like to thank Andrea Montanari and Matthieu Wyart for useful discussions regarding the fixed points of gradient flows in the Wasserstein metric. GMR was supported by the James S. McDonnell Foundation. EVE was supported by National Science Foundation (NSF) Materials Research Science and Engineering Center Program Award DMR-1420073; and by NSF Award DMS-1522767.
Publisher Copyright:
© 2018 Curran Associates Inc. All rights reserved.
PY - 2018
Y1 - 2018
AB - The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a given high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters n is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as O(n^-1). In this limit, we prove a dynamical variant of the universal approximation theorem showing that the optimal representation can be attained by stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale O(n^-1). These fluctuations in the landscape identify the natural scale for the noise in stochastic gradient descent. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions.
UR - http://www.scopus.com/inward/record.url?scp=85064832054&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064832054&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85064832054
SN - 1049-5258
VL - 2018-December
SP - 7146
EP - 7155
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
Y2 - 2 December 2018 through 8 December 2018
ER -