TY - JOUR

T1 - Online stochastic gradient descent on non-convex losses from high-dimensional inference

AU - Arous, Gerard Ben

AU - Gheissari, Reza

AU - Jagannath, Aukosh

N1 - Funding Information:
The authors thank the anonymous referees for their detailed comments and suggestions. The authors thank Y. M. Lu for interesting discussions. G.B.A. thanks A. Montanari, Y. Le Cun, and L. Bottou for interesting discussions at early stages of this project. A.J. thanks S. Sen for helpful comments on this work. R.G. thanks the Miller Institute for Basic Research in Science for their support. A.J. acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [RGPIN-2020-04597, DGECR-2020-00199].
Publisher Copyright:
© 2021 Microtome Publishing. All rights reserved.

PY - 2021

Y1 - 2021

N2 - Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining nontrivial correlation, the descent is rapid and exhibits law of large numbers type behavior. We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, and parameter estimation for generalized linear models, online PCA, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.

AB - Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining nontrivial correlation, the descent is rapid and exhibits law of large numbers type behavior. We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, and parameter estimation for generalized linear models, online PCA, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.

KW - Generalized linear models

KW - Non-convex optimization

KW - Parameter estimation

KW - Stochastic gradient descent

KW - Supervised learning

KW - Tensor PCA

UR - http://www.scopus.com/inward/record.url?scp=85107272441&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85107272441&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85107272441

VL - 22

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

SN - 1532-4435

ER -