Abstract
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the stepsize η is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in O(η) steps, and subsequently achieves an Õ(1/(ηt)) convergence rate after t additional steps. Our results imply that, given a budget of T steps, GD can achieve an accelerated loss of Õ(1/T²) with an aggressive stepsize η := Θ(T), without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the Õ(1/T²) acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
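As a concrete illustration of the setting described in the abstract, the following is a minimal sketch (not the paper's code or experiments) of constant-stepsize GD on logistic regression with synthetic linearly separable data. The problem sizes, random seed, and the particular stepsize value are illustrative assumptions; the abstract's accelerated Õ(1/T²) rate corresponds to choosing η := Θ(T).

```python
import numpy as np

# Minimal sketch of the setting in the abstract (not the paper's code):
# constant-stepsize GD on logistic regression with linearly separable data.
# Problem sizes, seed, and the stepsize value are illustrative assumptions.

rng = np.random.default_rng(0)
n, d, T = 200, 5, 1000

# Linearly separable data: labels come from a ground-truth direction w_star.
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

def logistic_loss(w):
    # mean_i log(1 + exp(-y_i <w, x_i>)), computed stably via logaddexp
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def gradient(w):
    # gradient of the mean logistic loss: -mean_i sigmoid(-m_i) y_i x_i, with m_i = y_i <w, x_i>
    margins = y * (X @ w)
    coeffs = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))  # numerically stable sigmoid(-m_i)
    return (coeffs[:, None] * X).mean(axis=0)

eta = 100.0  # a large constant stepsize; the abstract's accelerated rate takes eta = Theta(T)
w = np.zeros(d)
losses = []
for _ in range(T):
    losses.append(logistic_loss(w))
    w -= eta * gradient(w)

# Early iterates typically show a non-monotone (oscillating) loss before it settles
# into a decreasing phase, matching the two-phase behavior described in the abstract.
print("first 5 losses:", np.round(losses[:5], 3))
print("final loss:    ", logistic_loss(w))
```

Plotting `losses` against the iteration count would visualize the two phases: an initial non-monotone stretch followed by decay; a conservative stepsize (for example, one below the inverse smoothness) would instead decrease the loss monotonically but more slowly.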
Original language | English (US) |
---|---|
Pages (from-to) | 5019-5073 |
Number of pages | 55 |
Journal | Proceedings of Machine Learning Research |
Volume | 247 |
State | Published - 2024 |
Event | 37th Annual Conference on Learning Theory, COLT 2024 - Edmonton, Canada (Jun 30 2024 → Jul 3 2024) |
Keywords
- acceleration
- edge of stability
- gradient descent
- logistic regression
- optimization
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability