Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

Research output: Contribution to journal › Conference article › peer-review

Abstract

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize η is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in O(η) steps, and subsequently achieves an Õ(1/(ηt)) convergence rate after t additional steps. Our results imply that, given a budget of T steps, GD can achieve an accelerated loss of Õ(1/T²) with an aggressive stepsize η := Θ(T), without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the Õ(1/T²) acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
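
The setting described in the abstract is easy to reproduce numerically. The following is a minimal sketch, not code from the paper: the synthetic separable data, the stepsize η = 50, and the horizon T = 200 are illustrative assumptions chosen to make the early non-monotone (oscillatory) phase and the later decrease of the logistic loss visible.

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): full-batch GD with a
# large constant stepsize on logistic loss over linearly separable data.
rng = np.random.default_rng(0)
n, d = 100, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)        # labels from a linear rule => separable data
X = X * y[:, None]             # fold labels into features; margins are X @ w

def logistic_loss(w):
    # mean of log(1 + exp(-margin)); logaddexp is numerically stable
    return np.mean(np.logaddexp(0.0, -X @ w))

def gradient(w):
    # d/dm log(1 + exp(-m)) = -sigmoid(-m); tanh form avoids overflow
    s = 0.5 * (1.0 - np.tanh(0.5 * (X @ w)))   # sigmoid(-margin)
    return -(X * s[:, None]).mean(axis=0)

T = 200
eta = 50.0                     # deliberately large constant stepsize (assumed)
w = np.zeros(d)
losses = []
for _ in range(T):
    losses.append(logistic_loss(w))
    w -= eta * gradient(w)

print("first 10 losses:", np.round(losses[:10], 4))    # typically non-monotone
print("last 10 losses :", np.round(losses[-10:], 6))   # eventually decreasing
```

With a small stepsize the loss would decrease monotonically but slowly; the point of the result is that the large, initially unstable stepsize still yields fast convergence once the oscillatory phase ends.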

Original language: English (US)
Pages (from-to): 5019-5073
Number of pages: 55
Journal: Proceedings of Machine Learning Research
Volume: 247
State: Published - 2024
Event: 37th Annual Conference on Learning Theory, COLT 2024 - Edmonton, Canada
Duration: Jun 30 2024 - Jul 3 2024

Keywords

  • acceleration
  • edge of stability
  • gradient descent
  • logistic regression
  • optimization

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
