Averaging weights leads to wider optima and better generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
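The SWA procedure described in the abstract amounts to keeping a running equal average of the weights visited by SGD after a warm-up period, while training continues with a cyclical or constant learning rate. The following is a minimal sketch in PyTorch; the helper name update_swa, the warm-up length, and the toy model and data are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of Stochastic Weight Averaging (SWA) in PyTorch.
# Names such as `update_swa`, `swa_start`, and the toy model/data are
# illustrative assumptions, not the authors' reference implementation.
import copy
import torch

def update_swa(swa_model, model, n_averaged):
    """Fold the current weights into the running average:
    w_swa <- (n * w_swa + w) / (n + 1)."""
    for p_swa, p in zip(swa_model.parameters(), model.parameters()):
        p_swa.data.mul_(n_averaged / (n_averaged + 1.0))
        p_swa.data.add_(p.data / (n_averaged + 1.0))
    return n_averaged + 1

# Stand-ins for a real network, optimizer, and data.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
swa_model = copy.deepcopy(model)   # holds the running weight average
n_averaged, swa_start = 0, 5       # start averaging after a warm-up period

for epoch in range(10):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # dummy batch
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch >= swa_start:  # averaging phase with constant (or cyclical) learning rate
        n_averaged = update_swa(swa_model, model, n_averaged)

# Before evaluating swa_model, batch-norm statistics should be recomputed
# with a forward pass over the training data (omitted here).
```

At test time the single averaged model swa_model is used, which is why SWA adds almost no computational overhead compared with ensembling approaches such as FGE.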

Original language: English (US)
Title of host publication: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018
Editors: Ricardo Silva, Amir Globerson
Publisher: Association for Uncertainty in Artificial Intelligence (AUAI)
Pages: 876-885
Number of pages: 10
ISBN (Electronic): 9781510871601
State: Published - 2018
Event: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018 - Monterey, United States
Duration: Aug 6, 2018 - Aug 10, 2018

Publication series

Name: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018
Volume: 2

Other

Other: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018
Country/Territory: United States
City: Monterey
Period: 8/6/18 - 8/10/18

ASJC Scopus subject areas

  • Artificial Intelligence
