Abstract
Deep neural networks can achieve impressive performance in the regime where they are massively over-parameterized. Consequently, over the past year there has been growing interest in analyzing the optimization and generalization properties of over-parameterized networks. However, the majority of existing work applies only to supervised learning; the role of over-parameterization in the unsupervised setting has, by contrast, received far less attention. In this paper, we study the inductive bias of gradient descent for two-layer over-parameterized autoencoders with ReLU activation. We first provide theoretical evidence for the memorization phenomenon observed in recent work, using the property that infinitely wide neural networks trained by gradient descent evolve as linear models. We also analyze the gradient dynamics of autoencoders in the finite-width setting. Starting from a randomly initialized autoencoder network, we rigorously prove linear convergence of gradient descent in two regimes: weakly trained and jointly trained. Our results indicate the considerable benefits of joint training over weak training for finding global optima, achieving a dramatic decrease in the required level of over-parameterization. Finally, we analyze the case of weight-tied autoencoders and prove that, in the over-parameterized setting, training such networks from randomly initialized points leads to certain unexpected degeneracies.
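For readers who want a concrete picture of the setup the abstract describes, the sketch below shows a two-layer ReLU autoencoder x → W2·relu(W1·x) trained by full-batch gradient descent on the squared reconstruction loss. This is a minimal illustration, not the authors' code: the width, step size, and initialization scaling are illustrative choices, and the "weakly trained" variant mentioned in the comment simply freezes one layer.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's code): a two-layer
# ReLU autoencoder x -> W2 @ relu(W1 @ x), trained with full-batch
# gradient descent on the average squared reconstruction loss.
rng = np.random.default_rng(0)
n, d, m = 20, 10, 2048       # n samples, input dimension d, hidden width m (m large = over-parameterized)
eta, steps = 1e-3, 2000      # illustrative step size and iteration count

X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)            # unit-norm inputs

# Random Gaussian initialization, as in NTK-style analyses
W1 = rng.standard_normal((m, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, m)) / np.sqrt(m)

for t in range(steps):
    H = np.maximum(W1 @ X, 0.0)           # hidden ReLU activations, m x n
    R = W2 @ H - X                        # reconstruction residual, d x n
    loss = 0.5 * np.sum(R ** 2) / n

    # Gradients of the average squared reconstruction loss
    G2 = R @ H.T / n                      # dL/dW2
    G1 = ((W2.T @ R) * (H > 0)) @ X.T / n # dL/dW1 (ReLU mask applied)

    # "Joint training" updates both layers; a "weakly trained" variant
    # would skip one of these updates (e.g. keep W1 fixed at init).
    W2 -= eta * G2
    W1 -= eta * G1

print(f"final reconstruction loss: {loss:.4e}")
```

A weight-tied version of this sketch would set W2 = W1.T throughout and accumulate both gradient contributions into the single shared matrix.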
| Original language | English (US) |
|---|---|
| Article number | 9374468 |
| Pages (from-to) | 4669-4692 |
| Number of pages | 24 |
| Journal | IEEE Transactions on Information Theory |
| Volume | 67 |
| Issue number | 7 |
| DOIs | |
| State | Published - Jul 2021 |
Keywords
- Convergence
- Data models
- Decoding
- Heuristic algorithms
- Kernel
- Task analysis
- Training
- Autoencoders
- Gradient dynamics
- Neural tangent kernel
ASJC Scopus subject areas
- Information Systems
- Library and Information Sciences
- Computer Science Applications