TY - JOUR
T1 - Kernel-Based Smoothness Analysis of Residual Networks
AU - Tirer, Tom
AU - Bruna, Joan
AU - Giryes, Raja
N1 - Funding Information:
TT and RG acknowledge support from the European Research Council (ERC StG 757497 PI Giryes) and Nvidia for donating a GPU. JB acknowledges partial support from the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, and Samsung Electronics.
Publisher Copyright:
© 2021 T. Tirer, J. Bruna & R. Giryes.
PY - 2021
Y1 - 2021
N2 - A major factor in the success of deep neural networks is the use of sophisticated architectures rather than the classical multilayer perceptron (MLP). Residual networks (ResNets) stand out among these powerful modern architectures. Previous works focused on the optimization advantages of deep ResNets over deep MLPs. In this paper, we show another distinction between the two models, namely, a tendency of ResNets to promote smoother interpolations than MLPs. We analyze this phenomenon via the neural tangent kernel (NTK) approach. First, we compute the NTK for the considered ResNet model and prove its stability during gradient descent training. Then, we show via various evaluation methodologies that, for ReLU activations, the NTK of the ResNet, and its kernel regression results, are smoother than those of the MLP. The better smoothness observed in our analysis may explain the better generalization ability of ResNets and the practice of moderately attenuating the residual blocks.
AB - A major factor in the success of deep neural networks is the use of sophisticated architectures rather than the classical multilayer perceptron (MLP). Residual networks (ResNets) stand out among these powerful modern architectures. Previous works focused on the optimization advantages of deep ResNets over deep MLPs. In this paper, we show another distinction between the two models, namely, a tendency of ResNets to promote smoother interpolations than MLPs. We analyze this phenomenon via the neural tangent kernel (NTK) approach. First, we compute the NTK for the considered ResNet model and prove its stability during gradient descent training. Then, we show via various evaluation methodologies that, for ReLU activations, the NTK of the ResNet, and its kernel regression results, are smoother than those of the MLP. The better smoothness observed in our analysis may explain the better generalization ability of ResNets and the practice of moderately attenuating the residual blocks.
KW - kernel methods
KW - multilayer perceptron
KW - neural tangent kernel
KW - residual networks
UR - http://www.scopus.com/inward/record.url?scp=85164028779&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85164028779&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85164028779
SN - 2640-3498
VL - 145
SP - 921
EP - 954
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 2nd Mathematical and Scientific Machine Learning Conference, MSML 2021
Y2 - 16 August 2021 through 19 August 2021
ER -