Effects of Skip-Connection in ResNet and Batch-Normalization on Fisher Information Matrix
Deep neural networks such as the multi-layer perceptron (MLP) have been studied intensively, and new techniques have been introduced to improve generalization and speed up convergence. Two such techniques are the skip-connections between layers in the ResNet and batch normalization (BN). To clarify the effects of these techniques, we carried out a landscape analysis of the loss functions of these networks. The landscape affects the convergence properties, in which the eigenvalues of the Fisher information matrix (FIM) play an important role. We therefore calculated the eigenvalues of the FIMs of the MLP, the ResNet, and the ResNet with BN by applying functional analysis to networks with random weights; the MLP case was analyzed previously in the asymptotic regime using the central limit theorem. Our results show that the eigenvalues of the MLP are independent of its depth, that those of the ResNet grow exponentially with its depth, and that those of the ResNet with BN grow sub-linearly with its depth. These results imply that BN allows the ResNet to use a larger learning rate and hence to converge faster than the vanilla ResNet.
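The quantity analyzed above can be illustrated numerically. The following is a minimal numpy sketch, not the paper's derivation, of the empirical FIM of a small random ReLU MLP with a scalar output under squared-error loss, where the FIM reduces to the second moment of the output gradient, F = E_x[∇θf(x) ∇θf(x)ᵀ]. All layer sizes, the sample count, and the initialization scheme (He initialization) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random MLP: input -> hidden (ReLU) -> scalar output.
d_in, d_h, n_samples = 10, 20, 200
W1 = rng.normal(0.0, np.sqrt(2.0 / d_in), (d_h, d_in))  # He initialization
w2 = rng.normal(0.0, np.sqrt(1.0 / d_h), d_h)

X = rng.normal(size=(n_samples, d_in))  # illustrative Gaussian inputs

def grad_output(x):
    """Gradient of the scalar network output w.r.t. all weights (W1, w2)."""
    h_pre = W1 @ x
    h = np.maximum(h_pre, 0.0)                 # ReLU activations
    gW1 = np.outer(w2 * (h_pre > 0), x)        # d out / d W1
    return np.concatenate([gW1.ravel(), h])    # d out / d w2 = h

# Empirical FIM for squared-error loss at random weights: F = E_x[g g^T].
G = np.stack([grad_output(x) for x in X])
F = G.T @ G / n_samples

eigs = np.linalg.eigvalsh(F)  # FIM is symmetric PSD; eigenvalues ascend
print("largest eigenvalue:", eigs[-1])
print("mean eigenvalue (trace / #params):", eigs.mean())
```

The largest eigenvalue bounds the usable learning rate (roughly, stable gradient descent needs a step size below 2 divided by the largest eigenvalue), which is why the depth-dependence of the spectrum matters for the comparison between the MLP, the ResNet, and the ResNet with BN.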
Keywords: ResNet · Batch normalization · Fisher Information Matrix
This work was supported by JSPS KAKENHI Grant Numbers JP18J15055 and JP18K19821, and by the NAIST Big Data Project.