Abstract
Conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. This has been well explored theoretically for linear models. We extend this analysis to deep neural networks (DNNs) in order to investigate their learning dynamics. To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently. Such an analysis is theoretically supported under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) can stabilize training, but sometimes results in the false impression of a local minimum, which has detrimental effects on learning. Furthermore, we experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Finally, we find that the last linear layer of a very deep residual network displays ill-conditioned behavior. We solve this problem by adding only one BN layer before the last linear layer, which achieves improved performance over the original and pre-activation residual networks.
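As a rough illustration of the architectural fix described above, the following PyTorch-style sketch inserts a single BN layer before the final linear classifier of a standard torchvision ResNet. This is not the authors' exact setup: the model choice (`resnet18`), the 10-class output, and the input size are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Minimal sketch: take a standard torchvision ResNet and add one 1-D BN layer
# between the globally pooled features and the last linear (classification) layer.
model = resnet18(num_classes=10)  # assumed 10-class setting, for illustration

# `model.fc` is the final linear layer; wrap it so the pooled features are
# batch-normalized right before the classifier sees them.
model.fc = nn.Sequential(
    nn.BatchNorm1d(model.fc.in_features),  # the single extra BN layer
    model.fc,
)

# Quick shape check on random inputs: the modified network still yields logits.
x = torch.randn(4, 3, 224, 224)
print(model(x).shape)  # torch.Size([4, 10])
```

Note that torchvision's ResNets differ in detail from the architectures evaluated in the paper, so this snippet only illustrates where the extra BN layer is placed.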
Notes
- 1. We evaluate the general condition number with respect to the percentage \(p\): \(\kappa_{p}=\frac{\lambda_{max}}{\lambda_{p}}\), where \(\lambda_{p}\) is the \(pd\)-th eigenvalue (in descending order) and \(d\) is the total number of eigenvalues; e.g., \(\kappa_{100\%}\) is the original definition of the condition number (see the sketch after these notes).
- 2. We also perform SGD with a batch size of 1024, and further perform experiments on convolutional neural networks (CNNs) on CIFAR-10 and ImageNet. The results are shown in the supplementary material, in which we make the same observation as with full gradient descent.
- 3. The large magnitude of \(\lambda_{\varSigma_{\mathbf{x}}}\) is caused mainly by the addition of multiple residual connections from previous layers with ReLU outputs.
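The generalized condition number from note 1 can be computed directly from an eigenvalue spectrum. Below is a minimal NumPy sketch under our own assumptions: the symmetric matrix stands in for a layer-wise curvature or covariance matrix (not the paper's actual measurements), and the helper name is ours.

```python
import numpy as np

def generalized_condition_number(mat, p=1.0):
    """kappa_p = lambda_max / lambda_{ceil(p*d)}, eigenvalues in descending order.

    `mat` is assumed symmetric positive definite (e.g. the covariance of a
    layer's inputs or output-gradients); p is a percentage in (0, 1].
    """
    eigvals = np.sort(np.linalg.eigvalsh(mat))[::-1]  # sort descending
    d = eigvals.size
    idx = max(int(np.ceil(p * d)) - 1, 0)             # the (p*d)-th eigenvalue, 1-indexed
    return float(eigvals[0] / eigvals[idx])

# Toy example: covariance of random "layer inputs" x with d = 50 features,
# scaled so the spectrum is spread out.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 50)) * np.linspace(1.0, 10.0, 50)
cov = x.T @ x / x.shape[0]

print(generalized_condition_number(cov, p=1.0))  # kappa_100%: classic lambda_max / lambda_min
print(generalized_condition_number(cov, p=0.5))  # kappa_50%: ignores the smallest half of the spectrum
```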