Layer-Wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12347)


Conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. This has been well explored theoretically for linear models. We extend this analysis to deep neural networks (DNNs) in order to investigate their learning dynamics. To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently. Such an analysis is theoretically supported under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) can stabilize training, but can sometimes create the false impression of a local minimum, which has detrimental effects on learning. In addition, we experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Finally, we find that the last linear layer of a very deep residual network displays ill-conditioned behavior. We solve this problem by adding only one BN layer before the last linear layer, which achieves improved performance over the original and pre-activation residual networks.
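
As a rough illustration of the linear-model case mentioned above (not the layer-wise method proposed in the paper), the following sketch computes the condition number of the curvature matrix of a least-squares objective, H = X^T X / N, before and after standardizing the input features. The data, the feature scales, and the helper names `condition_number` and `standardize` are assumptions made for this example only; standardization is used here as a simple stand-in for normalization.

```python
import numpy as np

def condition_number(X):
    """Condition number of H = X^T X / N, the curvature (Hessian) of the
    least-squares loss (1/2N) * ||Xw - y||^2 with respect to w."""
    H = X.T @ X / X.shape[0]
    eigvals = np.linalg.eigvalsh(H)  # real eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]

def standardize(X, eps=1e-8):
    """Shift each feature to zero mean and scale it to unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

rng = np.random.default_rng(0)

# Hypothetical inputs: three features with a nonzero mean and very
# different scales, which makes the curvature matrix ill-conditioned.
scales = np.array([1.0, 10.0, 100.0])
X = rng.normal(loc=5.0, scale=scales, size=(1000, 3))

kappa_raw = condition_number(X)
kappa_std = condition_number(standardize(X))

print(f"condition number, raw inputs:          {kappa_raw:.1f}")
print(f"condition number, standardized inputs: {kappa_std:.2f}")
```

In this toy setting the standardized inputs give a condition number close to 1, while the raw inputs give a far larger one. A larger condition number generally means gradient descent needs a smaller step size relative to the flattest direction and therefore converges more slowly.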


Keywords: Conditioning analysis, Normalization, Residual network




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
  2. Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
