Training Deep Neural Networks

Chapter in Neural Networks and Deep Learning

Abstract

The procedure for training neural networks with backpropagation is briefly introduced in Chapter 1. This chapter will expand on the description in Chapter 1 in several ways.

"I hated every minute of training, but I said, 'Don't quit. Suffer now and live the rest of your life as a champion.'" – Muhammad Ali


Notes

  1. Although the backpropagation algorithm was popularized by the Rumelhart et al. papers [408, 409], it had been studied earlier in the context of control theory. Crucially, Paul Werbos's forgotten (and eventually rediscovered) 1974 thesis discussed how these backpropagation methods could be used in neural networks. This was well before Rumelhart et al.'s papers in 1986, which were nevertheless significant because the style of presentation contributed to a better understanding of why backpropagation might work.

  2. A different type of manifestation occurs when the parameters in earlier and later layers are shared. In such cases, the effect of an update can be highly unpredictable because of the combined effect of different layers. These scenarios occur in recurrent neural networks, in which the parameters in later temporal layers are tied to those in earlier temporal layers. Small changes in the parameters can then cause large changes in the loss function in very localized regions, without any gradient-based indication in nearby regions. Such topological characteristics of the loss function are referred to as cliffs (cf. Section 3.5.4), and they make the problem harder to optimize because gradient descent tends to either overshoot or undershoot (a small numerical sketch of this effect appears after these notes).

  3. In most of this book, we have worked with \(\overline{W}\) as a row-vector. However, it is notationally convenient here to work with \(\overline{W}\) as a column-vector.
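
The cliff effect described in the second note can be made concrete with a few lines of code. The sketch below is our own illustration (the recurrence, constants, and learning rate are assumptions, not taken from the chapter): a scalar recurrence h_t = w * h_{t-1} whose single weight w is shared across T = 50 time steps produces the output w^T * h_0, so the loss is nearly flat for w < 1 and explodes once w exceeds 1. A plain gradient step taken just past the cliff overshoots wildly, while clipping the gradient magnitude, a common remedy for cliffs, keeps the update controlled.

    # Minimal sketch (own illustration): a scalar recurrence h_t = w * h_{t-1}
    # with the weight w tied across T time steps yields the output w**T * h_0.
    # The loss is nearly flat for w < 1 and explodes for w > 1: a cliff near w = 1.
    import numpy as np

    T, h0, y = 50, 1.0, 0.0                # 50 tied time steps, target output 0

    def loss_and_grad(w):
        out = (w ** T) * h0                # unrolled recurrence with a shared weight
        loss = 0.5 * (out - y) ** 2
        grad = (out - y) * T * w ** (T - 1) * h0   # dL/dw via the chain rule
        return loss, grad

    w, lr = 1.05, 0.01                     # start just past the edge of the cliff
    loss, g = loss_and_grad(w)
    print("loss at w=1.05:", loss)         # roughly 66: still modest
    print("grad at w=1.05:", g)            # roughly 6.3e3: the cliff gives a huge gradient
    print("plain update   -> w =", w - lr * g)            # overshoots to about w = -61.6

    g_clipped = np.clip(g, -1.0, 1.0)      # clip the gradient magnitude to 1
    print("clipped update -> w =", w - lr * g_clipped)    # stays near w = 1.04

With the same learning rate, the unclipped step jumps from w = 1.05 to roughly w = -61.6, whereas the clipped step moves only to about w = 1.04; this is exactly the overshooting behavior near cliffs described above.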

Bibliography

  1. R. Ahuja, T. Magnanti, and J. Orlin. Network flows: Theory, algorithms, and applications. Prentice Hall, 1993.
  2. J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS Conference, pp. 2654–2662, 2014.
  3. J. Ba, J. Kiros, and G. Hinton. Layer normalization. arXiv:1607.06450, 2016. https://arxiv.org/abs/1607.06450
  4. M. Bazaraa, H. Sherali, and C. Shetty. Nonlinear programming: theory and algorithms. John Wiley and Sons, 2013.
  5. S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second order methods. Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37, 1988.
  6. J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. NIPS Conference, pp. 2546–2554, 2011.
  7. J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, pp. 281–305, 2012.
  8. J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML Conference, pp. 115–123, 2013.
  9. D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
  10. C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
  11. C. M. Bishop. Bayesian Techniques. Chapter 10 in "Neural Networks for Pattern Recognition," pp. 385–439, 1995.
  12. A. Bryson. A gradient method for optimizing multi-stage allocation processes. Harvard University Symposium on Digital Computers and their Applications, 1961.
  13. C. Bucilu, R. Caruana, and A. Niculescu-Mizil. Model compression. ACM KDD Conference, pp. 535–541, 2006.
  14. W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. ICML Conference, pp. 2285–2294, 2015.
  15. A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. ICML Conference, pp. 1337–1345, 2013.
  16. T. Cooijmans, N. Ballas, C. Laurent, C. Gulcehre, and A. Courville. Recurrent batch normalization. arXiv:1603.09025, 2016. https://arxiv.org/abs/1603.09025
  17. Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS Conference, pp. 2933–2941, 2014.
  18. J. Dean et al. Large scale distributed deep networks. NIPS Conference, 2012.
  19. M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. NIPS Conference, pp. 2148–2156, 2013.
  20. G. Desjardins, K. Simonyan, and R. Pascanu. Natural neural networks. NIPS Conference, pp. 2071–2079, 2015.
  21. T. Dettmers. 8-bit approximations for parallelism in deep learning. arXiv:1511.04561, 2015. https://arxiv.org/abs/1511.04561
  22. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, pp. 2121–2159, 2011.
  23. H. Gavin. The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems, 2011. http://people.duke.edu/~hpgavin/ce281/lm.pdf
  24. X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, pp. 249–256, 2010.
  25. X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. AISTATS, 15(106), 2011.
  26. I. Goodfellow, O. Vinyals, and A. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014. [Also appears in International Conference on Learning Representations, 2015] https://arxiv.org/abs/1412.6544
  27. I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
  28. R. Hahnloser and H. S. Seung. Permitted and forbidden sets in symmetric threshold-linear networks. NIPS Conference, pp. 217–223, 2001.
  29. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally. EIE: Efficient Inference Engine for Compressed Neural Network. ACM SIGARCH Computer Architecture News, 44(3), pp. 243–254, 2016.
  30. S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. NIPS Conference, pp. 1135–1143, 2015.
  31. M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. ICML Conference, pp. 1225–1234, 2016.
  32. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
  33. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  34. M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 1952.
  35. G. Hinton. Neural networks for machine learning, Coursera Video, 2012.
  36. G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. NIPS Workshop, 2014.
  37. R. Hochberg. Matrix Multiplication with CUDA: A basic introduction to the CUDA programming model. Unpublished manuscript, 2012. http://www.shodor.org/media/content/petascale/materials/UPModules/matrixMultiplication/moduleDocument.pdf
  38. S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
  39. F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv:1602.07360, 2016. https://arxiv.org/abs/1602.07360
  40. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
  41. R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4), pp. 295–307, 1988.
  42. K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? International Conference on Computer Vision (ICCV), 2009.
  43. H. J. Kelley. Gradient theory of optimal flight paths. Ars Journal, 30(10), pp. 947–954, 1960.
  44. D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014. https://arxiv.org/abs/1412.6980
  45. A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014. https://arxiv.org/abs/1404.5997
  46. A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. NIPS Conference, pp. 1097–1105, 2012.
  47. Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Ng. On optimization methods for deep learning. ICML Conference, pp. 265–272, 2011.
  48. Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller (eds.) Neural Networks: Tricks of the Trade, Springer, 1998.
  49. Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. NIPS Conference, pp. 598–605, 1990.
  50. D. Luenberger and Y. Ye. Linear and nonlinear programming. Addison-Wesley, 1984.
  51. D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), pp. 448–472, 1992.
  52. J. Martens. Deep learning via Hessian-free optimization. ICML Conference, pp. 735–742, 2010.
  53. J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. ICML Conference, pp. 1033–1040, 2011.
  54. J. Martens, I. Sutskever, and K. Swersky. Estimating the Hessian by back-propagating curvature. arXiv:1206.6464, 2016. https://arxiv.org/abs/1206.6464
  55. J. Martens and R. Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. ICML Conference, 2015.
  56. T. Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012.
  57. M. Minsky and S. Papert. Perceptrons. An Introduction to Computational Geometry, MIT Press, 1969.
  58. Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27, pp. 372–376, 1983.
  59. J. Nocedal and S. Wright. Numerical optimization. Springer, 2006.
  60. G. Orr and K.-R. Müller (editors). Neural Networks: Tricks of the Trade, Springer, 1998.
  61. R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML Conference, 28, pp. 1310–1318, 2013.
  62. R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
  63. E. Polak. Computational methods in optimization: a unified approach. Academic Press, 1971.
  64. B. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), pp. 838–855, 1992.
  65. D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323(6088), pp. 533–536, 1986.
  66. D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by back-propagating errors. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pp. 318–362, 1986.
  67. T. Salimans and D. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NIPS Conference, pp. 901–909, 2016.
  68. A. Saxe, J. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
  69. T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. ICML Conference, pp. 343–351, 2013.
  70. J. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical Report, CMU-CS-94-125, Carnegie-Mellon University, 1994.
  71. J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. NIPS Conference, pp. 2951–2959, 2013.
  72. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. ICML Conference, pp. 1139–1147, 2013.
  73. C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. ACM KDD Conference, pp. 847–855, 2013.
  74. P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
  75. P. Werbos. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting (Vol. 1). John Wiley and Sons, 1994.
  76. S. Wieseler and H. Ney. A convergence analysis of log-linear training. NIPS Conference, pp. 657–665, 2011.
  77. O. Yadan, K. Adams, Y. Taigman, and M. Ranzato. Multi-GPU training of convnets. arXiv:1312.5853, 2013. https://arxiv.org/abs/1312.5853
  78. H. Yu and B. Wilamowski. Levenberg–Marquardt training. Industrial Electronics Handbook, 5(12), 1, 2011.
  79. M. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012. https://arxiv.org/abs/1212.5701
  80. http://caffe.berkeleyvision.org/
  81. http://torch.ch/
  82. http://deeplearning.net/software/theano/
  83. https://www.tensorflow.org/
  84. http://jaberg.github.io/hyperopt/
  85. http://www.cs.ubc.ca/labs/beta/Projects/SMAC/
  86. https://github.com/JasperSnoek/spearmint
  87. https://developer.nvidia.com/cudnn
  88. http://www.nvidia.com/object/machine-learning.html
  89. https://developer.nvidia.com/deep-learning-frameworks


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Aggarwal, C.C. (2018). Training Deep Neural Networks. In: Neural Networks and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-94463-0_3


  • DOI: https://doi.org/10.1007/978-3-319-94463-0_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94462-3

  • Online ISBN: 978-3-319-94463-0

  • eBook Packages: Computer Science (R0)
