Training Deep Neural Networks

Aggarwal, Charu C.

doi:10.1007/978-3-319-94463-0_3

Charu C. Aggarwal²

430k Accesses
18 Citations

Abstract

The procedure for training neural networks with backpropagation is briefly introduced in Chapter 1 This chapter will expand on the description on Chapter 1 in several ways

“I hated every minute of training, but I said, ‘Don’t quit. Suffer now and live the rest of your life as a champion.”—Muhammad Ali

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Although the backpropagation algorithm was popularized by the Rumelhart et al. papers [408, 409], it had been studied earlier in the context of control theory. Crucially, Paul Werbos’s forgotten (and eventually rediscovered) thesis in 1974 discussed how these backpropagation methods could be used in neural networks. This was well before Rumelhart et al.’s papers in 1986, which were nevertheless significant because the style of presentation contributed to a better understanding of why backpropagation might work.
2.
A different type of manifestation occurs in cases where the parameters in earlier and later layers are shared. In such cases, the effect of an update can be highly unpredictable because of the combined effect of different layers. Such scenarios occur in recurrent neural networks in which the parameters in later temporal layers are tied to those of earlier temporal layers. In such cases, small changes in the parameters can cause large changes in the loss function in very localized regions without any gradient-based indication in nearby regions. Such topological characteristics of the loss function are referred to as cliffs (cf. Section 3.5.4), and they make the problem harder to optimize because the gradient descent tends to either overshoot or undershoot.
3.
In most of this book, we have worked with \(\overline{W}\) as a row-vector. However, it is notationally convenient here to work with \(\overline{W}\) as a column-vector.

Bibliography

R. Ahuja, T. Magnanti, and J. Orlin. Network flows: Theory, algorithms, and applications. Prentice Hall, 1993.
Google Scholar
J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS Conference, pp. 2654–2662, 2014.
Google Scholar
J. Ba, J. Kiros, and G. Hinton. Layer normalization. arXiv:1607.06450, 2016.https://arxiv.org/abs/1607.06450
M. Bazaraa, H. Sherali, and C. Shetty. Nonlinear programming: theory and algorithms. John Wiley and Sons, 2013.
Google Scholar
S. Becker, and Y. LeCun. Improving the convergence of back-propagation learning with second order methods. Proceedings of the 1988 connectionist models summer school, pp. 29–37, 1988.
Google Scholar
J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. NIPS Conference, pp. 2546–2554, 2011.
Google Scholar
J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, pp. 281–305, 2012.
MathSciNet MATH Google Scholar
J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML Confererence, pp. 115–123, 2013.
Google Scholar
D. Bertsekas. Nonlinear programming Athena Scientific, 1999.
Google Scholar
C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
Google Scholar
C. M. Bishop. Bayesian Techniques. Chapter 10 in “Neural Networks for Pattern Recognition,” pp. 385–439, 1995.
Google Scholar
A. Bryson. A gradient method for optimizing multi-stage allocation processes. Harvard University Symposium on Digital Computers and their Applications, 1961.
Google Scholar
C. Bucilu, R. Caruana, and A. Niculescu-Mizil. Model compression. ACM KDD Conference, pp. 535–541, 2006.
Google Scholar
W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. ICML Confererence, pp. 2285–2294, 2015.
Google Scholar
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. ICML Confererence, pp. 1337–1345, 2013.
Google Scholar
T. Cooijmans, N. Ballas, C. Laurent, C. Gulcehre, and A. Courville. Recurrent batch normalization. arXiv:1603.09025, 2016.https://arxiv.org/abs/1603.09025
Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS Conference, pp. 2933–2941, 2014.
Google Scholar
J. Dean et al. Large scale distributed deep networks. NIPS Conference, 2012.
Google Scholar
M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. NIPS Conference, pp. 2148–2156, 2013.
Google Scholar
G. Desjardins, K. Simonyan, and R. Pascanu. Natural neural networks. NIPS Congference, pp. 2071–2079, 2015.
Google Scholar
T. Dettmers. 8-bit approximations for parallelism in deep learning. arXiv:1511.04561, 2015.https://arxiv.org/abs/1511.04561
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, pp. 2121–2159, 2011.
MathSciNet MATH Google Scholar
H. Gavin. The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems, 2011.http://people.duke.edu/~hpgavin/ce281/lm.pdf
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, pp. 249–256, 2010.
Google Scholar
X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. AISTATS, 15(106), 2011.
Google Scholar
I. Goodfellow, O. Vinyals, and A. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014. [Also appears in International Conference in Learning Representations, 2015]https://arxiv.org/abs/1412.6544
I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
Google Scholar
R. Hahnloser and H. S. Seung. Permitted and forbidden sets in symmetric threshold-linear networks. NIPS Conference, pp. 217–223, 2001.
Google Scholar
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally. EIE: Efficient Inference Engine for Compressed Neural Network. ACM SIGARCH Computer Architecture News, 44(3), pp. 243–254, 2016.
Article Google Scholar
S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. NIPS Conference, pp. 1135–1143, 2015.
Google Scholar
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. ICML Confererence, pp. 1225–1234, 2006.
Google Scholar
K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
Google Scholar
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Google Scholar
M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 1952.
Google Scholar
G. Hinton. Neural networks for machine learning, Coursera Video, 2012.
Google Scholar
G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. NIPS Workshop, 2014.
Google Scholar
R. Hochberg. Matrix Multiplication with CUDA: A basic introduction to the CUDA programming model. Unpublished manuscript, 2012. http://www.shodor.org/media/content/petascale/materials/UPModules/matrixMultiplication/moduleDocument.pdf
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
Google Scholar
F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv:1602.07360, 2016.https://arxiv.org/abs/1602.07360
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
Google Scholar
R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4), pp. 295–307, 1988.
Article Google Scholar
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? International Conference on Computer Vision (ICCV), 2009.
Google Scholar
H. J. Kelley. Gradient theory of optimal flight paths. Ars Journal, 30(10), pp. 947–954, 1960.
Article Google Scholar
D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.https://arxiv.org/abs/1412.6980
A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.https://arxiv.org/abs/1404.5997
A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. NIPS Conference, pp. 1097–1105. 2012.
Google Scholar
Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Ng, On optimization methods for deep learning. ICML Conference, pp. 265–272, 2011.
Google Scholar
Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. in G. Orr and K. Muller (eds.) Neural Networks: Tricks of the Trade, Springer, 1998.
Google Scholar
Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. NIPS Conference, pp. 598–605, 1990.
Google Scholar
D. Luenberger and Y. Ye. Linear and nonlinear programming, Addison-Wesley, 1984.
Google Scholar
D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), pp. 448–472, 1992.
Article Google Scholar
J. Martens. Deep learning via Hessian-free optimization. ICML Conference, pp. 735–742, 2010.
Google Scholar
J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. ICML Conference, pp. 1033–1040, 2011.
Google Scholar
J. Martens, I. Sutskever, and K. Swersky. Estimating the hessian by back-propagating curvature. arXiv:1206.6464, 2016.https://arxiv.org/abs/1206.6464
J. Martens and R. Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. ICML Conference, 2015.
Google Scholar
T. Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012.
Google Scholar
M. Minsky and S. Papert. Perceptrons. An Introduction to Computational Geometry, MIT Press, 1969.
Google Scholar
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1∕k ²). Soviet Mathematics Doklady, 27, pp. 372–376, 1983.
MATH Google Scholar
J. Nocedal and S. Wright. Numerical optimization. Springer, 2006.
Google Scholar
G. Orr and K.-R. Müller (editors). Neural Networks: Tricks of the Trade, Springer, 1998.
Google Scholar
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML Conference, 28, pp. 1310–1318, 2013.
Google Scholar
R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
Google Scholar
E. Polak. Computational methods in optimization: a unified approach. Academic Press, 1971.
Google Scholar
B. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), pp. 838–855, 1992.
Article MathSciNet Google Scholar
D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323 (6088), pp. 533–536, 1986.
Article Google Scholar
D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by back-propagating errors. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pp. 318–362, 1986.
Google Scholar
T. Salimans and D. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NIPS Conference, pp. 901–909, 2016.
Google Scholar
A. Saxe, J. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
Google Scholar
T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. ICML Confererence, pp. 343–351, 2013.
Google Scholar
J. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical Report, CMU-CS-94-125, Carnegie-Mellon University, 1994.
Google Scholar
J. Snoek, H. Larochelle, and R. Adams. Practical bayesian optimization of machine learning algorithms. NIPS Conference, pp. 2951–2959, 2013.
Google Scholar
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. ICML Confererence, pp. 1139–1147, 2013.
Google Scholar
C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. ACM KDD Conference, pp. 847–855, 2013.
Google Scholar
P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
Google Scholar
P. Werbos. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting (Vol. 1). John Wiley and Sons, 1994.
Google Scholar
S. Wieseler and H. Ney. A convergence analysis of log-linear training. NIPS Conference, pp. 657–665, 2011.
Google Scholar
O. Yadan, K. Adams, Y. Taigman, and M. Ranzato. Multi-gpu training of convnets. arXiv:1312.5853, 2013.https://arxiv.org/abs/1312.5853
H. Yu and B. Wilamowski. Levenberg–Marquardt training. Industrial Electronics Handbook, 5(12), 1, 2011.
Google Scholar
M. Zeiler. ADADELTA: an adaptive learning rate method. arXiv:1212.5701, 2012.https://arxiv.org/abs/1212.5701
http://caffe.berkeleyvision.org/
http://torch.ch/
http://deeplearning.net/software/theano/
https://www.tensorflow.org/
http://jaberg.github.io/hyperopt/
http://www.cs.ubc.ca/labs/beta/Projects/SMAC/
https://github.com/JasperSnoek/spearmint
https://developer.nvidia.com/cudnn
http://www.nvidia.com/object/machine-learning.html
https://developer.nvidia.com/deep-learning-frameworks

Download references

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, International Business Machines, Yorktown Heights, NY, USA
Charu C. Aggarwal

Authors

Charu C. Aggarwal
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Training Deep Neural Networks. In: Neural Networks and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-94463-0_3

Download citation

DOI: https://doi.org/10.1007/978-3-319-94463-0_3
Published: 26 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94462-3
Online ISBN: 978-3-319-94463-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics