
Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7700)

Abstract

The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.
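Among the practical tricks the chapter is known for are normalizing each input variable to zero mean and roughly unit variance, and shuffling the training set so that successive examples are rarely similar. Below is a minimal NumPy sketch of those two preprocessing steps; the function name, the variance floor, and the example data are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def preprocess_inputs(X, seed=0):
    """Illustrative sketch of two tricks discussed in the chapter:
    center and scale each input variable, then shuffle the examples."""
    X = np.asarray(X, dtype=float)
    # Zero mean and approximately unit variance for every input component;
    # the small constant guards against constant (zero-variance) inputs.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    # Present training examples in random order for stochastic learning.
    perm = np.random.default_rng(seed).permutation(len(X))
    return X[perm], perm

# Example usage with hypothetical data:
# X_shuffled, order = preprocess_inputs(X_train)
```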

Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
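The scaling problem is easy to see: a Newton-style update requires forming and solving with the full N x N Hessian, which costs O(N^2) memory and O(N^3) time once N reaches the size of a realistic network, whereas a per-weight (diagonal) curvature estimate keeps the cost linear in N. The sketch below contrasts the two updates; it is a simplified illustration of the idea, not the chapter's exact on-line diagonal Levenberg-Marquardt recipe, and the constant mu is an assumed regularizer.

```python
import numpy as np

def newton_step(grad, hessian):
    """Classical second-order update: solving with the full Hessian is
    O(N^2) in memory and O(N^3) in time, impractical for large networks."""
    return -np.linalg.solve(hessian, grad)

def diagonal_step(grad, hessian_diag, mu=1e-2):
    """Cheaper per-weight scaling in the spirit of the chapter: divide each
    gradient component by an estimate of the corresponding diagonal curvature,
    with a constant mu that bounds the step where curvature is near zero."""
    return -grad / (np.abs(hessian_diag) + mu)
```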

Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN 978-3-540-65311-0 (1998).






Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R. (2012). Efficient BackProp. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_3


  • DOI: https://doi.org/10.1007/978-3-642-35289-8_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35288-1

  • Online ISBN: 978-3-642-35289-8

  • eBook Packages: Computer Science (R0)
