Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7700)

Abstract

The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks and explains why they work.
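As a concrete illustration of the kind of trick discussed in the full chapter, the sketch below shows three well-known recommendations in Python: normalizing each input variable to zero mean and unit variance, using the scaled hyperbolic tangent activation 1.7159 tanh(2a/3), and initializing weights with standard deviation 1/sqrt(fan-in). This is a minimal sketch, not code from the chapter; the function names and the small constant added to the standard deviation are illustrative assumptions.

```python
import numpy as np

def normalize_inputs(X):
    """Shift each input variable to zero mean and scale it to unit variance,
    one of the preprocessing tricks recommended for faster convergence."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8  # small constant avoids division by zero on constant inputs
    return (X - mean) / std

def scaled_tanh(a):
    """Scaled hyperbolic tangent f(a) = 1.7159 * tanh(2a/3); its output is
    close to +/- 1 at a = +/- 1, matching normalized targets."""
    return 1.7159 * np.tanh(2.0 * a / 3.0)

def init_weights(fan_in, fan_out, rng=None):
    """Draw initial weights with standard deviation 1/sqrt(fan_in), so that
    units start out operating in the linear region of the sigmoid."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```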

Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
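One practical alternative of the kind advocated here is a stochastic diagonal, Levenberg-Marquardt-style method that keeps only a running estimate of the diagonal of the Hessian and uses it to set per-parameter learning rates. The sketch below illustrates the idea; the parameter names, the running-average constant gamma, and the way the per-example curvature estimate is supplied are assumptions made for illustration, not the chapter's exact procedure.

```python
import numpy as np

def sdlm_step(w, grad, diag_curv_sample, diag_curv_avg, eta=0.01, mu=0.1, gamma=0.01):
    """One stochastic diagonal second-order update (illustrative sketch).

    diag_curv_sample : per-example estimate of the diagonal of the Hessian
                       (e.g. a Gauss-Newton term), same shape as w
    diag_curv_avg    : running average of that diagonal, updated online
    mu               : damping term that keeps the step bounded where the
                       estimated curvature is close to zero
    """
    # Update the running estimate of the diagonal curvature.
    diag_curv_avg = (1.0 - gamma) * diag_curv_avg + gamma * diag_curv_sample
    # Per-parameter step sizes: larger steps along low-curvature directions.
    step = eta / (diag_curv_avg + mu)
    return w - step * grad, diag_curv_avg
```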

Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN 978-3-540-65311-0 (1998).

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R. (2012). Efficient BackProp. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_3

  • DOI: https://doi.org/10.1007/978-3-642-35289-8_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35288-1

  • Online ISBN: 978-3-642-35289-8

  • eBook Packages: Computer Science (R0)
