
Efficient BackProp

  • Yann A. LeCun
  • Léon Bottou
  • Genevieve B. Orr
  • Klaus-Robert Müller
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7700)

Abstract

The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.
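
The tricks themselves are not spelled out in the abstract; as a purely illustrative sketch, assuming two widely cited practical choices (input standardization and per-example stochastic updates over shuffled data) that are not taken from the text above, a minimal backprop training loop in Python/NumPy might look like this:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data. Standardizing inputs to zero mean and unit variance
    # is assumed here for illustration as one commonly recommended trick for
    # speeding up backprop convergence; the chapter itself is the reference.
    X = rng.normal(loc=5.0, scale=2.0, size=(256, 4))
    y = np.sin(X.sum(axis=1, keepdims=True))
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # One hidden layer of tanh units, trained by per-example (stochastic) backprop.
    n_in, n_hid, n_out = 4, 8, 1
    W1 = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_hid))
    b1 = np.zeros(n_hid)
    W2 = rng.normal(scale=1.0 / np.sqrt(n_hid), size=(n_hid, n_out))
    b2 = np.zeros(n_out)
    lr = 0.05

    for epoch in range(50):
        for i in rng.permutation(len(X)):      # shuffle examples every epoch
            x, t = X[i], y[i]
            h = np.tanh(x @ W1 + b1)           # forward pass
            out = h @ W2 + b2
            err = out - t                      # gradient of 0.5 * squared error
            dW2 = np.outer(h, err)             # backward pass
            db2 = err
            dh = (W2 @ err) * (1.0 - h ** 2)   # tanh derivative
            dW1 = np.outer(x, dh)
            db1 = dh
            W2 -= lr * dW2; b2 -= lr * db2     # stochastic gradient step
            W1 -= lr * dW1; b1 -= lr * db1

The exact recipes (preprocessing, learning rates, nonlinearities, initialization) should be taken from the chapter itself rather than from this sketch.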

Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
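
The impracticality claim is essentially a cost argument: a dense Hessian for a network with n weights requires O(n^2) storage and roughly O(n^3) work to invert, whereas diagonal or matrix-free curvature approximations stay O(n). The specific methods proposed in the chapter are not reproduced here; the short Python sketch below, with an arbitrary assumed parameter count, only illustrates the storage gap:

    # Why "classical" second-order methods do not scale to large networks:
    # a dense n x n Hessian needs O(n^2) memory and O(n^3) time to invert,
    # while a diagonal approximation needs only O(n).
    n_params = 1_000_000                      # assumed, illustrative network size
    full_hessian_bytes = n_params ** 2 * 8    # dense float64 Hessian
    diag_hessian_bytes = n_params * 8         # diagonal approximation
    print(f"dense Hessian   : {full_hessian_bytes / 1e12:.1f} TB")
    print(f"diagonal Hessian: {diag_hessian_bytes / 1e6:.1f} MB")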

Keywords

Conjugate Gradient · Gradient Descent · Neural Information Processing System · Newton Algorithm · Handwritten Digit

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yann A. LeCun (1)
  • Léon Bottou (1)
  • Genevieve B. Orr (2)
  • Klaus-Robert Müller (3)

  1. Image Processing Research Department, AT&T Labs - Research, Red Bank, USA
  2. Willamette University, Salem, USA
  3. GMD FIRST, Berlin, Germany
