Efficient BackProp

Part of the Lecture Notes in Computer Science book series (LNCS, volume 1524)


The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
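As a toy illustration of the convergence behavior the paper analyzes (this sketch is not from the paper itself), consider plain gradient descent on a 2-D quadratic loss whose Hessian is ill-conditioned. The learning rate is bounded by the largest Hessian eigenvalue, so the direction with the smallest eigenvalue converges slowly:

```python
import numpy as np

# Toy sketch (not the paper's method): gradient descent on the quadratic
# loss E(w) = 0.5 * w^T H w, whose Hessian H has very different eigenvalues.
# Stability requires eta < 2 / lambda_max, so the shallow direction
# (small eigenvalue) is forced to converge slowly.

H = np.diag([1.0, 100.0])   # eigenvalues 1 and 100 -> condition number 100
eta = 1.9 / 100.0           # just under the stability limit 2 / lambda_max

w = np.array([1.0, 1.0])
for _ in range(200):
    w = w - eta * (H @ w)   # gradient of 0.5 * w^T H w is H w

# The stiff direction (eigenvalue 100) has essentially converged, while the
# shallow direction (eigenvalue 1) still retains noticeable error.
print(w)
```

This mismatch between directions is exactly what motivates the second-order and trick-based remedies the paper discusses.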


Keywords: Conjugate Gradient, Learning Rate, Neural Information Processing System, Newton Algorithm, Handwritten Digit




Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  1. Image Processing Research Department, AT&T Labs - Research, Red Bank, USA
  2. Willamette University, Salem, USA
  3. GMD FIRST, Berlin, Germany
