Learning to Learn Using Gradient Descent

  • Sepp Hochreiter
  • A. Steven Younger
  • Peter R. Conwell
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2130)


This paper introduces the application of gradient descent methods to meta-learning. The concept of “meta-learning”, i.e., of a system that improves or discovers a learning algorithm, has been of interest in machine learning for decades because of its appealing applications. Previous meta-learning approaches have been based on evolutionary methods and have therefore been restricted to small models with few free parameters. We make meta-learning in large systems feasible by using recurrent neural networks, together with their attendant learning routines, as meta-learning systems. Our system derived complex, well-performing learning algorithms from scratch. We also show that our approach can perform non-stationary time series prediction.
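The abstract does not spell out how a recurrent network is presented with a learning problem, so the following is only a minimal sketch under stated assumptions, not the authors' setup. It assumes that each training sequence comes from one task (here a random Boolean function, a hypothetical task family) and that at step t the network receives the current input together with the previous step's target, so that feedback about the task arrives through the input stream. The names `sample_boolean_task` and `make_meta_sequence` are illustrative only.

```python
import random


def sample_boolean_task(n_inputs, rng):
    """Draw a random Boolean function as a truth table (hypothetical task family)."""
    table = {}
    for i in range(2 ** n_inputs):
        bits = tuple((i >> k) & 1 for k in range(n_inputs))
        table[bits] = rng.randint(0, 1)
    return table


def make_meta_sequence(table, length, rng):
    """Build one input/target sequence for a recurrent meta-learner.

    Assumption (not stated in the abstract): the input at step t is the
    current example x_t concatenated with the previous target y_{t-1},
    and the network is asked to predict y_t.
    """
    keys = sorted(table)
    inputs, targets = [], []
    prev_target = 0  # no feedback is available before the first step
    for _ in range(length):
        x = rng.choice(keys)
        y = table[x]
        inputs.append(list(x) + [prev_target])
        targets.append(y)
        prev_target = y
    return inputs, targets


rng = random.Random(0)
task = sample_boolean_task(2, rng)
inputs, targets = make_meta_sequence(task, 5, rng)
for step_input, step_target in zip(inputs, targets):
    print(step_input, "->", step_target)
```

Under this reading, a recurrent network would be trained by gradient descent over many such sequences, each drawn from a different task, so that adapting to a new task would have to happen in the network's internal state rather than through further weight changes.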


Keywords: Hidden Layer, Learning Algorithm, Boolean Function, Gradient Descent, Turing Machine
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Sepp Hochreiter (1)
  • A. Steven Younger (1)
  • Peter R. Conwell (2)
  1. Department of Computer Science, University of Colorado, Boulder
  2. Physics Department, Westminster College, Salt Lake City
