On the Worst-Case Analysis of Temporal-Difference Learning Algorithms

  • Robert E. Schapire
  • Manfred K. Warmuth


We study the behavior of a family of learning algorithms based on Sutton’s method of temporal differences. In our on-line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will be received in the future. In this setting, we are able to prove general upper bounds on the performance of a slightly modified version of Sutton’s so-called TD(λ) algorithm. These bounds are stated in terms of the performance of the best linear predictor on the given training sequence, and are proved without making any statistical assumptions of any kind about the process producing the learner’s observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem, and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement.
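As a rough illustration of the setting described above, the following Python sketch computes the discounted sum of future reinforcements and runs the textbook form of TD(λ) with a linear predictor. It is not the slightly modified algorithm analyzed in the paper, whose details the abstract does not give. The names `discounted_returns`, `td_lambda`, `gamma`, `lam`, and `eta` are illustrative assumptions, as are the accumulating trace and the convention that the prediction after the final trial is zero.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(ui * vi for ui, vi in zip(u, v))


def discounted_returns(rewards, gamma):
    """Return y_t = sum_{k>=0} gamma**k * r_{t+k} for a finite reward sequence."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_lambda(xs, rewards, gamma, lam, eta):
    """Textbook TD(lambda) with a linear predictor (assumed form, not the paper's variant).

    xs      -- list of feature vectors, one per trial
    rewards -- reinforcement received after each trial's prediction
    gamma   -- discount factor
    lam     -- trace-decay parameter lambda
    eta     -- learning rate
    """
    n = len(xs[0])
    w = [0.0] * n          # weight vector of the linear predictor
    trace = [0.0] * n      # eligibility trace
    predictions = []
    for t, (x, r) in enumerate(zip(xs, rewards)):
        y_hat = dot(w, x)  # prediction made before the reinforcement is seen
        predictions.append(y_hat)
        # Accumulate the current features into the decayed trace.
        trace = [gamma * lam * e + xi for e, xi in zip(trace, x)]
        # Assumption: the value after the last trial is taken to be zero.
        next_value = dot(w, xs[t + 1]) if t + 1 < len(xs) else 0.0
        delta = r + gamma * next_value - y_hat  # temporal-difference error
        w = [wi + eta * delta * ei for wi, ei in zip(w, trace)]
    return w, predictions


if __name__ == "__main__":
    xs = [[1.0], [1.0], [1.0]]
    rewards = [1.0, 1.0, 1.0]
    print(discounted_returns(rewards, 0.5))
    print(td_lambda(xs, rewards, gamma=0.5, lam=0.0, eta=0.5))
```

The targets produced by `discounted_returns` are what the learner tries to estimate on-line; a worst-case analysis of the kind the abstract describes would compare the learner's total loss against these targets with that of the best fixed linear predictor on the same sequence.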


Keywords: machine learning, temporal-difference learning, on-line learning, worst-case analysis




  1. Nicolò Cesa-Bianchi, Philip M. Long, & Manfred K. Warmuth. (1993). Worst-case quadratic loss bounds for a generalization of the Widrow-Hoff rule. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 429–438.
  2. Peter Dayan. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3/4):341–362.
  3. Peter Dayan & Terrence J. Sejnowski. (1994). TD(λ) converges with probability 1. Machine Learning, 14(3):295–301.
  4. Roger A. Horn & Charles R. Johnson. (1985). Matrix Analysis. Cambridge University Press.
  5. Tommi Jaakkola, Michael I. Jordan, & Satinder P. Singh. (1993). On the convergence of stochastic iterative dynamic programming algorithms. Technical Report 9307, MIT Computational Cognitive Science.
  6. Jyrki Kivinen & Manfred K. Warmuth. (1994). Additive versus exponentiated gradient updates for learning linear functions. Technical Report UCSC-CRL-94-16, University of California Santa Cruz, Computer Research Laboratory.
  7. Richard S. Sutton. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44.
  8. C. J. C. H. Watkins. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England.

Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Robert E. Schapire (1)
  • Manfred K. Warmuth (2)
  1. AT&T Bell Laboratories, Murray Hill
  2. Computer and Information Sciences, University of California, Santa Cruz
