A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning

Discrete Event Dynamic Systems

Abstract

The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
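
To make the abstract concrete, below is a minimal sketch of one natural reading of such an update, not the paper's actual algorithm: a linear approximation Phi @ theta is nudged toward noisy samples of the operator F applied to the current approximation, with each innovation preconditioned by an inverse feature-covariance (Kalman-like) gain and scaled by a diminishing Robbins–Monro step size. The Markov chain, basis functions, constants, and variable names are all assumptions made for illustration; here F is taken to be a discounted policy-evaluation (Bellman) operator, so the sampled targets coincide with one-step temporal-difference targets.

```python
# Illustrative sketch only (not the paper's exact algorithm): a Kalman-filter-style gain
# preconditions a Robbins-Monro update of a linear approximation Phi @ theta toward noisy
# samples of a contraction F applied to the current approximation.  All problem data and
# constants below are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(0)

# Small ergodic Markov chain: F(J) = g + alpha * P @ J is a contraction for alpha < 1.
n_states, k, alpha = 20, 4, 0.9
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transition matrix
g = rng.random(n_states)                     # one-step costs
Phi = rng.standard_normal((n_states, k))     # basis functions; approximation is Phi @ theta

theta = np.zeros(k)
C = np.zeros((k, k))                         # running estimate of E[phi phi^T]
x = 0
for t in range(1, 100_001):
    phi = Phi[x]
    x_next = rng.choice(n_states, p=P[x])    # simulate one transition of the chain
    y = g[x] + alpha * Phi[x_next] @ theta   # noisy sample of (F(Phi @ theta))(x)
    C += (np.outer(phi, phi) - C) / t        # update the feature-covariance estimate
    gain = np.linalg.solve(C + 1e-3 * np.eye(k), phi)   # Kalman-like preconditioned direction
    gamma = 10.0 / (10.0 + t)                # diminishing Robbins-Monro step size
    theta += gamma * gain * (y - phi @ theta)
    x = x_next

# Compare against the fixed point of the projected operator (the TD(0) solution),
# using the chain's stationary distribution as the projection weights.
d = np.ones(n_states) / n_states
for _ in range(1000):
    d = d @ P
D = np.diag(d)
A = Phi.T @ D @ (np.eye(n_states) - alpha * P) @ Phi
b = Phi.T @ D @ g
theta_star = np.linalg.solve(A, b)
print("parameter error:", np.linalg.norm(theta - theta_star))
```

In this sketch the adaptive gain plays the role of the Kalman filter's covariance update, preconditioning each temporal-difference innovation; the paper's actual algorithm, assumptions, and convergence analysis are given in the full text.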



Author information

Correspondence to David Choi.

Additional information

This research was supported in part by NSF CAREER Grant ECS-9985229, and by the ONR under Grant MURI N00014-00-1-0637.


About this article

Cite this article

Choi, D., Van Roy, B. A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning. Discrete Event Dyn Syst 16, 207–239 (2006). https://doi.org/10.1007/s10626-006-8134-8

