A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning

Discrete Event Dynamic Systems

Abstract

The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
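
To make the abstract concrete, below is a minimal sketch of one natural reading of such an update, not the paper's actual algorithm: a linear approximation Phi @ theta is nudged toward noisy samples of the operator F applied to the current approximation, with each innovation preconditioned by an inverse feature-covariance (Kalman-like) gain and scaled by a diminishing Robbins–Monro step size. The Markov chain, basis functions, constants, and variable names are all assumptions made for illustration; here F is taken to be a discounted policy-evaluation (Bellman) operator, so the sampled targets coincide with one-step temporal-difference targets.

```python
# Illustrative sketch only (not the paper's exact algorithm): a Kalman-filter-style gain
# preconditions a Robbins-Monro update of a linear approximation Phi @ theta toward noisy
# samples of a contraction F applied to the current approximation.  All problem data and
# constants below are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(0)

# Small ergodic Markov chain: F(J) = g + alpha * P @ J is a contraction for alpha < 1.
n_states, k, alpha = 20, 4, 0.9
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transition matrix
g = rng.random(n_states)                     # one-step costs
Phi = rng.standard_normal((n_states, k))     # basis functions; approximation is Phi @ theta

theta = np.zeros(k)
C = np.zeros((k, k))                         # running estimate of E[phi phi^T]
x = 0
for t in range(1, 100_001):
    phi = Phi[x]
    x_next = rng.choice(n_states, p=P[x])    # simulate one transition of the chain
    y = g[x] + alpha * Phi[x_next] @ theta   # noisy sample of (F(Phi @ theta))(x)
    C += (np.outer(phi, phi) - C) / t        # update the feature-covariance estimate
    gain = np.linalg.solve(C + 1e-3 * np.eye(k), phi)   # Kalman-like preconditioned direction
    gamma = 10.0 / (10.0 + t)                # diminishing Robbins-Monro step size
    theta += gamma * gain * (y - phi @ theta)
    x = x_next

# Compare against the fixed point of the projected operator (the TD(0) solution),
# using the chain's stationary distribution as the projection weights.
d = np.ones(n_states) / n_states
for _ in range(1000):
    d = d @ P
D = np.diag(d)
A = Phi.T @ D @ (np.eye(n_states) - alpha * P) @ Phi
b = Phi.T @ D @ g
theta_star = np.linalg.solve(A, b)
print("parameter error:", np.linalg.norm(theta - theta_star))
```

In this sketch the adaptive gain plays the role of the Kalman filter's covariance update, preconditioning each temporal-difference innovation; the paper's actual algorithm, assumptions, and convergence analysis are given in the full text.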



Author information

Correspondence to David Choi.

Additional information

This research was supported in part by NSF CAREER Grant ECS-9985229, and by the ONR under Grant MURI N00014-00-1-0637.


About this article

Cite this article

Choi, D., Van Roy, B. A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning. Discrete Event Dyn Syst 16, 207–239 (2006). https://doi.org/10.1007/s10626-006-8134-8

