Abstract
Q(λ)-learning uses TD(λ) methods to accelerate Q-learning. The update complexity of previous online Q(λ) implementations based on lookup tables is bounded by the size of the state/action space. Our faster algorithm's update complexity is bounded by the number of actions. The method is based on the observation that Q-value updates may be postponed until they are needed.
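To make the lazy-update idea concrete, the following is a minimal illustrative sketch of tabular Q(λ) with postponed updates, not the paper's exact algorithm: it uses replacing traces, omits the Watkins-style trace cut on exploratory actions, and resets its global accumulators only at episode end. All names (FastQLambdaSketch, _catch_up, trace_floor, and so on) are illustrative. Each visited state/action pair stores its trace together with the values of two global accumulators at its last synchronization; a Q-value absorbs all pending trace-weighted TD errors only when it is actually read.

# Lazy ("postponed") Q(lambda) updates over a lookup table -- illustrative sketch.
class FastQLambdaSketch:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, lam=0.9,
                 trace_floor=1e-8):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.trace_floor = trace_floor
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        # Per traced pair: (trace at last sync, Phi at last sync, C at last sync).
        self.trace = {}
        # Global accumulators: Phi = (gamma*lam)^t, C = sum over t of delta_t * Phi_t.
        self.Phi, self.C = 1.0, 0.0

    def _catch_up(self, s, a):
        """Apply all postponed trace-weighted updates to Q(s, a)."""
        info = self.trace.get((s, a))
        if info is None:
            return
        e_sync, phi_sync, c_sync = info
        # Equals sum of alpha * delta_tau * e_sync * (gamma*lam)^(tau - sync) over
        # all steps tau since the last synchronization of this pair.
        self.q[s][a] += self.alpha * e_sync * (self.C - c_sync) / phi_sync
        e_now = e_sync * self.Phi / phi_sync  # trace decayed to the current step
        if e_now < self.trace_floor:
            del self.trace[(s, a)]            # negligible trace: drop bookkeeping
        else:
            self.trace[(s, a)] = (e_now, self.Phi, self.C)

    def update(self, s, a, r, s_next, done):
        """One experience step; per-step cost is O(n_actions), not O(|S||A|)."""
        # Only the entries we actually read are brought up to date.
        self._catch_up(s, a)
        for b in range(self.n_actions):
            self._catch_up(s_next, b)
        q_next = 0.0 if done else max(self.q[s_next])
        delta = r + self.gamma * q_next - self.q[s][a]
        # Replacing trace for (s, a), time-stamped by the current accumulators.
        self.trace[(s, a)] = (1.0, self.Phi, self.C)
        # Fold delta into the global accumulator; every traced pair absorbs its
        # share the next time it is caught up.
        self.C += delta * self.Phi
        self.Phi *= self.gamma * self.lam  # very long episodes would need renormalization
        if done:
            self._flush()

    def _flush(self):
        """Episode end: settle all pending updates and reset the accumulators."""
        for key in list(self.trace):
            self._catch_up(*key)
        self.trace.clear()
        self.Phi, self.C = 1.0, 0.0

In this sketch each update touches at most the number of actions plus one table entries, which is how the per-step cost stays bounded by the number of actions rather than by the size of the state/action space.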
Cite this article
Wiering, M., Schmidhuber, J. Fast Online Q(λ). Machine Learning 33, 105–115 (1998). https://doi.org/10.1023/A:1007562800292