Q(λ)-learning uses TD(λ)-methods to accelerate Q-learning. The update complexity of previous online Q(λ) implementations based on lookup tables is bounded by the size of the state/action space. Our faster algorithm's update complexity is bounded by the number of actions. The method is based on the observation that Q-value updates may be postponed until they are needed.
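The postponement idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a tabular, Peng-and-Williams-style Q(λ) update with accumulating traces and γλ > 0, and every name in it (`LazyQLambda`, `_sync`, `phi`, `Delta`, `seen`, `naive_q_lambda`) is hypothetical. Instead of touching every state/action pair at each step, the sketch keeps a global running sum of discounted TD errors and brings a table entry up to date only when it is read, so each step touches only the actions of the current and next state. A naive full-sweep version is included for comparison.

```python
import random


class LazyQLambda:
    """Tabular Q(lambda) with postponed updates (illustrative sketch only)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, lam=0.8):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.Q = [[0.0] * n_actions for _ in range(n_states)]
        self.trace = [[0.0] * n_actions for _ in range(n_states)]  # trace / phi
        self.seen = [[0.0] * n_actions for _ in range(n_states)]   # Delta at last sync
        self.phi = 1.0    # (gamma * lam) ** t; assumes gamma * lam > 0
        self.Delta = 0.0  # running sum of e_t * phi_t

    def _sync(self, s, a):
        # Apply every update postponed since this entry was last touched.
        self.Q[s][a] += self.alpha * (self.Delta - self.seen[s][a]) * self.trace[s][a]
        self.seen[s][a] = self.Delta

    def q_values(self, s):
        # Reading a state costs one sync per action, not one per table entry.
        for a in range(self.n_actions):
            self._sync(s, a)
        return self.Q[s]

    def step(self, s, a, r, s_next):
        q_s = self.q_values(s)
        v_next = max(self.q_values(s_next))
        e_prime = r + self.gamma * v_next - q_s[a]  # error for the visited pair
        e = r + self.gamma * v_next - max(q_s)      # error propagated along traces
        self.phi *= self.gamma * self.lam
        self.Delta += e * self.phi
        self._sync(s, a)
        self.Q[s][a] += self.alpha * e_prime
        # phi shrinks geometrically; a long-running version would need to
        # rescale the stored quantities occasionally, which is omitted here.
        self.trace[s][a] += 1.0 / self.phi


def naive_q_lambda(transitions, n_states, n_actions, alpha=0.1, gamma=0.9, lam=0.8):
    """Reference version that sweeps the whole table at every step."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    trace = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in transitions:
        v_next = max(Q[s_next])
        e_prime = r + gamma * v_next - Q[s][a]
        e = r + gamma * v_next - max(Q[s])
        for i in range(n_states):
            for j in range(n_actions):
                trace[i][j] *= gamma * lam
                Q[i][j] += alpha * e * trace[i][j]
        Q[s][a] += alpha * e_prime
        trace[s][a] += 1.0
    return Q


# Usage: feed the same random trajectory to both versions.
rng = random.Random(0)
state, transitions = 0, []
for _ in range(40):
    nxt = rng.randrange(4)
    transitions.append((state, rng.randrange(2), rng.random(), nxt))
    state = nxt

agent = LazyQLambda(n_states=4, n_actions=2)
for t in transitions:
    agent.step(*t)
reference = naive_q_lambda(transitions, n_states=4, n_actions=2)
```

Reading `agent.q_values(s)` for each state should agree with `reference[s]` up to floating-point rounding, while each call to `step` performs a number of syncs proportional to the number of actions rather than to the size of the table.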