Near-Optimal Reinforcement Learning in Polynomial Time
Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.
Keywords: reinforcement learning, Markov decision processes, exploration versus exploitation
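As an illustrative note on the role of the horizon time in the discounted case, the following is a minimal sketch of the standard truncation bound, written in LaTeX. The symbols used here (discount factor $\gamma$, reward bound $R_{\max}$, approximation error $\varepsilon$, truncation point $H$) are illustrative notation only, and the paper's exact definition of the horizon time $T$ may differ in its constants.

% Minimal sketch; assumes rewards lie in [0, R_max] and discount factor gamma in (0,1).
% Uses the elementary inequality ln(1/gamma) >= 1 - gamma.
\[
  \sum_{t \ge H} \gamma^{t} r_t \;\le\; \frac{\gamma^{H} R_{\max}}{1-\gamma} \;\le\; \varepsilon
  \qquad \text{whenever} \qquad
  H \;\ge\; \frac{1}{1-\gamma}\,\ln\!\left(\frac{R_{\max}}{\varepsilon\,(1-\gamma)}\right).
\]

Truncating the discounted return after $H = O\!\big(\tfrac{1}{1-\gamma}\log\tfrac{R_{\max}}{\varepsilon(1-\gamma)}\big)$ steps therefore changes it by at most $\varepsilon$, which is the sense in which the horizon time is the natural scale for the discounted-case bounds described in the abstract.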