Machine Learning, Volume 49, Issue 2–3, pp. 209–232

Near-Optimal Reinforcement Learning in Polynomial Time

  • Michael Kearns
  • Satinder Singh


We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

Keywords: reinforcement learning, Markov decision processes, exploration versus exploitation
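The abstract's key idea, a model-based agent that explicitly manages the exploration-exploitation trade-off by partitioning states into "known" and "unknown", can be illustrated with a minimal sketch. This is not the authors' E^3 algorithm itself: the function name, the parameters (`m_known`, the reward bound `r_max`), and the toy environment interface are all illustrative, and the sketch replaces E^3's explicit explore-vs-exploit test with an optimistic-value simplification in the spirit of later "R-max"-style methods. It does, however, show the two behaviors the paper describes: balanced wandering in unknown states, and planning on the empirical model once states become known.

```python
import random
from collections import defaultdict

def run_known_state_agent(step_fn, n_states, n_actions,
                          m_known=10, n_steps=3000, gamma=0.9, seed=0):
    """Sketch of a 'known-state' model-based RL loop (hypothetical API).

    step_fn(s, a, rng) samples (next_state, reward) from the environment.
    A state is 'known' once every action has been tried m_known times.
    Unknown states: balanced wandering (take the least-tried action).
    Known states: act greedily under value iteration on the empirical
    model, with unknown states assigned an optimistic value, which
    draws the agent toward them when further exploration pays off.
    """
    rng = random.Random(seed)
    count = [[0] * n_actions for _ in range(n_states)]           # visit counts
    rsum = [[0.0] * n_actions for _ in range(n_states)]          # reward sums
    tcount = [[defaultdict(int) for _ in range(n_actions)]
              for _ in range(n_states)]                          # transition counts
    r_max = 1.0  # assumed bound on per-step reward
    total_reward, s = 0.0, 0

    def known(state):
        return min(count[state]) >= m_known

    def q_value(st, a, V):
        # Empirical one-step backup for a known state-action pair.
        n = count[st][a]
        q = rsum[st][a] / n
        for nxt, c in tcount[st][a].items():
            q += gamma * (c / n) * V[nxt]
        return q

    def plan():
        # Value iteration on the empirical model; unknown states get the
        # optimistic value r_max / (1 - gamma).
        V = [r_max / (1.0 - gamma)] * n_states
        for _ in range(60):
            V = [max(q_value(st, a, V) for a in range(n_actions))
                 if known(st) else V[st] for st in range(n_states)]
        return V

    for _ in range(n_steps):
        if known(s):
            V = plan()
            a = max(range(n_actions), key=lambda x: q_value(s, x, V))
        else:
            a = min(range(n_actions), key=lambda x: count[s][x])  # balanced wandering
        nxt, r = step_fn(s, a, rng)
        count[s][a] += 1
        rsum[s][a] += r
        tcount[s][a][nxt] += 1
        total_reward += r
        s = nxt
    return total_reward
```

On a small deterministic MDP (e.g. two states where only one state-action pair pays off), the agent wanders until all pairs have been sampled `m_known` times, then settles into the reward-maximizing policy; the paper's contribution is proving that, with the explore/exploit decision made explicitly, near-optimal return is reached in time polynomial in the mixing (or horizon) time T and the numbers of states and actions.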



Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Michael Kearns, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
  • Satinder Singh, Syntek Capital, New York, USA
