Machine Learning, Volume 49, Issue 2, pp. 209–232

Near-Optimal Reinforcement Learning in Polynomial Time

  • Michael Kearns
  • Satinder Singh

DOI: 10.1023/A:1017984413808

Cite this article as:
Kearns, M. & Singh, S. Machine Learning (2002) 49: 209. doi:10.1023/A:1017984413808


Abstract

We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration–exploitation trade-off.
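The explicit handling of exploration versus exploitation can be illustrated with a simplified sketch in the spirit of the paper's E^3 ("Explicit Explore or Exploit") algorithm. The toy MDP, the "known-state" threshold M_KNOWN, and the optimistic value assigned to unknown states below are illustrative assumptions, not the paper's exact construction:

```python
import random

# Sketch: learn an unknown MDP by marking states "known" once all their
# actions have been tried enough, planning optimistically over the
# empirical model, and doing "balanced wandering" in unknown states.
random.seed(0)

N_STATES, N_ACTIONS = 5, 2
M_KNOWN = 10        # visits per (state, action) before a state is "known"
GAMMA = 0.9         # discount factor

# Hidden true MDP (deterministic, drawn once); the learner only sees samples.
T = {(s, a): random.randrange(N_STATES)
     for s in range(N_STATES) for a in range(N_ACTIONS)}
R = {(s, a): random.random()
     for s in range(N_STATES) for a in range(N_ACTIONS)}

counts = {sa: 0 for sa in T}        # visit counts per (state, action)
reward_sum = {sa: 0.0 for sa in T}  # accumulated observed reward
next_seen = {}                      # empirical transition model

def known(s):
    return all(counts[(s, a)] >= M_KNOWN for a in range(N_ACTIONS))

def estimated_q(s, a, V):
    return reward_sum[(s, a)] / counts[(s, a)] + GAMMA * V[next_seen[(s, a)]]

def plan():
    # Value iteration on the empirical model; unknown states get the maximum
    # possible value, so the greedy policy is drawn toward exploring them.
    V = [0.0] * N_STATES
    for _ in range(60):
        V = [max(estimated_q(s, a, V) for a in range(N_ACTIONS)) if known(s)
             else 1.0 / (1.0 - GAMMA)
             for s in range(N_STATES)]
    return V

def choose_action(s, V):
    if not known(s):
        # Balanced wandering: in an unknown state, try the least-tried action.
        return min(range(N_ACTIONS), key=lambda a: counts[(s, a)])
    return max(range(N_ACTIONS), key=lambda a: estimated_q(s, a, V))

s = 0
for _ in range(1000):
    V = plan()
    a = choose_action(s, V)
    counts[(s, a)] += 1
    reward_sum[(s, a)] += R[(s, a)]
    next_seen[(s, a)] = T[(s, a)]
    s = T[(s, a)]

print(sum(known(st) for st in range(N_STATES)), "of", N_STATES, "states are known")
```

The key structural point mirrored from the paper is that exploration is explicit rather than incidental: the agent either exploits a model it provably knows well, or deliberately acts to make an unknown state known.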

Keywords: reinforcement learning · Markov decision processes · exploration versus exploitation

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Michael Kearns (1)
  • Satinder Singh (2)

  1. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
  2. Syntek Capital, New York, USA
