Reducing reinforcement learning to KWIK online regression


DOI: 10.1007/s10472-010-9201-2

Cite this article as:
Li, L. & Littman, M.L. Ann Math Artif Intell (2010) 58: 217. doi:10.1007/s10472-010-9201-2


One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large Markov decision processes (MDPs) where compact function approximation has to be used. This paper introduces REKWIRE, a provably efficient, model-free algorithm for finite-horizon RL problems with value function approximation (VFA) that addresses the exploration-exploitation tradeoff in a principled way. The crucial element of this algorithm is a reduction of RL to online regression in the recently proposed KWIK learning model. We show that, if the KWIK online regression problem can be solved efficiently, then the sample complexity of exploration of REKWIRE is polynomial. Therefore, the reduction suggests a new and sound direction to tackle general RL problems. The efficiency of our algorithm is verified on a set of proof-of-concept experiments where popular, ad hoc exploration approaches fail.
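To make the KWIK ("Knows What It Knows") protocol concrete: a KWIK learner must, on each input, either give an accurate prediction or explicitly answer "I don't know", and the number of "I don't know" answers must stay bounded. The sketch below is a toy illustration of that protocol only, assuming a deterministic target over a finite input set. It is not REKWIRE and not the regression learner used in the paper. The names `MemorizationKWIK` and `run_protocol` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


class MemorizationKWIK:
    """Toy KWIK learner (assumption: deterministic target, finite inputs)."""

    def __init__(self) -> None:
        self._seen: Dict[Hashable, float] = {}
        self.unknown_count = 0  # number of "I don't know" answers so far

    def predict(self, x: Hashable) -> Optional[float]:
        """Return the remembered value, or None to signal "I don't know"."""
        if x in self._seen:
            return self._seen[x]
        self.unknown_count += 1
        return None

    def observe(self, x: Hashable, y: float) -> None:
        """Record the true output revealed after an "I don't know"."""
        self._seen[x] = y


def run_protocol(
    learner: MemorizationKWIK,
    inputs: Iterable[Hashable],
    target: Callable[[Hashable], float],
) -> List[Optional[float]]:
    """Drive the KWIK loop: the label is revealed only when the learner abstains."""
    predictions: List[Optional[float]] = []
    for x in inputs:
        y_hat = learner.predict(x)
        if y_hat is None:
            learner.observe(x, target(x))
        predictions.append(y_hat)
    return predictions
```

In this toy setting the learner abstains at most once per distinct input, so the number of "I don't know" answers is bounded by the size of the input set. A regression learner would instead need a bound that depends on the hypothesis class and the accuracy parameters.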


Reinforcement learning; Exploration; PAC-MDP; Knows What It Knows (KWIK); Online regression; Value function approximation



Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Yahoo! Research, Santa Clara, USA
  2. Rutgers Laboratory for Real-Life Reinforcement Learning (RL3), Department of Computer Science, Rutgers University, Piscataway, USA
