A Framework for Computing Bounds for the Return of a Policy

  • Cosmin Păduraru
  • Doina Precup
  • Joelle Pineau
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7188)

Abstract

We present a framework for computing bounds for the return of a policy in finite-horizon, continuous-state Markov Decision Processes with bounded state transitions. The state transition bounds can be based on either prior knowledge alone, or on a combination of prior knowledge and data. Our framework uses a piecewise-constant representation of the return bounds and a backwards iteration process. We instantiate this framework for a previously investigated type of prior knowledge – namely, Lipschitz continuity of the transition function. In this context, we show that the existing bounds of Fonteneau et al. (2009, 2010) can be expressed as a particular instantiation of our framework, by bounding the immediate rewards using Lipschitz continuity and choosing a particular form for the regions in the piecewise-constant representation. We also show how different instantiations of our framework can improve upon their bounds.
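As a concrete illustration of the backwards iteration described in the abstract, the following Python sketch computes a Fonteneau-style lower bound on the return from a start state, given a sample of one-step transitions observed under the evaluated policy. It assumes deterministic Lipschitz dynamics and rewards, Euclidean distances, and known Lipschitz constants `L_f` and `L_rho`; the function names, data layout, and exact penalty form are illustrative and not taken from the paper.

```python
import numpy as np


def lipschitz_return_lower_bound(x0, transitions, horizon, L_f, L_rho):
    """Backwards iteration giving a lower bound on the finite-horizon return
    from x0, using one-step transitions (x_j, r_j, x'_j) observed under the
    evaluated policy and assumed Lipschitz constants L_f (dynamics) and
    L_rho (rewards). Illustrative sketch only."""
    xs = [np.asarray(x, dtype=float) for (x, _, _) in transitions]
    rs = [float(r) for (_, r, _) in transitions]
    xs_next = [np.asarray(xn, dtype=float) for (_, _, xn) in transitions]
    n = len(transitions)

    # Lipschitz constant of the t-step return bound, built recursively:
    # L_val[t] = L_rho + L_f * L_val[t-1], with L_val[0] = 0.
    L_val = [0.0]
    for _ in range(horizon):
        L_val.append(L_rho + L_f * L_val[-1])

    def backup(x, next_bounds, steps_left):
        # One backwards step: each sampled transition j yields a candidate
        # bound r_j + B_{t-1}(x'_j), penalised in proportion to the distance
        # between x and the sampled start state x_j.
        best = -np.inf
        for j in range(n):
            penalty = L_val[steps_left] * np.linalg.norm(x - xs[j])
            best = max(best, rs[j] + next_bounds[j] - penalty)
        return best

    # B[j]: lower bound on the return from the sampled successor state x'_j
    # with t steps remaining; the zero-step bound is 0.
    B = [0.0] * n
    for t in range(1, horizon):
        B = [backup(xs_next[j], B, t) for j in range(n)]

    return backup(np.asarray(x0, dtype=float), B, horizon)
```

A piecewise-constant instantiation of the framework would hold such bounds constant over regions of the state space rather than recomputing them pointwise; the sketch above corresponds to evaluating the bound at individual states.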

References

  1. Brunskill, E., Leffler, B., Li, L., Littman, M., Roy, N.: CORL: A continuous-state offset-dynamics reinforcement learner. In: Proceedings of the International Conference on Uncertainty in Artificial Intelligence, pp. 53–61 (2008)
  2. Delage, E., Mannor, S.: Percentile Optimization for Markov Decision Processes with Parameter Uncertainty. Operations Research 58(1), 203–213 (2009)
  3. Ermon, S., Conrad, J., Gomes, C., Selman, B.: Playing games against nature: optimal policies for renewable resource allocation. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (2010)
  4. Fonteneau, R., Murphy, S., Wehenkel, L., Ernst, D.: Inferring bounds on the performance of a control policy from a sample of trajectories. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 117–123 (2009)
  5. Fonteneau, R., Murphy, S.A., Wehenkel, L., Ernst, D.: Towards Min Max Generalization in Reinforcement Learning. In: Filipe, J., Fred, A., Sharp, B. (eds.) ICAART 2010. CCIS, vol. 129, pp. 61–77. Springer, Heidelberg (2011)
  6. Kaelbling, L.P.: Learning in Embedded Systems. MIT Press (1993)
  7. Kakade, S., Kearns, M., Langford, J.: Exploration in Metric State Spaces. In: International Conference on Machine Learning, vol. 20, p. 306 (2003)
  8. Nilim, A., El Ghaoui, L.: Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research 53(5), 780–798 (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Cosmin Păduraru (1)
  • Doina Precup (1)
  • Joelle Pineau (1)

  1. School of Computer Science, McGill University, Montreal, Canada
