Active Learning of MDP Models

  • Mauricio Araya-López
  • Olivier Buffet
  • Vincent Thomas
  • François Charpillet
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7188)

Abstract

We consider the active learning problem of inferring the transition model of a Markov Decision Process by acting and observing transitions. This is particularly useful when no reward function is defined a priori. Our proposal is to cast the active learning task as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards. After presenting three possible performance criteria, we derive from them the belief-dependent rewards to be used in the decision-making process. As computing the optimal Bayesian value function is intractable for large horizons, we use a simple algorithm to approximately solve this optimization problem. Despite the sub-optimality of this technique, we show experimentally that our proposal is efficient in a number of domains.
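Since the abstract does not detail the three performance criteria or the approximate solver, the following Python sketch only illustrates the general setting it describes: a Dirichlet belief over each state-action transition distribution, updated from observed transitions, with a myopic information-gain reward standing in for one possible belief-dependent reward. The names (`DirichletBelief`, `myopic_info_gain`, `choose_action`) and the entropy-reduction criterion are illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy.stats import dirichlet  # differential entropy of a Dirichlet belief

# Illustrative sketch, not the paper's algorithm: we assume an independent
# Dirichlet belief over T(.|s,a) for every (state, action) pair and use the
# expected one-step entropy reduction as a belief-dependent reward.

class DirichletBelief:
    """Independent Dirichlet belief over next-state distributions."""
    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[s, a, s'] are Dirichlet concentration parameters (counts + prior).
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        # Bayesian update: observing (s, a, s') increments the matching count.
        self.alpha[s, a, s_next] += 1.0

    def entropy(self, s, a):
        return dirichlet.entropy(self.alpha[s, a])

    def mean(self, s, a):
        counts = self.alpha[s, a]
        return counts / counts.sum()


def myopic_info_gain(belief, s, a):
    """Expected one-step entropy reduction of the Dirichlet for (s, a)."""
    h_now = belief.entropy(s, a)
    gain = 0.0
    for s_next, p in enumerate(belief.mean(s, a)):
        alpha_post = belief.alpha[s, a].copy()
        alpha_post[s_next] += 1.0
        gain += p * (h_now - dirichlet.entropy(alpha_post))
    return gain


def choose_action(belief, s, n_actions):
    """Greedy action with respect to the myopic belief-dependent reward."""
    return int(np.argmax([myopic_info_gain(belief, s, a) for a in range(n_actions)]))
```

A full Bayes-adaptive treatment would optimize such belief-dependent rewards over a longer horizon; the greedy one-step rule above is only a stand-in for the (approximate) optimization discussed in the paper.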

Keywords

Performance Criterion, Reinforcement Learning, Markov Decision Process, Reward Function, Dirichlet Distribution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Mauricio Araya-López (1)
  • Olivier Buffet (1)
  • Vincent Thomas (1)
  • François Charpillet (1)

  1. Nancy Université / INRIA LORIA, Vandoeuvre-lès-Nancy Cedex, France