
Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes

  • Jordi Grau-Moya
  • Felix Leibfried
  • Tim Genewein
  • Daniel A. Braun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9852)

Abstract

Information-theoretic principles for learning and acting have been proposed to solve particular classes of Markov Decision Problems. Mathematically, such approaches are governed by a variational free energy principle and allow solving MDP planning problems with information-processing constraints expressed in terms of a Kullback-Leibler divergence with respect to a reference distribution. Here we consider a generalization of such MDP planners by taking model uncertainty into account. As model uncertainty can also be formalized as an information-processing constraint, we can derive a unified solution from a single generalized variational principle. We provide a generalized value iteration scheme together with a convergence proof. As limit cases, this generalized scheme includes standard value iteration with a known model, Bayesian MDP planning, and robust planning. We demonstrate the benefits of this approach in a grid world simulation.
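To make the KL-constrained planning idea from the abstract concrete, the following is a minimal sketch, not the authors' implementation: it shows only the bounded-rationality limit case of the generalized scheme, namely value iteration with a "soft" (log-sum-exp) Bellman backup in which the policy is penalized by its Kullback-Leibler divergence from a uniform reference distribution. The model-uncertainty term described in the paper is omitted, and all names and parameters (soft_value_iteration, beta, gamma, the toy MDP) are illustrative assumptions.

```python
import numpy as np

def soft_value_iteration(P, R, beta=5.0, gamma=0.95, tol=1e-8, max_iter=10_000):
    """KL-regularized ("free-energy") value iteration for a tabular MDP.

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    Returns the value function V (shape (S,)) and the soft-optimal policy pi (shape (S, A)).
    """
    S, A, _ = P.shape
    log_rho = np.full(A, -np.log(A))      # uniform reference policy, in log-space
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)           # Q[s, a] = R[s, a] + gamma * E_P[V(s')]
        # Free-energy backup: V(s) = (1/beta) * log sum_a rho(a) exp(beta * Q(s, a)),
        # computed with the max-trick for numerical stability.
        m = Q.max(axis=1)
        V_new = m + np.log(np.exp(beta * (Q - m[:, None]) + log_rho).sum(axis=1)) / beta
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Soft-optimal policy: pi(a|s) proportional to rho(a) * exp(beta * Q(s, a)).
    Q = R + gamma * (P @ V)
    logits = beta * Q + log_rho
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi

# Tiny random MDP as a smoke test (all numbers are arbitrary).
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))    # valid transition kernel, shape (S, A, S)
R = rng.uniform(size=(S, A))
V, pi = soft_value_iteration(P, R, beta=10.0)
```

As the inverse temperature beta grows, the log-sum-exp backup approaches the maximum over actions and the recursion reduces to standard value iteration; for small beta the value collapses toward that of the reference policy. This mirrors the limit-case behaviour mentioned in the abstract, while the paper's full scheme adds an analogous information-processing constraint over transition models to handle model uncertainty and robustness.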

Keywords

Bounded rationality · Model uncertainty · Robustness · Planning · Markov decision processes

Notes

Acknowledgments

This study was supported by the DFG, Emmy Noether grant BR4164/1-1. The code was developed on top of the RLPy library [9].

References

  1. Åström, K.J., Wittenmark, B.: Adaptive Control. Courier Corporation, Mineola (2013)
  2.
  3. Bertsekas, D., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
  4. Braun, D.A., Ortega, P.A., Theodorou, E., Schaal, S.: Path integral control and bounded rationality. In: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 202–209. IEEE (2011)
  5. van den Broek, B., Wiegerinck, W., Kappen, H.J.: Risk sensitive path integral control. In: UAI (2010)
  6. Chow, Y., Tamar, A., Mannor, S., Pavone, M.: Risk-sensitive and robust decision-making: a CVaR optimization approach. In: Advances in Neural Information Processing Systems, pp. 1522–1530 (2015)
  7. Duff, M.O.: Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. Ph.D. thesis, University of Massachusetts Amherst (2002)
  8. Fox, R., Pakman, A., Tishby, N.: G-learning: taming the noise in reinforcement learning via soft updates. arXiv preprint (2015). arXiv:1512.08562
  9. Geramifard, A., Dann, C., Klein, R.H., Dabney, W., How, J.P.: RLPy: a value-function-based reinforcement learning framework for education and research. J. Mach. Learn. Res. 16, 1573–1578 (2015)
  10. Guez, A., Silver, D., Dayan, P.: Efficient Bayes-adaptive reinforcement learning using sample-based search. In: Advances in Neural Information Processing Systems, pp. 1025–1033 (2012)
  11. Guez, A., Silver, D., Dayan, P.: Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. J. Artif. Intell. Res. 48, 841–883 (2013)
  12. Hansen, L.P., Sargent, T.J.: Robustness. Princeton University Press, Princeton (2008)
  13. Iyengar, G.N.: Robust dynamic programming. Math. Oper. Res. 30(2), 257–280 (2005)
  14. Kappen, H.J.: Linear theory for control of nonlinear stochastic systems. Phys. Rev. Lett. 95(20), 200201 (2005)
  15. Mannor, S., Simester, D., Sun, P., Tsitsiklis, J.N.: Bias and variance approximation in value function estimates. Manag. Sci. 53(2), 308–322 (2007)
  16. Nilim, A., El Ghaoui, L.: Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. 53(5), 780–798 (2005)
  17. Ortega, P.A., Braun, D.A.: A Bayesian rule for adaptive control based on causal interventions. In: 3rd Conference on Artificial General Intelligence (AGI-2010). Atlantis Press (2010)
  18. Ortega, P.A., Braun, D.A.: A minimum relative entropy principle for learning and acting. J. Artif. Intell. Res. 38(11), 475–511 (2010)
  19. Ortega, P.A., Braun, D.A.: Thermodynamics as a theory of decision-making with information-processing costs. Proc. R. Soc. A 469, 20120683 (2013)
  20. Ortega, P.A., Braun, D.A.: Generalized Thompson sampling for sequential decision-making and causal inference. Complex Adapt. Syst. Model. 2(1), 2 (2014)
  21. Ortega, P.A., Braun, D.A., Tishby, N.: Monte Carlo methods for exact & efficient solution of the generalized optimality equations. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4322–4327. IEEE (2014)
  22. Osogami, T.: Robustness and risk-sensitivity in Markov decision processes. In: Advances in Neural Information Processing Systems, pp. 233–241 (2012)
  23. Peters, J., Mülling, K., Altun, Y.: Relative entropy policy search. In: Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pp. 1607–1612. AAAI Press (2010)
  24. Ross, S., Pineau, J., Chaib-draa, B., Kreitmann, P.: A Bayesian approach for learning and planning in partially observable Markov decision processes. J. Mach. Learn. Res. 12, 1729–1770 (2011)
  25. Rubin, J., Shamir, O., Tishby, N.: Trading value and information in MDPs. In: Guy, T.V., Kárný, M., Wolpert, D.H. (eds.) Decision Making with Imperfect Decision Makers. Intelligent Systems Reference Library, vol. 28, pp. 57–74. Springer, Heidelberg (2012)
  26. Shen, Y., Tobia, M.J., Sommer, T., Obermayer, K.: Risk-sensitive reinforcement learning. Neural Comput. 26(7), 1298–1328 (2014)
  27. Strehl, A.L., Li, L., Littman, M.L.: Reinforcement learning in finite MDPs: PAC analysis. J. Mach. Learn. Res. 10, 2413–2444 (2009)
  28. Szita, I., Lőrincz, A.: The many faces of optimism: a unifying approach. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1048–1055. ACM (2008)
  29. Szita, I., Szepesvári, C.: Model-based reinforcement learning with nearly tight exploration complexity bounds. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038 (2010)
  30. Tishby, N., Polani, D.: Information theory of decisions and actions. In: Cutsuridis, V., Hussain, A., Taylor, J.G. (eds.) Perception-Action Cycle. Springer Series in Cognitive and Neural Systems, pp. 601–636. Springer, New York (2011)
  31. Todorov, E.: Linearly-solvable Markov decision problems. In: Advances in Neural Information Processing Systems, pp. 1369–1376 (2006)
  32. Todorov, E.: Efficient computation of optimal actions. Proc. Nat. Acad. Sci. 106(28), 11478–11483 (2009)
  33. Wiesemann, W., Kuhn, D., Rustem, B.: Robust Markov decision processes. Math. Oper. Res. 38(1), 153–183 (2013)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Jordi Grau-Moya (1, 2, 3)
  • Felix Leibfried (1, 2, 3)
  • Tim Genewein (1, 2, 3)
  • Daniel A. Braun (1, 2)

  1. Max Planck Institute for Intelligent Systems, Tübingen, Germany
  2. Max Planck Institute for Biological Cybernetics, Tübingen, Germany
  3. Graduate Training Centre for Neuroscience, Tübingen, Germany
