Bayesian Reinforcement Learning

Chapter

Part of the Adaptation, Learning, and Optimization book series (ALO, volume 12)

Abstract

This chapter surveys recent lines of work that use Bayesian techniques for reinforcement learning. In Bayesian learning, uncertainty is expressed by a prior distribution over unknown parameters, and learning is achieved by computing a posterior distribution based on the observed data. Hence, Bayesian reinforcement learning distinguishes itself from other forms of reinforcement learning by explicitly maintaining a distribution over various quantities such as the parameters of the model, the value function, the policy, or its gradient. This yields several benefits: (a) domain knowledge can be naturally encoded in the prior distribution to speed up learning; (b) the exploration/exploitation tradeoff can be naturally optimized; and (c) notions of risk can be naturally taken into account to obtain robust policies.
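
To make the model-based flavor of this idea concrete, the sketch below illustrates posterior-sampling (Thompson sampling) reinforcement learning in a small discrete MDP: Dirichlet priors are placed over the unknown transition probabilities, the posterior is updated from observed transitions, and the agent acts greedily with respect to a model drawn from the posterior, so exploration arises from posterior uncertainty rather than from injected noise. The toy MDP, array shapes, and function names are illustrative assumptions, not code from the chapter.

```python
import numpy as np

# Minimal sketch of model-based Bayesian RL via posterior (Thompson) sampling.
# Illustrative assumptions: a small discrete MDP with known rewards, independent
# Dirichlet priors over each row of the transition model, and value iteration as
# the planner. The toy MDP below exists only to generate experience.

n_states, n_actions, gamma = 5, 2, 0.95
rng = np.random.default_rng(0)

true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hidden dynamics
R = rng.uniform(size=(n_states, n_actions))                            # known rewards

# Prior: Dirichlet pseudo-counts over next states for every (state, action) pair.
alpha = np.ones((n_states, n_actions, n_states))

def plan(P, R, gamma, n_iters=100):
    """Value iteration in the model (P, R); returns a greedy deterministic policy."""
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * P @ V   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

s = 0
for step in range(1000):
    # Draw one plausible model from the current posterior and plan in it:
    # exploration comes from posterior uncertainty, not from added noise.
    P_sample = np.array([[rng.dirichlet(alpha[i, a]) for a in range(n_actions)]
                         for i in range(n_states)])
    policy = plan(P_sample, R, gamma)

    # Act, observe a transition from the (hidden) true dynamics, update the posterior.
    a = policy[s]
    s_next = rng.choice(n_states, p=true_P[s, a])
    alpha[s, a, s_next] += 1.0   # conjugate Dirichlet count update
    s = s_next
```

In practice the sampled model would typically be redrawn once per episode rather than at every step, and informative Dirichlet pseudo-counts are where prior domain knowledge would be encoded.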

Keywords

  • Reinforcement Learning
  • Markov Decision Process
  • Neural Information Processing System
  • Transfer Learning
  • Policy Parameter

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Author information

Corresponding author

Correspondence to Nikos Vlassis.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Vlassis, N., Ghavamzadeh, M., Mannor, S., Poupart, P. (2012). Bayesian Reinforcement Learning. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_11

  • DOI: https://doi.org/10.1007/978-3-642-27645-3_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27644-6

  • Online ISBN: 978-3-642-27645-3

  • eBook Packages: Engineering (R0)