Efficient Methods for Near-Optimal Sequential Decision Making under Uncertainty

  • Christos Dimitrakakis
Part of the Studies in Computational Intelligence book series (SCI, volume 281)


This chapter discusses decision making under uncertainty. More specifically, it offers an overview of efficient Bayesian and distribution-free algorithms for making near-optimal sequential decisions under uncertainty about the environment. Due to the uncertainty, such algorithms must not only learn from their interaction with the environment but also perform as well as possible while learning is taking place.
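The exploration-exploitation trade-off described above can be made concrete with the simplest sequential decision problem, the Bernoulli multi-armed bandit. The sketch below is illustrative only and is not taken from the chapter: it contrasts a distribution-free index policy (UCB1, in the style of Auer, Cesa-Bianchi and Fischer) with a Bayesian posterior-sampling policy (Thompson sampling with Beta(1,1) priors). Function names and the toy bandit are hypothetical.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Distribution-free policy: play each arm once, then choose the arm
    maximising empirical mean + sqrt(2 ln t / n_i). Returns total reward."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

def thompson(pull, n_arms, horizon):
    """Bayesian policy: keep a Beta(a_i, b_i) posterior over each arm's
    success probability, sample a mean from each posterior, and play the
    arm whose sample is largest. Returns total reward."""
    a = [1] * n_arms  # Beta(1, 1) = uniform prior
    b = [1] * n_arms
    total = 0.0
    for _ in range(horizon):
        arm = max(range(n_arms),
                  key=lambda i: random.betavariate(a[i], b[i]))
        r = pull(arm)  # Bernoulli reward in {0, 1}
        a[arm] += r          # posterior update: successes
        b[arm] += 1 - r      # posterior update: failures
        total += r
    return total

if __name__ == "__main__":
    random.seed(0)
    means = [0.2, 0.5, 0.8]  # hypothetical arm success probabilities
    bandit = lambda i: 1 if random.random() < means[i] else 0
    print(ucb1(bandit, 3, 2000), thompson(bandit, 3, 2000))
```

Both policies earn reward while still learning which arm is best, which is exactly the "perform as well as possible while learning" requirement; UCB1 needs no prior, whereas Thompson sampling starts from an explicit belief and updates it by Bayes' rule.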


Keywords: Leaf Node · Reinforcement Learning · Markov Decision Process · Belief State · Action Pair




Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Christos Dimitrakakis
  1. Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
