Abstract
In Chap. 2, we present simulation-based algorithms for estimating the optimal value function in finite-horizon MDPs with large (possibly uncountable) state spaces, where the usual techniques of policy iteration and value iteration are computationally impractical or infeasible to implement. We present two adaptive sampling algorithms that estimate the optimal value function by choosing which actions to sample in each state visited on a finite-horizon simulated sample path. The first approach builds upon the expected regret analysis of multi-armed bandit models and uses upper confidence bounds to determine which action to sample next, whereas the second approach uses ideas from learning automata to determine the next sampled action. The first approach is also the predecessor of a closely related approach in artificial intelligence (AI), Monte Carlo tree search, which led to a breakthrough in developing the current best computer Go-playing programs.
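To illustrate the first approach at the level of a single state, the following is a minimal sketch of UCB1-style action selection in a stochastic multi-armed bandit, where each "arm" stands in for an action whose sampled reward would, in the chapter's setting, come from simulating the remaining horizon. The function names, the Bernoulli arms, and the exploration constant `c = sqrt(2)` are illustrative choices, not the chapter's exact algorithm.

```python
import math
import random

def ucb1_select(counts, means, c=math.sqrt(2)):
    """Return the index of the arm maximizing the upper confidence bound
    mean_i + c * sqrt(ln(total pulls) / pulls_i).

    Arms that have never been pulled are sampled first, so the index is
    well defined for every arm.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i  # sample each arm once before trusting the index
    total = sum(counts)
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(math.log(total) / counts[i]))

def run_bandit(reward_fns, horizon=1000, seed=0):
    """Repeatedly select an arm via UCB1, observe a sampled reward, and
    update running empirical means; return the pull counts and means."""
    random.seed(seed)
    k = len(reward_fns)
    counts = [0] * k
    means = [0.0] * k
    for _ in range(horizon):
        a = ucb1_select(counts, means)
        r = reward_fns[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return counts, means
```

On two hypothetical Bernoulli arms with success probabilities 0.3 and 0.7, the better arm accumulates most of the pulls while every arm continues to be sampled at a logarithmic rate, which is the exploration–exploitation balance underlying the regret bounds cited in the chapter.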
Copyright information
© 2013 Springer-Verlag London
About this chapter
Cite this chapter
Chang, H.S., Hu, J., Fu, M.C., Marcus, S.I. (2013). Multi-stage Adaptive Sampling Algorithms. In: Simulation-Based Algorithms for Markov Decision Processes. Communications and Control Engineering. Springer, London. https://doi.org/10.1007/978-1-4471-5022-0_2
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5021-3
Online ISBN: 978-1-4471-5022-0