Abstract
In Chap. 2, we present simulation-based algorithms for estimating the optimal value function in finite-horizon MDPs with large (possibly uncountable) state spaces, where the usual techniques of policy iteration and value iteration are computationally impractical or infeasible to implement. We present two adaptive sampling algorithms that estimate the optimal value function by choosing which actions to sample in each state visited on a finite-horizon simulated sample path. The first approach builds upon the expected regret analysis of multi-armed bandit models and uses upper confidence bounds to determine which action to sample next, whereas the second approach uses ideas from learning automata to determine the next sampled action. The first approach is also the predecessor of a closely related approach in artificial intelligence (AI), Monte Carlo tree search, which led to a breakthrough in developing the current best computer Go-playing programs.
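To illustrate the first approach at the level of a single state, the following is a minimal sketch of UCB1-style action selection in a stochastic multi-armed bandit, where each "arm" stands in for an action whose sampled reward would, in the chapter's setting, come from simulating the remaining horizon. The function names, the Bernoulli arms, and the exploration constant `c = sqrt(2)` are illustrative choices, not the chapter's exact algorithm.

```python
import math
import random

def ucb1_select(counts, means, c=math.sqrt(2)):
    """Return the index of the arm maximizing the upper confidence bound
    mean_i + c * sqrt(ln(total pulls) / pulls_i).

    Arms that have never been pulled are sampled first, so the index is
    well defined for every arm.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i  # sample each arm once before trusting the index
    total = sum(counts)
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(math.log(total) / counts[i]))

def run_bandit(reward_fns, horizon=1000, seed=0):
    """Repeatedly select an arm via UCB1, observe a sampled reward, and
    update running empirical means; return the pull counts and means."""
    random.seed(seed)
    k = len(reward_fns)
    counts = [0] * k
    means = [0.0] * k
    for _ in range(horizon):
        a = ucb1_select(counts, means)
        r = reward_fns[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return counts, means
```

On two hypothetical Bernoulli arms with success probabilities 0.3 and 0.7, the better arm accumulates most of the pulls while every arm continues to be sampled at a logarithmic rate, which is the exploration–exploitation balance underlying the regret bounds cited in the chapter.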
Copyright information
© 2013 Springer-Verlag London
About this chapter
Cite this chapter
Chang, H.S., Hu, J., Fu, M.C., Marcus, S.I. (2013). Multi-stage Adaptive Sampling Algorithms. In: Simulation-Based Algorithms for Markov Decision Processes. Communications and Control Engineering. Springer, London. https://doi.org/10.1007/978-1-4471-5022-0_2
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5021-3
Online ISBN: 978-1-4471-5022-0