Abstract
A bandit is a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a bandit “arm”) and observes an outcome; the selection is based on the observation history. The objective is to choose arms sequentially so as to minimize the growth rate (with decision time) of the number of suboptimal selections.
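The sequential protocol above can be sketched as a simulation loop. This is a minimal illustration, not the paper's method: the arms, the `greedy_after_warmup` rule, and the uniform reward model are all hypothetical choices made for the example.

```python
import random

def play_bandit(arms, select, horizon, seed=0):
    """Run a bandit problem: at each decision time, `select` picks an arm
    index from the observation history alone; the outcome of that arm is
    then observed. Returns the number of suboptimal selections."""
    rng = random.Random(seed)
    history = [[] for _ in arms]                     # observed outcomes per arm
    best = max(range(len(arms)), key=lambda i: arms[i][0])
    suboptimal = 0
    for _ in range(horizon):
        i = select(history, rng)
        mean, spread = arms[i]                       # bounded arm: uniform on [mean-spread, mean+spread]
        history[i].append(mean + spread * (2 * rng.random() - 1))
        if i != best:
            suboptimal += 1
    return suboptimal

def greedy_after_warmup(history, rng):
    # Hypothetical rule for illustration: sample each arm once,
    # then always play the arm with the best empirical mean.
    for i, obs in enumerate(history):
        if not obs:
            return i
    return max(range(len(history)),
               key=lambda i: sum(history[i]) / len(history[i]))
```

With two well-separated bounded arms, this naive rule makes only the single warm-up pull of the inferior arm; with overlapping arms it can lock onto the wrong one, which is why more careful index rules are needed.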
The appellation “bandit” refers to mechanical gambling machines, and the tradition stems from the question of allocating competing treatments to a sequence of patients having the same disease. Our motivation is “machine learning,” in which a game-playing or assembly-line-adjusting computer is faced with a sequence of statistically similar decision problems and, as a resource, has access to an expanding database relevant to these problems.
The setting for the present study is nonparametric and infinite-horizon. The central aim is to present a methodology which postulates only finite moments or, alternatively, bounded bandit arms. Under these circumstances, the proposed strategies are shown to be asymptotically optimal and to converge at guaranteed rates. In the bounded-arm case, the rate is optimal.
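For the bounded-arm case, an index rule of the Lai–Robbins type illustrates how logarithmic growth of suboptimal selections is achievable. The sketch below uses the UCB1 index (due to Auer et al.), which is an assumption for illustration and not necessarily the strategy proposed here; the Bernoulli arms are likewise hypothetical.

```python
import math
import random

def ucb1(arm_means, horizon, seed=1):
    """Index rule for rewards bounded in [0, 1]: play the arm with the
    highest empirical mean plus a confidence bonus. The number of
    suboptimal selections grows only logarithmically in the horizon."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k                      # pulls per arm
    sums = [0.0] * k                      # total reward per arm
    best = max(range(k), key=lambda i: arm_means[i])
    suboptimal = 0
    for t in range(horizon):
        if t < k:
            i = t                         # initialization: play each arm once
        else:
            i = max(range(k),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = 1.0 if rng.random() < arm_means[i] else 0.0   # Bernoulli outcome
        counts[i] += 1
        sums[i] += r
        if i != best:
            suboptimal += 1
    return suboptimal
```

Over 2000 plays of two arms with means 0.9 and 0.1, the inferior arm is selected only a handful of times, in line with the logarithmic bound; a fixed or purely random allocation would select it roughly half the time.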
We extend the theory to the case in which the bandit population is infinite, and share some computational experience.
Yakowitz, S., Lowe, W. Nonparametric bandit methods. Ann Oper Res 28, 297–312 (1991). https://doi.org/10.1007/BF02055587