## Abstract

We consider a novel stochastic multi-armed bandit problem called *good arm identification* (GAI), where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold. GAI is a pure-exploration problem in which a single agent repeatedly outputs an arm as soon as it is identified as good, before confirming that the remaining arms are actually not good. The objective of GAI is to minimize the number of samples for each such output. We find that GAI faces a new kind of dilemma, the *exploration-exploitation dilemma of confidence*, which does not arise in the best arm identification; as a result, an efficient algorithm design for GAI is quite different from that for the best arm identification. We derive a lower bound on the sample complexity of GAI that is tight up to the logarithmic factor \(\mathrm {O}(\log \frac{1}{\delta })\) for acceptance error rate \(\delta \), and develop an algorithm whose sample complexity almost matches this lower bound. Finally, we confirm experimentally that the proposed algorithm outperforms naive algorithms in synthetic settings based on a conventional bandit problem and on clinical trial research for rheumatoid arthritis.

## Introduction

The stochastic multi-armed bandit (MAB) problem is one of the most fundamental problems for sequential decision-making under uncertainty (Sutton and Barto 1998). It is regarded as a subfield of reinforcement learning in which an agent aims to acquire a policy to select the best-rewarding action via trial and error. In the stochastic MAB problem, a single agent repeatedly plays *K* slot machines called *arms*, where an arm generates a stochastic reward when pulled. At each round *t*, the agent pulls arm \(i \in [K] = \{1,2,\ldots ,K\}\) and then observes an i.i.d. reward \(X_i(t)\) from distribution \(\nu _i\) with expectation \(\mu _i \in [0,1]\).
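
The interaction protocol above can be sketched as a small simulator. The class and method names here are illustrative, not from the paper:

```python
import random

class BernoulliBandit:
    """K-armed bandit; arm i yields a Bernoulli reward with mean mu_i."""

    def __init__(self, means, seed=0):
        self.means = list(means)      # expected rewards mu_i in [0, 1]
        self.rng = random.Random(seed)

    @property
    def K(self):
        return len(self.means)

    def pull(self, i):
        """Pull arm i (0-indexed) and observe an i.i.d. reward X_i(t)."""
        return 1.0 if self.rng.random() < self.means[i] else 0.0

bandit = BernoulliBandit([0.1, 0.45, 0.9])
rewards = [bandit.pull(2) for _ in range(10_000)]
print(sum(rewards) / len(rewards))  # approximately 0.9
```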

One of the most classic MAB formulations is the *cumulative regret minimization* (Lai and Robbins 1985; Auer et al. 2002), where the agent tries to maximize the cumulative reward over a fixed number of trials. In this setting, the agent faces the *exploration-exploitation dilemma of reward*: exploration means that the agent pulls seemingly suboptimal arms to discover the arm whose expected reward is largest, and exploitation means that the agent pulls the currently best arm to increase the cumulative reward. Frameworks of this kind apply widely to real-world problems such as clinical trials (Grieve and Krams 2005; Genovese et al. 2013; Choy et al. 2013; Curtis et al. 2015; Liu et al. 2017) and personalized recommendations (Tang et al. 2015).

Another classic branch of the MAB problem is the *best arm identification* (Kaufmann et al. 2016; Kalyanakrishnan et al. 2012), a pure-exploration problem in which the agent tries to identify the best arm \(a^* = \mathrm {arg \, max}_{i \in \{1,2,\ldots ,K\}} \mu _i\). So far, the conceptual idea of the best arm identification has been successfully applied to many kinds of real-world problems (Koenig and Law 1985; Schmidt et al. 2006; Zhou et al. 2014; Jun et al. 2016). Recently, the *thresholding bandit problem* was proposed (Locatelli et al. 2016) as a variant of pure-exploration MAB formulations, in which the agent tries to correctly partition all the *K* arms into good arms and bad arms, where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold \(\xi >0\), and a bad arm is one whose expected reward is below \(\xi \). In practice, however, neither correctly partitioning all the *K* arms nor exactly identifying the very best arm is always needed; rather, finding some reasonably good arms as fast as possible is often more useful.

Take personalized recommendation as an example, where the objective is to increase profit by sending direct emails recommending personalized items. Timely recommendation is key, because the best sellers of the past are not necessarily the best sellers of the future. Three problems arise if this task is formulated as the best arm identification or the thresholding bandit problem. First, exploration costs can explode when the purchase probabilities of several best sellers are very close to each other. Although this can be partly mitigated by the \(\epsilon \)-best arm identification (Even-Dar et al. 2006), in which an arm with expectation greater than or equal to \(\max _{i\in [K]} \mu _i-\epsilon \) is also acceptable, the tolerance parameter \(\epsilon \) has to be set very conservatively. Second, recommending even the best sellers is not a good idea if the “best” purchase probability is too small relative to the advertising costs. Third, partitioning all items into good (profitable) and bad (unprofitable) ones needlessly increases exploration costs when finding only a few good items suffices to increase profit. For these reasons, formulating the personalized recommendation problem as the best arm identification or the thresholding bandit problem is not necessarily effective.

Similar problems also occur in clinical trials for finding effective drugs (Kim et al. 2011) or appropriate doses of a drug (Grieve and Krams 2005; Genovese et al. 2013; Choy et al. 2013; Curtis et al. 2015; Liu et al. 2017), where the number of patients is extremely limited. In such cases, it is vitally important to find some drugs or doses with a satisfactory effect as fast as possible, rather than to classify all drugs or doses into satisfactory ones and others, or to identify the exact best one.

In this paper, we propose a new bandit framework named *good arm identification* (GAI), where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold. We formulate GAI as a pure-exploration problem in the *fixed confidence* setting, which is often considered in conventional pure-exploration problems: an acceptance error rate \(\delta \) is fixed in advance, and we minimize the number of arm-pulls needed to assure the correctness of the output with probability greater than or equal to \(1-\delta \). In GAI, a single agent repeats a process of outputting an arm as soon as the agent identifies it as a good one with error probability at most \(\delta \); if it is found that no good arms remain, the agent stops working. Although the agent does not face the exploration-exploitation dilemma of reward, since GAI is a pure-exploration problem, the agent suffers from a new kind of dilemma, the *exploration-exploitation dilemma of confidence*: exploration means that the agent pulls arms other than the currently best one in order to discover the arm that it can identify as good with the fewest arm-pulls, and exploitation means that the agent pulls the currently best arm to increase the confidence in its goodness.

To address the dilemma of confidence, we propose a Hybrid algorithm for the Dilemma of Confidence (HDoC). The sampling strategy of HDoC is based on the upper confidence bound (UCB) algorithm for the cumulative regret minimization (Auer et al. 2002), and the identification rule (that is, the criterion to output an arm as a good one) of HDoC is based on the lower confidence bound (LCB) for the best arm identification (Kalyanakrishnan et al. 2012). In addition, we show that a lower bound on the sample complexity for GAI is \(\Omega (\lambda \log \frac{1}{\delta })\), and HDoC can find \(\lambda \) good arms within \(\mathrm {O}\left( \lambda \log \frac{1}{\delta } + (K-\lambda ) \log \log \frac{1}{\delta } \right) \) samples. This result suggests that HDoC is superior to naive algorithms based on conventional pure-exploration problems, because they require \(\mathrm {O}\left( K \log \frac{1}{\delta } \right) \) samples.

For the personalized recommendation problem, the GAI approach is more appropriate because the agent can quickly identify good items: it focuses only on finding good items rather than identifying the best item (as in the best arm identification) or the bad items (as in the thresholding bandit). Certainly, a recommended item may not have the best purchase probability. However, this does not necessarily matter when customers’ interests and item repositories change frequently, because exactly identifying the best item would require too many samples to be practical. In addition, since arms are compared to an absolute threshold rather than to each other, exploration costs do not explode even when the purchase probabilities are close to each other, and the agent can refrain from recommending items whose purchase probabilities are too small.

Our contributions are fourfold. First, we formulate a novel pure-exploration problem called GAI and show that it involves a new kind of dilemma, the exploration-exploitation dilemma of confidence. Second, we derive a lower bound for GAI in the *fixed confidence* setting. Third, we propose the HDoC algorithm and show that an upper bound on its sample complexity almost matches the lower bound. Fourth, we experimentally demonstrate that HDoC outperforms two naive algorithms derived from other pure-exploration problems in synthetic settings based on the thresholding bandit problem (Locatelli et al. 2016) and on clinical trial research for rheumatoid arthritis (Genovese et al. 2013; Choy et al. 2013; Curtis et al. 2015; Liu et al. 2017).

## Good arm identification

In this section, we first formulate GAI as a pure exploration problem in the fixed confidence setting. Next, we derive a lower bound on the sample complexity for GAI. We give the notation list in Table 1.

### Problem formulation

Let *K* be the number of arms, \(\xi \in (0,1)\) be a threshold and \(\delta >0\) be an acceptance error rate. Each arm \(i \in [K] = \{1,2,\ldots ,K\}\) is associated with a Bernoulli distribution \(\nu _i\) with mean \(\mu _i\). The parameters \(\{\mu _i\}_{i=1}^{K}\) are unknown to the agent. We define a good arm as an arm whose expected reward is greater than or equal to the threshold \(\xi \). The number of good arms is denoted by *m*, which is unknown to the agent, and, without loss of generality, we assume an indexing of the arms such that

\(\mu _1 \ge \mu _2 \ge \cdots \ge \mu _m \ge \xi > \mu _{m+1} \ge \cdots \ge \mu _K.\)

The agent is naturally unaware of this indexing. At each round *t*, the agent pulls an arm \(a(t) \in [K]\) and receives an i.i.d. reward drawn from distribution \(\nu _{a(t)}\). The agent outputs an arm when it is identified as a good one. The agent repeats this process until there remain no good arms, where the stopping time is denoted by \(\tau _{\mathrm {stop}}\). To be more precise, the agent outputs \({\hat{a}}_1, {\hat{a}}_2, \ldots ,{\hat{a}}_{{\hat{m}}}\) as good arms (which are different from each other) at rounds \(\tau _1, \tau _2, \ldots ,\tau _{{\hat{m}}}\), respectively, where \({\hat{m}}\) is the number of arms that the agent outputs as good ones. The agent stops working after outputting \(\bot \) (NULL) at round \(\tau _{\mathrm {stop}}\) when the agent finds that there remain no good arms. If all arms are identified as good ones, then the agent stops after outputting \({\hat{a}}_K\) and \(\bot \) together at the same round. For \(\lambda >{\hat{m}}\) we define \(\tau _{\lambda }=\tau _{\mathrm {stop}}\). Now, we introduce the definitions of (\(\lambda \), \(\delta \))-PAC (Probably Approximately Correct) and \(\delta \)-PAC.

### Definition 1

(\((\lambda , \delta )\)-*PAC*) An algorithm satisfying the following conditions is called (\(\lambda \), \(\delta \))-PAC: if there are at least \(\lambda \) good arms then \(\mathbb {P}[\{{\hat{m}}< \lambda \}\,\cup \,\bigcup _{i\in \{{\hat{a}}_1, {\hat{a}}_2, \ldots , {\hat{a}}_\lambda \} }\{\mu _i< \xi \}]\le \delta \) and if there are less than \(\lambda \) good arms then \(\mathbb {P}[ {\hat{m}}\ge \lambda ]\le \delta \).

### Definition 2

(\(\delta \)-*PAC*) An algorithm is called \(\delta \)-PAC if the algorithm is \((\lambda , \delta )\)-PAC for all \(\lambda \in [K]\).

The agent aims to simultaneously minimize \(\tau _1, \tau _2, \ldots , \tau _{\mathrm {stop}}\) with a \(\delta \)-PAC algorithm. Note that minimizing \(\tau _{\mathrm {stop}}\) alone corresponds to the thresholding bandit problem in the fixed confidence setting.

As we can easily see from these definitions, the condition for a (\(\lambda ,\delta \))-PAC algorithm is weaker than that for a \(\delta \)-PAC algorithm. Thus, when a specific value of \(\lambda \) is of interest, one might hope to minimize \(\tau _{\lambda }\) better with a \((\lambda ,\delta )\)-PAC algorithm than with a \(\delta \)-PAC algorithm. Nevertheless, we will show that the lower bound on \(\tau _{\lambda }\) for \((\lambda ,\delta )\)-PAC algorithms can be achieved by a \(\delta \)-PAC algorithm without knowledge of \(\lambda \).

### Lower bound on the sample complexity

We give a lower bound on the sample complexity for GAI; the proof is given in Sect. 5.

### Theorem 1

Under any \((\lambda , \delta )\)-PAC algorithm, if there are \(m \ge \lambda \) good arms, then

where \(d(x,y) = x\log (x/y)+(1-x)\log ((1-x)/(1-y))\) is the binary relative entropy, with the convention that \(d(0,0)=d(1,1)=0\).
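
For concreteness, the binary relative entropy and its conventions translate directly into code (a hypothetical helper, not from the paper):

```python
import math

def binary_kl(x, y):
    """d(x, y) = x log(x/y) + (1-x) log((1-x)/(1-y)), with d(0,0)=d(1,1)=0."""
    if x == y:
        return 0.0
    if y in (0.0, 1.0):  # x != y puts mass where y has none
        return math.inf
    terms = 0.0
    if x > 0.0:
        terms += x * math.log(x / y)
    if x < 1.0:
        terms += (1.0 - x) * math.log((1.0 - x) / (1.0 - y))
    return terms

print(binary_kl(0.9, 0.5))  # about 0.368: a larger gap gives a larger divergence
```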

This lower bound on the sample complexity for GAI is given in terms of top-\(\lambda \) expectations \(\{\mu _i\}_{i=1}^{\lambda }\). In the next section we confirm that this lower bound is tight up to the logarithmic factor \(\mathrm {O}(\log \frac{1}{\delta })\).

## Algorithms

In this section, we first consider naive algorithms based on other pure-exploration problems. Next, we propose an algorithm for GAI and bound its sample complexity from above. Pseudo code for all the algorithms is given in Algorithm 1. These algorithms can be decomposed into two components: a sampling strategy and an identification criterion. The sampling strategy is the policy that decides which arm the agent pulls, and the identification criterion is the policy by which the agent decides whether arms are good or bad. All the algorithms adopt the same identification criterion, given in Lines 5–11 of Algorithm 1, which is based on the Lower Confidence Bound (LCB) for the best arm identification (Kalyanakrishnan et al. 2012). See Remark 3 at the end of Sect. 3.2 for other choices of identification criteria.

### Naive algorithms

We consider two naive algorithms: the Lower and Upper Confidence Bounds algorithm for GAI (LUCB-G), which is based on the LUCB algorithm for the best arm identification (Kalyanakrishnan et al. 2012), and the Anytime Parameter-free Thresholding algorithm for GAI (APT-G), which is based on the APT algorithm for the thresholding bandit problem (Locatelli et al. 2016). In both cases, the sampling strategy is the same as in the original algorithm. Both algorithms sample every arm \(\mathrm {O}\left( \log \frac{1}{\delta } \right) \) times.
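
As a sketch of the sampling side of APT-G: the APT index of Locatelli et al. (2016) pulls the arm whose empirical mean is hardest to separate from the threshold. The function below is illustrative (names are ours; the precision parameter is set to 0 for simplicity):

```python
import math

def apt_next_arm(counts, emp_means, xi, eps=0.0):
    """APT index: pull argmin_i sqrt(N_i) * (|mu_hat_i - xi| + eps).

    Arms whose empirical means sit close to the threshold xi keep
    getting pulled, since their index stays small.
    """
    scores = [math.sqrt(n) * (abs(m - xi) + eps)
              for n, m in zip(counts, emp_means)]
    return min(range(len(scores)), key=scores.__getitem__)

# With equal counts, the arm closest to the threshold is chosen.
print(apt_next_arm([5, 5, 5], [0.1, 0.49, 0.9], xi=0.5))  # -> 1
```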

### Proposed algorithm

We propose a Hybrid algorithm for the Dilemma of Confidence (HDoC). The sampling strategy of HDoC is based on the UCB algorithm for cumulative regret minimization (Auer et al. 2002). As we will see later, the algorithm stops within \(t=\mathrm {O}(\log \frac{1}{\delta })\) rounds with high probability. Thus, the second term of the UCB score of HDoC in (2) is \(\mathrm {O}\left( \sqrt{ \frac{\log \log (1/\delta )}{N_i (t)} } \right) \), whereas that of LUCB-G in (3) is \(\mathrm {O}\left( \sqrt{ \frac{\log (1/\delta )}{N_i (t)} } \right) \). Therefore, HDoC pulls the currently best arm more frequently than LUCB-G; that is, HDoC puts more emphasis on exploitation than on exploration.
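
As a concrete sketch, one round of HDoC for Bernoulli arms can be written as follows, assuming the Bernoulli-case scores obtained by setting \(\sigma ^2 = 1/4\) in the Gaussian scores of Remark 1; the function and variable names are ours, and Algorithm 1 in the paper is authoritative:

```python
import math

def hdoc_round(counts, sums, t, xi, delta, K):
    """One HDoC round (sketch): UCB sampling + LCB/UCB identification.

    Returns (arm to pull, arms identified as good, arms identified as bad)
    over arms 0..K-1, each already pulled counts[i] > 0 times.
    """
    mu_hat = [s / n for s, n in zip(sums, counts)]
    # Sampling: optimistic UCB score with the exploitation-leaning bonus.
    ucb_sample = [m + math.sqrt(math.log(t) / (2 * n))
                  for m, n in zip(mu_hat, counts)]
    arm = max(range(K), key=ucb_sample.__getitem__)
    # Identification: confidence radius valid uniformly over all rounds.
    rad = [math.sqrt(math.log(4 * K * n * n / delta) / (2 * n))
           for n in counts]
    good = [i for i in range(K) if mu_hat[i] - rad[i] >= xi]  # LCB above xi
    bad = [i for i in range(K) if mu_hat[i] + rad[i] < xi]    # UCB below xi
    return arm, good, bad

arm, good, bad = hdoc_round(
    counts=[200, 200, 200], sums=[180.0, 100.0, 20.0],
    t=600, xi=0.5, delta=0.05, K=3)
print(arm, good, bad)  # -> 0 [0] [2]
```

The arm with empirical mean 0.9 is both pulled next and output as good, while the arm near 0.1 is discarded; the middle arm stays undecided.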

The correctness of the output of the HDoC algorithm can be verified by the following theorem, whose proof is given in Appendix A.

### Theorem 2

The HDoC algorithm is \(\delta \)-PAC.

This theorem means that the HDoC algorithm outputs a bad arm with probability at most \(\delta \).

Next we give an upper bound on the sample complexity of HDoC. We bound the sample complexity in terms of \({\varDelta }_i=|\mu _i-\xi |\) and \({\varDelta }_{i,j}=\mu _i-\mu _j\).

### Theorem 3

Assume that \({\varDelta }_{\lambda ,\lambda +1}>0\). Then, for any \(\lambda \le m\) and \(\epsilon <\min \{\min _{i\in [K]}{\varDelta }_i,\,{\varDelta }_{\lambda ,\lambda +1}/2\}\),

where

We prove this theorem in Appendix B. The following corollary is straightforward from this theorem.

### Corollary 1

Let \({\varDelta }=\min \{\min _{i\in [K]}{\varDelta }_i,\min _{\lambda \in [K-1]}{\varDelta }_{\lambda ,\lambda +1}/2\}\). Then, for any \(\lambda \le m\),

### Proof

Since

we obtain (4) and (5) by letting \(\epsilon \downarrow 0\). We obtain (6) and (7) by letting \(\epsilon ={\varDelta }/2\) in Theorem 3. \(\square \)

Note that \(d(\mu _i,\xi )\ge 2(\mu _i-\xi )^2=2{\varDelta }_i^2\) by Pinsker’s inequality, and the coefficient 2 cannot be improved. Thus the upper bound in (4) in Corollary 1 is almost optimal in view of the lower bound in Theorem 1 for sufficiently small \(\delta \). The authors believe that the coefficient \(2{\varDelta }_i^2\) can be improved to \(d(\mu _i,\xi )\) by the techniques in the KL-UCB (Kullback-Leibler UCB) algorithm (Cappé et al. 2012) and the Thompson sampling algorithm (Agrawal and Goyal 2012), although we use the sampling strategy based on the UCB algorithm (Auer et al. 2002) for simplicity of the analysis. Eq. (6) means that \(\mathbb {E}[\tau _{\lambda }]\) scales as \(\mathrm {O}(\lambda \log \frac{1}{\delta }+(K-\lambda ) \log \log \frac{1}{\delta })\) for moderately small \(\delta \), which contrasts with the sample complexity \(\mathrm {O}(K \log \frac{1}{\delta })\) for the best arm identification (Kaufmann et al. 2016). Furthermore, we see from (5) and (7) that the HDoC algorithm recovers the optimal sample complexity for the thresholding bandit (Locatelli et al. 2016).
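
The Pinsker step in this argument, \(d(\mu _i,\xi )\ge 2(\mu _i-\xi )^2\), and the tightness of the coefficient 2 are easy to check numerically (an illustrative verification, not from the paper):

```python
import math

def binary_kl(x, y):
    # Binary relative entropy d(x, y) for x, y in the open interval (0, 1).
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

# Pinsker's inequality d(x, y) >= 2 (x - y)^2 on a grid of interior points.
grid = [i / 20 for i in range(1, 20)]
ok = all(binary_kl(x, y) >= 2 * (x - y) ** 2 - 1e-12
         for x in grid for y in grid)
print(ok)  # True

# The coefficient 2 is tight: the ratio approaches 1 as x -> y near 1/2.
print(binary_kl(0.5, 0.501) / (2 * (0.5 - 0.501) ** 2))  # close to 1
```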

### Remark 1

We can easily extend GAI in the Bernoulli setting to GAI in a Gaussian setting with known variance \(\sigma ^2\). In the proofs of Theorems 2 and 3, we use the assumption of Bernoulli rewards only through Hoeffding’s inequality, expressed as

\(\mathbb {P}[{\hat{\mu }}_{i,n} \ge \mu _i + \epsilon ] \le \mathrm {e}^{-2n\epsilon ^2} \quad \text {and} \quad \mathbb {P}[{\hat{\mu }}_{i,n} \le \mu _i - \epsilon ] \le \mathrm {e}^{-2n\epsilon ^2},\)

where \({\hat{\mu }}_{i,n}\) is the empirical mean of the rewards when arm *i* has been pulled *n* times. When each reward follows a Gaussian distribution with variance \(\sigma ^2\), the distribution of the empirical mean is analogously evaluated as

\(\mathbb {P}[{\hat{\mu }}_{i,n} \ge \mu _i + \epsilon ] \le \mathrm {e}^{-n\epsilon ^2/(2\sigma ^2)} \quad \text {and} \quad \mathbb {P}[{\hat{\mu }}_{i,n} \le \mu _i - \epsilon ] \le \mathrm {e}^{-n\epsilon ^2/(2\sigma ^2)}\)

by Cramér’s inequality. With this replacement, in the Gaussian setting with known variance \(\sigma ^2\), the score of HDoC becomes \({\tilde{\mu }}_i (t) = {\hat{\mu }}_i (t) + \sqrt{\frac{2\sigma ^2 \log t}{N_i(t)}}\), the score of LUCB-G becomes \({\overline{\mu }}_i (t) = {\hat{\mu }}_i (t) + \sqrt{\frac{2\sigma ^2 \log (4KN_i^2(t)/\delta )}{N_i(t)}}\), and the score for identifying good arms becomes \({\underline{\mu }}_i (t) = {\hat{\mu }}_i (t) - \sqrt{\frac{2\sigma ^2 \log (4KN_i^2(t)/\delta )}{N_i(t)}}\). The score of APT-G in the Gaussian setting is the same as in the Bernoulli setting.
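
The three Gaussian-case scores above translate directly into code (function and variable names are ours):

```python
import math

def gaussian_scores(mu_hat, n, t, sigma2, K, delta):
    """HDoC sampling score, LUCB-G sampling score, and identification LCB
    for an arm with empirical mean mu_hat that has been pulled n times."""
    hdoc = mu_hat + math.sqrt(2 * sigma2 * math.log(t) / n)
    conf = math.sqrt(2 * sigma2 * math.log(4 * K * n * n / delta) / n)
    lucb_g = mu_hat + conf   # upper confidence bound for sampling
    lcb = mu_hat - conf      # lower confidence bound for identification
    return hdoc, lucb_g, lcb

hdoc, lucb_g, lcb = gaussian_scores(mu_hat=1.5, n=50, t=200,
                                    sigma2=1.44, K=7, delta=0.05)
print(hdoc < lucb_g)  # True: the HDoC exploration bonus is the smaller one
```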

### Remark 2

Theorem 2 and the evaluation of \(\tau _{\mathrm {stop}}\) in Theorem 3 do not depend on the sampling strategy; they only use the fact that the identification criterion is given by Lines 5–11 in Algorithm 1. Thus, these results still hold for the LUCB-G and APT-G algorithms.

### Remark 3

The evaluation of the error probability is based on the union bound over all rounds \(t\in \mathbb {N}\), and the identification criterion in Lines 5–11 in Algorithm 1 is designed for this evaluation. The use of the union bound does not worsen the asymptotic analysis for \(\delta \rightarrow 0\), and we use this identification criterion to obtain a simple sample complexity bound. On the other hand, it is known that the empirical performance can be improved considerably by, for example, a bound based on the law of the iterated logarithm (Jamieson et al. 2014), which avoids the union bound. We could also use an identification criterion based on such a bound to improve empirical performance, but this would not affect the relative comparison, since we use the same identification criterion across algorithms with different sampling strategies.

### Gap between lower and upper bounds

As we can see from Theorem 3 and its proof, an arm \(i>\lambda \) (that is, an arm other than the top-\(\lambda \) ones) is pulled roughly \(\mathrm {O}( \frac{\log \log (1/\delta )}{{\varDelta }_{\lambda ,i}^2})\) times until HDoC outputs \(\lambda \) good arms. On the other hand, the lower bound in Theorem 1 only considers the \(\mathrm {O}(\log \frac{1}{\delta })\) term and does not depend on arms \(i>\lambda \). Therefore, when \(K-\lambda \) is very large compared to \(\log \frac{1}{\delta }\) (more specifically, when \(K-\lambda ={\Omega }\left( \frac{\log (1/\delta )}{\log \log (1/\delta )}\right) \)), there still exists a gap between the lower bound in (1) and the upper bound in (6). Furthermore, the bound in (6) becomes meaningless when \({\varDelta }_{\lambda ,\lambda +1}\approx 0\). In fact, the \(\mathrm {O}(\log \log \frac{1}{\delta })\) term for small \({\varDelta }_{\lambda ,\lambda +1}\) is not negligible in some cases, as we will see experimentally in Sect. 4.

To fill this gap, it is necessary to consider the following difference between cumulative regret minimization and GAI. Consider pulling two good arms with the same expected rewards. In cumulative regret minimization, which of these two arms is pulled makes no difference in the reward, and it suffices, for example, to pull them alternately. In GAI, on the other hand, the agent should output one of these good arms as fast as possible; hence, it is desirable to pull one of these equivalent arms with a biased frequency. However, biasing the numbers of samples between seemingly equivalent arms increases the risk of missing an actually better arm, and this dilemma is a difficulty specific to GAI. The proposed algorithm, HDoC, is not designed to cope with this difficulty, and improving the \(\mathrm {O}(\log \log \frac{1}{\delta })\) term from this viewpoint is important future work.

## Numerical experiments

In this section we experimentally compare the performance of HDoC with that of LUCB-G and APT-G. In all experiments, each arm is pulled five times as burn-in and the results are the averages over 1,000 independent runs.

### Threshold settings

We consider three settings, named Threshold 1–3, which are based on Experiments 1 and 2 in Locatelli et al. (2016) and Experiment 4 in Mukherjee et al. (2017).

*Threshold 1 (Three group setting)* Ten Bernoulli arms with mean \(\mu _{1:3} = 0.1\), \(\mu _{4:7} = 0.35 + 0.1 \cdot (0:3)\) and \(\mu _{8:10} = 0.9\), and threshold \(\xi = 0.5\), where (*i* : *j*) denotes \(\{i,i+1,i+2,\ldots ,j-1,j\}\).

*Threshold 2 (Arithmetically progressive setting)* Six Bernoulli arms with mean \(\mu _{1:6} = 0.1 \cdot (1:6)\) and threshold \(\xi = 0.35\).

*Threshold 3 (Close-to-threshold setting)* Ten Bernoulli arms with mean \(\mu _{1:3} = 0.55\) and \(\mu _{4:10} = 0.45\) and threshold \(\xi = 0.5\).
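
The three synthetic settings above can be written down as data, and counting the arms at or above the threshold gives the number of good arms *m* in each setting (a bookkeeping sketch, not from the paper):

```python
settings = {
    # Threshold 1: three groups of Bernoulli arms.
    "Threshold 1": ([0.1] * 3 + [0.35, 0.45, 0.55, 0.65] + [0.9] * 3, 0.5),
    # Threshold 2: arithmetically progressive means.
    "Threshold 2": ([0.1 * i for i in range(1, 7)], 0.35),
    # Threshold 3: all means close to the threshold.
    "Threshold 3": ([0.55] * 3 + [0.45] * 7, 0.5),
}

m = {name: sum(mu >= xi for mu in means)
     for name, (means, xi) in settings.items()}
print(m)  # {'Threshold 1': 5, 'Threshold 2': 3, 'Threshold 3': 3}
```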

### Medical settings

We also consider two medical settings of dose-finding in clinical trials as GAI. The dose of a drug is quite important in general: although high doses are usually more effective than low doses, a low dose can be more effective than a high dose because high doses often cause adverse side effects. Therefore, it is desirable to list various doses of a drug with a satisfactory effect, which can be formulated as GAI. We consider two instances of the dose-finding problem based on Genovese et al. (2013) and Liu et al. (2017), named Medical 1 and 2, respectively, specified as follows. In both settings, the threshold \(\xi \) corresponds to a satisfactory effect.

*Medical 1 (Dose-finding of secukinumab for rheumatoid arthritis with satisfactory effect)* Five Bernoulli arms with mean \(\mu _1=0.36\), \(\mu _2=0.34\), \(\mu _3=0.469\), \(\mu _4=0.465\), \(\mu _5=0.537\), and threshold \(\xi =0.5\).

Here, \(\mu _1, \mu _2,\ldots , \mu _5\) represent placebo, secukinumab 25mg, 75mg, 150mg and 300mg, respectively. The expected reward indicates American College of Rheumatology 20% Response (ACR20) at week 16 given in (Genovese et al. 2013, Table 2).

*Medical 2 (Dose-finding of GSK654321 for rheumatoid arthritis with satisfactory effect)* Seven Gaussian arms with mean \(\mu _1=0.5\), \(\mu _2=0.7\), \(\mu _3=1.6\), \(\mu _4=1.8\), \(\mu _5=1.2\), \(\mu _6=1.0\) and \(\mu _7=0.6\) with variance \(\sigma ^2=1.44\) and threshold \(\xi =1.2\).

Here, \(\mu _1, \mu _2,\ldots , \mu _7\) represent the positive effect^{Footnote 1} of placebo and the doses of GSK654321 of 0.03, 0.3, 10, 20 and 30 mg/kg, respectively, where GSK654321 (Liu et al. 2017) is a drug under development with a nonlinear dose-response, modeled on the real drug GSK315234 (Choy et al. 2013). The expected reward indicates the change from the baseline in the \({\varDelta }\) Disease Activity Score 28 (DAS28) given in (Liu et al. 2017, Profile 4). The threshold \(\xi = 1.2\) is based on Curtis et al. (2015).
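
Doing the same bookkeeping for the two medical settings shows why Medical 2 is delicate: one good arm sits exactly at the threshold (an illustrative sketch using the means listed above):

```python
medical = {
    "Medical 1": ([0.36, 0.34, 0.469, 0.465, 0.537], 0.5),   # Bernoulli arms
    "Medical 2": ([0.5, 0.7, 1.6, 1.8, 1.2, 1.0, 0.6], 1.2), # Gaussian arms
}

for name, (means, xi) in medical.items():
    good = [i + 1 for i, mu in enumerate(means) if mu >= xi]
    boundary = [i + 1 for i, mu in enumerate(means) if mu == xi]
    print(name, good, boundary)
# Medical 1 [5] []
# Medical 2 [3, 4, 5] [5]   <- arm 5 has mean exactly xi
```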

### Results

First we compare HDoC, LUCB-G and APT-G for acceptance error rates \(\delta = 0.05\) and \(0.005\). Tables 2 and 3 show the averages and standard deviations of \(\tau _1, \tau _2, \ldots , \tau _\lambda \) and \(\tau _{\mathrm {stop}}\) for these algorithms. In most settings, HDoC outperforms LUCB-G and APT-G. In particular, the number of samples required by APT-G is much larger than that required by HDoC or LUCB-G, whereas the stopping times of all the algorithms are close, as discussed in Remark 2. These results verify that HDoC addresses GAI more efficiently than LUCB-G or APT-G.

In Medical 2, we can easily see that \(\tau _3, \tau _{\mathrm {stop}} =+\infty \) with high probability, since the expected reward \(\mu _5\) is equal to the threshold \(\xi \). Moreover, APT-G fails to work completely, since it prefers to pull the arm whose expected reward is closest to the threshold \(\xi \) and thus selects the arm with mean \(\mu _5\) almost all the time. In fact, Tables 2 and 3 show that APT-G cannot identify even one good arm within 100,000 arm-pulls, whereas HDoC and LUCB-G identify some good arms reasonably quickly even in this case.

As shown in Tables 2 and 3, the performance of HDoC is almost the same as that of LUCB-G in Medical 2, where the expectations of the arms are very close to each other once the variance \(\sigma ^2\) is taken into consideration. Figure 1 shows the result of an experiment investigating the behavior of HDoC and LUCB-G for Medical 2 in more detail, where \(\tau _1,\tau _2\) are plotted for (possibly unrealistically) small \(\delta \). Here “Lower bound” in the figure is the asymptotic lower bound \(\sum _{i=1}^{\lambda }\frac{2\sigma ^2 \log (1/\delta )}{{\varDelta }_i^2}\) on \(\tau _{\lambda }\) for normal distributions (see Theorem 1 and Remark 1). Since the result of HDoC asymptotically approaches the lower bound, the \(\mathrm {O}(\log \frac{1}{\delta })\) term of the sample complexity of HDoC is almost optimal; the results also show that the effect of the \(\mathrm {O}(\log \log \frac{1}{\delta })\) term is not negligible for practical acceptance error rates such as \(\delta =0.05\) and 0.005.

## Proof of Theorem 1

In this section, we prove Theorem 1 based on the following proposition on the expected number of samples to distinguish two sets of reward distributions.

### Proposition 1

(Lemma 1 in Kaufmann et al. 2016) Let \(\nu \) and \(\nu '\) be two bandit models with *K* arms such that for all *i*, the distributions \(\nu _i\) and \(\nu _i'\) are mutually absolutely continuous. Then, for any almost-surely finite stopping time \(\sigma \) and any event \({\mathcal {E}}\),

\(\sum _{i=1}^{K} \mathbb {E}_{\nu }[N_i(\sigma )] \, \mathrm {KL}(\nu _i, \nu _i') \ge d(\mathbb {P}_{\nu }[{\mathcal {E}}], \mathbb {P}_{\nu '}[{\mathcal {E}}]),\)

where \(\mathrm {KL}(\nu _i, \nu _i')\) is the Kullback-Leibler divergence between distributions \(\nu _i\) and \(\nu _i'\), and \(d(x,y) = x\log (x/y)+(1-x)\log ((1-x)/(1-y))\) is the binary relative entropy, with the convention that \(d(0,0)=d(1,1)=0\).

Standard proofs for the best arm identification set \({\mathcal {E}}\) to an event such that \(\mathbb {P}[{\mathcal {E}}]\ge 1-\delta \) under any \(\delta \)-PAC algorithm. In contrast, we let \(\mathbb {P}[{\mathcal {E}}]\) range from 0 to 1 and establish a lower bound as a minimization problem over \(\mathbb {P}[{\mathcal {E}}]\).

### Proof of Theorem 1

Fix \(j\in [m]\) and consider a set of Bernoulli distributions \(\{\nu _i'\}\) with expectations \(\{\mu _i'\}\) given by \(\mu _j' = \xi - \epsilon \) and \(\mu _i' = \mu _i\) for \(i \ne j\).

Let \({\mathcal {E}}_j=\{j\in \{ {\hat{a}}_i\}_{i=1}^{\min \{\lambda ,{\hat{m}}\} } \}\) and \(p_j=\mathbb {P}\left[ j\in \{{\hat{a}}_i\}_{i=1}^{\min \{\lambda ,{\hat{m}}\} } \right] \) under \(\{\nu _i\}\). Since *j* is not a good arm under \(\{\nu _i'\}\), we obtain from Proposition 1 that

where we set \(d_i=d(\mu _i,\xi -\epsilon )\), and \(h(p)=-p\log p-(1-p)\log (1-p)\le \log 2\) denotes the binary entropy function.

Here note that

under any \((\lambda ,\delta )\)-PAC algorithm. Thus we have

where \(C^*\) is the optimal value of the optimization problem

which is equivalent to the linear programming problem

The dual problem of \((\mathrm {P}_2)\) is given by

Here consider the feasible solution of \((\mathrm {P}_2')\) given by

which attains the objective function

Since the objective function of a feasible solution for the dual problem \((\mathrm {P}_2')\) of \((\mathrm {P}_2)\) is always smaller than the optimal value \(C^*\) of \((\mathrm {P}_2)\), we have

We complete the proof by letting \(\epsilon \downarrow 0\). \(\square \)

## Conclusion

In this paper, we considered a new multi-armed bandit problem called good arm identification (GAI). The objective of GAI is to minimize not only the total number of samples to identify all good arms but also the number of samples until \(\lambda \) good arms are identified for each \(\lambda =1,2,\ldots \), where a good arm is an arm whose expected reward is greater than or equal to a threshold \(\xi \). Even though GAI is a pure-exploration problem and does not face the exploration-exploitation dilemma of reward, it encounters a new kind of dilemma: the exploration-exploitation dilemma of confidence. We derived a lower bound on the sample complexity of GAI, developed an efficient algorithm, HDoC, and showed theoretically that the sample complexity of HDoC almost matches the lower bound. We also experimentally demonstrated that HDoC outperforms algorithms based on other pure-exploration problems in three settings based on the thresholding bandit and two settings based on the dose-finding problem in clinical trials.

## Notes

The original values in Liu et al. (2017), which are smaller than zero, represent the negative effect; we inverted their sign to denote the positive effect.

## References

Agrawal, S., & Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In *Proceedings of the 25th Annual Conference on Learning Theory* (vol. 23, pp. 39.1–39.26).

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. *Machine Learning*, *47*(2), 235–256.

Cappé, O., Garivier, A., Maillard, O. A., Munos, R., & Stoltz, G. (2012). Kullback–Leibler upper confidence bounds for optimal sequential allocation. *The Annals of Statistics*, *41*(3), 1516–1541.

Choy, E., Bendit, M., McAleer, D., Liu, F., Feeney, M., Brett, S., et al. (2013). Safety, tolerability, pharmacokinetics and pharmacodynamics of an anti-oncostatin M monoclonal antibody in rheumatoid arthritis: Results from phase II randomized, placebo-controlled trials. *Arthritis Research & Therapy*, *15*(5), R132.

Curtis, J., Yang, S., Chen, L., Pope, J., Keystone, E., Haraoui, B., et al. (2015). Determining the minimally important difference in the clinical disease activity index for improvement and worsening in early rheumatoid arthritis patients. *Arthritis Care & Research*, *67*(10), 1345–1353.

Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. *Journal of Machine Learning Research*, *7*, 1079–1105.

Genovese, M. C., Durez, P., Richards, H. B., Supronik, J., Dokoupilova, E., Mazurov, V., et al. (2013). Efficacy and safety of secukinumab in patients with rheumatoid arthritis: A phase II, dose-finding, double-blind, randomised, placebo-controlled study. *Annals of the Rheumatic Diseases*, *72*(6), 863–869.

Grieve, A. P., & Krams, M. (2005). ASTIN: A Bayesian adaptive dose-response trial in acute stroke. *Clinical Trials*, *2*(4), 340–351.

Jamieson, K., Malloy, M., Nowak, R., & Bubeck, S. (2014). lil' UCB: An optimal exploration algorithm for multi-armed bandits. In *Proceedings of the 27th Conference on Learning Theory* (vol. 35, pp. 423–439).

Jun, K. S., Jamieson, K., Nowak, R., & Zhu, X. (2016). Top arm identification in multi-armed bandits with batch arm pulls. In *Proceedings of the 19th International Conference on Artificial Intelligence and Statistics* (pp. 139–148).

Kalyanakrishnan, S., Tewari, A., Auer, P., & Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In *Proceedings of the 29th International Conference on Machine Learning* (pp. 655–662).

Kaufmann, E., Cappé, O., & Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. *Journal of Machine Learning Research*, *17*(1), 1–42.

Kim, E. S., Herbst, R. S., Wistuba, I. I., Lee, J. J., Blumenschein, G. R., Tsao, A., et al. (2011). The BATTLE trial: Personalizing therapy for lung cancer. *Cancer Discovery*, *1*(1), 44–53.

Koenig, L. W., & Law, A. M. (1985). A procedure for selecting a subset of size m containing the l best of k independent normal populations, with applications to simulation. *Communications in Statistics–Simulation and Computation*, *14*(3), 719–734.

Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. *Advances in Applied Mathematics*, *6*(1), 4–22.

Liu, F., Walters, S. J., & Julious, S. A. (2017). Design considerations and analysis planning of a phase 2a proof of concept study in rheumatoid arthritis in the presence of possible non-monotonicity. *BMC Medical Research Methodology*, *17*(1), 149.

Locatelli, A., Gutzeit, M., & Carpentier, A. (2016). An optimal algorithm for the thresholding bandit problem. In *Proceedings of the 33rd International Conference on Machine Learning* (pp. 1690–1698).

Mukherjee, S., Naveen, K. P., Sudarsanam, N., & Ravindran, B. (2017). Thresholding bandits with augmented UCB. In *Proceedings of the 26th International Joint Conference on Artificial Intelligence* (pp. 2515–2521).

Schmidt, C., Branke, J., & Chick, S. E. (2006). Integrating techniques from statistical ranking into evolutionary algorithms. In *Applications of Evolutionary Computing* (pp. 752–763). Heidelberg: Springer.

Sutton, R. S., & Barto, A. G. (1998). *Introduction to Reinforcement Learning* (1st ed.). Cambridge: MIT Press.

Tang, L., Jiang, Y., Li, L., Zeng, C., & Li, T. (2015). Personalized recommendation via parameter-free contextual bandits. In *Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval* (pp. 323–332).

Zhou, Y., Chen, X., & Li, J. (2014). Optimal PAC multiple arm identification with applications to crowdsourcing. In *Proceedings of the 31st International Conference on Machine Learning* (pp. 217–225).

## Acknowledgements

JH acknowledges support by KAKENHI 16H00881. MS acknowledges support by KAKENHI 17H00757. This work was partially supported by JST CREST Grant Number JPMJCR1662, Japan.


## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor: Yung-Kyun Noh.

## Appendices

### Proof of Theorem 2

In this appendix we prove Theorem 2 based on the following lemma.

### Lemma 1

### Proof

For any \(i\in [m]\,,\)

For any \(i\in [K] \setminus [m]\,,\) the same argument holds. \(\square \)

### Proof of Theorem 2

We show that HDoC is \((\lambda ,\delta )\)-PAC for arbitrary \(\lambda \in [K]\).

First we consider the case in which there are at least \(\lambda \) good arms, that is, \(m\ge \lambda \), and show

Since we are now considering the case \(m\ge \lambda \), the event \(\{{\hat{m}}< \lambda \}\) implies that at least one good arm \(j\in [m]\) is regarded as a bad arm, that is, \(\{{\overline{\mu }}_{j,n} \le \xi \}\) occurs for some \(j\in [m]\) and \(n\in \mathbb {N}\). Thus we have

On the other hand, since the event \(\bigcup _{i\in \{{\hat{a}}_1, {\hat{a}}_2, \ldots , {\hat{a}}_\lambda \} }\{\mu _i< \xi \}\) implies that \(j \in \{{\hat{a}}_i\}_{i=1}^{\lambda }\) for some bad arm \(j\in [K]\setminus [m]\), we have

in the same way as (9). We obtain (8) by putting (9) and (10) together.

Next we consider the case in which the number of good arms *m* is less than \(\lambda \) and show

Since there are at most \(m<\lambda \) good arms, the event \(\{{\hat{m}}\ge \lambda \}\) implies that \(j \in \{{\hat{a}}_i\}_{i=1}^{\lambda }\) for some bad arm \(j\in [K]\setminus [m]\). Thus, in the same way as (10) we have

which proves (11). \(\square \)

### Proof of Theorem 3

In this appendix, we prove Theorem 3 based on the following lemmas, where we define \(T=K\max _{i\in [K]}\lfloor n_i + 2\rfloor \).

### Lemma 2

If \(n\ge n_i\) then

### Proof

We show (12) only for \(i\in [m]\); the proof of (13) for \(i\in [K]\setminus [m]\) is exactly the same. By Hoeffding’s inequality it suffices to show that for \(n\ge n_i\)

We write \(c=({\varDelta }_i-\epsilon )^2\le 1\) in the following for notational simplicity. Then we can express \(n\ge n_i\) as

for some \(t>\log \frac{5\sqrt{K/\delta }}{c}>\log (5\sqrt{2})> 1\). Then

We obtain the lemma since \(t>\log \frac{5\sqrt{K/\delta }}{c}\) satisfies (14). \(\square \)
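The deviation inequality underlying this lemma is Hoeffding's inequality: for *n* i.i.d. rewards in \([0,1]\) with mean \(\mu \), the sample mean \({\hat{\mu }}_n\) satisfies \(\mathbb {P}[{\hat{\mu }}_n \le \mu - \epsilon ] \le e^{-2n\epsilon ^2}\). As a quick numerical sanity check (an illustration only, not part of the proof; the function and parameter names are our own), one can compare the bound with a Monte Carlo estimate of the tail probability for Bernoulli rewards:

```python
import math
import random

def hoeffding_tail_check(mu, n, eps, trials=20_000, seed=1):
    """Estimate P[sample mean <= mu - eps] for Bernoulli(mu) rewards
    over `trials` independent runs, and return it together with the
    Hoeffding bound exp(-2 * n * eps^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < mu for _ in range(n)) / n
        if mean <= mu - eps:
            hits += 1
    empirical = hits / trials
    bound = math.exp(-2 * n * eps * eps)
    return empirical, bound

# For mu = 0.6, n = 50, eps = 0.1 the bound is e^{-1}, which comfortably
# dominates the empirical tail probability.
emp, bnd = hoeffding_tail_check(mu=0.6, n=50, eps=0.1)
assert emp <= bnd
```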

### Lemma 3

### Proof

If arm \(i \in [m]\) then

If arm \(i \in [K] \setminus [m]\) then

\(\square \)

### Lemma 4

### Proof

For the first term of (15) we have

Since the event \(\{a(t)=i,\,N_i(t)=n\}\) occurs for at most one \(t\in \mathbb {N}\) we have

By combining (16) and (17) with Lemma 3 we obtain

Next we consider the second term of (15). By using the same argument as (16) we obtain for \(i\notin [\lambda ]\) that

By taking the expectation we have

where (19) follows from Hoeffding’s inequality. We complete the proof by combining (18) with (20). \(\square \)

### Lemma 5

### Proof

Note that at the *t*-th round some arm has been pulled at least \(\lceil (t-1)/K\rceil \) times. Furthermore, \(N_i(t)\ge \lceil (t-1)/K\rceil \) implies that arm *i* was still in \({\mathcal {A}}(t)\) when it had been pulled \(\lceil (t-1)/K\rceil -1\) times. Thus we have

From the definition of \(T=K\max _{i\in [K]}\lfloor n_i + 2\rfloor \), we have \(\lceil (t-1)/K\rceil -1\ge n_i\) for all \(i\in [K]\) and \(t>T\). Thus, the expectation of (21) is bounded by Lemma 2 as

where (22) follows from

\(\square \)

### Lemma 6

### Proof

The summation is decomposed into

where \({\mathcal {A}}(t)=\{i\in [K]: {\underline{\mu }}_i(t) < \xi \le {\overline{\mu }}_i(t)\}\). From the definition \({\tilde{\mu }}^*(t)=\max _{i\in {\mathcal {A}}(t)}{\tilde{\mu }}_i(t)\), the first term of (23) is evaluated as

Let \(P_{i,n}(x)=\mathbb {P}[{\hat{\mu }}_{i,n}<x]\). Then the expectation of the inner summation of (24) is bounded by

Combining (24) with (25) we obtain

Next we bound the second term of (23). Note that \(\{t\le \tau _{\lambda },\,[\lambda ]\cap {\mathcal {A}}(t)= \emptyset \}\) implies that \({\overline{\mu }}_j(t')\le \xi \) occurred for some \(j\in [\lambda ]\) and \(t'<t\). Thus we have

where we used the same argument as (17) in (27). The expectation of (27) is bounded by Lemmas 1 and 3 as

We obtain the lemma by putting (23), (26) and (28) together. \(\square \)

### Proof of Theorem 3

The stopping time is decomposed into

Lemmas 4–6 bound the expectations of these terms, which completes the proof. \(\square \)

## Rights and permissions

**OpenAccess** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Cite this article

Kano, H., Honda, J., Sakamaki, K. *et al.* Good arm identification via bandit feedback.
*Mach Learn* **108**, 721–745 (2019). https://doi.org/10.1007/s10994-019-05784-4


### Keywords

- Thresholding bandits
- Multi-armed bandits
- Reinforcement learning
- Machine learning