Introduction

The stochastic multi-armed bandit (MAB) problem is one of the most fundamental problems for sequential decision-making under uncertainty (Sutton and Barto 1998). It is regarded as a subfield of reinforcement learning in which an agent aims to acquire a policy to select the best-rewarding action via trial and error. In the stochastic MAB problem, a single agent repeatedly plays K slot machines called arms, where an arm generates a stochastic reward when pulled. At each round t, the agent pulls arm \(i \in [K] = \{1,2,\ldots ,K\}\) and then observes an i.i.d. reward \(X_i(t)\) from distribution \(\nu _i\) with expectation \(\mu _i \in [0,1]\).

One of the most classic MAB formulations is the cumulative regret minimization (Lai and Robbins 1985; Auer et al. 2002), where the agent tries to maximize the cumulative reward over the fixed number of trials. In this setting, the agent faces the exploration-exploitation dilemma of reward, where the exploration means that the agent pulls seemingly suboptimal arms to discover the arm whose expected reward is largest, and the exploitation indicates that the agent pulls the currently best arm to increase the cumulative reward. The related frameworks can be widely applied to various real-world problems such as clinical trials (Grieve and Krams 2005; Genovese et al. 2013; Choy et al. 2013; Curtis et al. 2015; Liu et al. 2017) and personalized recommendations (Tang et al. 2015).

Another classic branch of the MAB problem is the best arm identification (Kaufmann et al. 2016; Kalyanakrishnan et al. 2012), a pure-exploration problem in which the agent tries to identify the best arm \(a^* = \mathrm {arg \, max}_{i \in \{1,2,\ldots ,K\}} \mu _i\). So far, the conceptual idea of the best arm identification has also been successfully applied to many kinds of real-world problems (Koenig and Law 1985; Schmidt et al. 2006; Zhou et al. 2014; Jun et al. 2016). Recently, the thresholding bandit problem was proposed (Locatelli et al. 2016) as a variant of pure-exploration MAB formulations. In the thresholding bandit problem, the agent tries to correctly partition all the K arms into good arms and bad arms, where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold \(\xi >0\), and a bad arm is defined as an arm whose expected reward is lower than the threshold \(\xi \). However, in practice, neither correctly partitioning all the K arms nor exactly identifying the very best arm is always needed; rather, finding some reasonably good arms as fast as possible is often more useful.

Take the problem of personalized recommendations as an example. The objective is to increase our profit by sending direct emails recommending personalized items. In this problem, timely recommendation is key, because the best sellers in the past are not necessarily the best sellers in the future. Three difficulties arise if this problem is formulated as the best arm identification or the thresholding bandit problem. First, exploration costs can blow up when the purchase probabilities of multiple best sellers are very close to each other. Although this difficulty can be partly relaxed by the \(\epsilon \)-best arm identification (Even-Dar et al. 2006), in which an arm with expectation greater than or equal to \(\max _{i\in [K]} \mu _i-\epsilon \) is also acceptable, the tolerance parameter \(\epsilon \) has to be set very conservatively. Second, recommending even the best sellers is not a good idea if the “best” purchase probability is too small considering the advertising costs. Third, partitioning all items into good (or profitable) items and bad (or not profitable) items needlessly increases exploration costs if it is enough to find only some good items to increase our profit. For the above reasons, the formulation of the personalized recommendation problem as the best arm identification or the thresholding bandit problem is not necessarily effective.

Similar troubles also occur in clinical trials for finding drugs (Kim et al. 2011) or for finding appropriate doses of a drug (Grieve and Krams 2005; Genovese et al. 2013; Choy et al. 2013; Curtis et al. 2015; Liu et al. 2017), where the number of patients is extremely limited. In such a case, it is vitally important to find some drugs or doses with satisfactory effect as fast as possible rather than either to classify all drugs or doses into satisfactory ones and others or to identify the exactly best ones.

In this paper, we propose a new bandit framework named good arm identification (GAI), where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold. We formulate GAI as a pure-exploration problem in the fixed confidence setting, which is often considered in conventional pure-exploration problems. In the fixed confidence setting, an acceptance error rate \(\delta \) is fixed in advance, and we minimize the number of arm pulls needed to assure the correctness of the output with probability greater than or equal to \(1-\delta \). In GAI, a single agent repeats a process of outputting an arm as soon as the agent identifies it as a good one with error probability at most \(\delta \). If it is found that there remain no good arms, then the agent stops working. Although the agent does not face the exploration-exploitation dilemma of reward since GAI is a pure-exploration problem, the agent suffers from a new kind of dilemma, that is, the exploration-exploitation dilemma of confidence. Here, exploration means that the agent pulls arms other than the currently best one in order to discover the arm that the agent can identify as a good one with the fewest arm pulls, and exploitation means that the agent pulls the currently best arm to increase the confidence in its goodness.

To address the dilemma of confidence, we propose a Hybrid algorithm for the Dilemma of Confidence (HDoC). The sampling strategy of HDoC is based on the upper confidence bound (UCB) algorithm for the cumulative regret minimization (Auer et al. 2002), and the identification rule (that is, the criterion to output an arm as a good one) of HDoC is based on the lower confidence bound (LCB) for the best arm identification (Kalyanakrishnan et al. 2012). In addition, we show that a lower bound on the sample complexity for GAI is \(\Omega (\lambda \log \frac{1}{\delta })\), and HDoC can find \(\lambda \) good arms within \(\mathrm {O}\left( \lambda \log \frac{1}{\delta } + (K-\lambda ) \log \log \frac{1}{\delta } \right) \) samples. This result suggests that HDoC is superior to naive algorithms based on conventional pure-exploration problems, because they require \(\mathrm {O}\left( K \log \frac{1}{\delta } \right) \) samples.

For the personalized recommendation problem, the GAI approach is more appropriate, because the agent can quickly identify good items: it focuses only on finding good items rather than identifying the best item (as in the best arm identification) or the bad items (as in the thresholding bandit). Certainly, there is a possibility that the recommended item does not have the best purchase probability. However, that does not necessarily matter when customers’ interests and item repositories undergo frequent changes, because identifying the exactly best item requires too many samples to be practical. In addition, because the comparison is absolute rather than relative, exploration costs do not blow up even if the purchase probabilities are close to each other, and the agent can refrain from recommending items when the purchase probabilities are too small.

Our contributions are fourfold. First, we formulate a novel pure-exploration problem called GAI and find that there is a new kind of dilemma, that is, the exploration-exploitation dilemma of confidence. Second, we derive a lower bound for GAI in the fixed confidence setting. Third, we propose the HDoC algorithm and show that an upper bound on the sample complexity of HDoC almost matches the lower bound. Fourth, we experimentally demonstrate that HDoC outperforms two naive algorithms derived from other pure-exploration problems in synthetic settings based on the thresholding bandit problem (Locatelli et al. 2016) and in settings based on clinical trial research for rheumatoid arthritis (Genovese et al. 2013; Choy et al. 2013; Curtis et al. 2015; Liu et al. 2017).

Table 1 Notation list

Good arm identification

In this section, we first formulate GAI as a pure exploration problem in the fixed confidence setting. Next, we derive a lower bound on the sample complexity for GAI. We give the notation list in Table 1.

Problem formulation

Let K be the number of arms, \(\xi \in (0,1)\) be a threshold and \(\delta >0\) be an acceptance error rate. Each arm \(i \in [K] = \{1,2,\ldots ,K\}\) is associated with Bernoulli distribution \(\nu _i\) with mean \(\mu _i\). The parameters \(\{\mu _i\}_{i=1}^{K}\) are unknown to the agent. We define a good arm as an arm whose expected reward is greater than or equal to threshold \(\xi \). The number of good arms is denoted by m which is unknown to the agent and, without loss of generality, we assume an indexing of the arms such that

$$\begin{aligned} \mu _1 \ge \mu _2 \ge \cdots \ge \mu _m \ge \xi \ge \mu _{m+1} \ge \cdots \ge \mu _K. \end{aligned}$$

The agent is naturally unaware of this indexing. At each round t, the agent pulls an arm \(a(t) \in [K]\) and receives an i.i.d. reward drawn from distribution \(\nu _{a(t)}\). The agent outputs an arm when it is identified as a good one. The agent repeats this process until there remain no good arms, where the stopping time is denoted by \(\tau _{\mathrm {stop}}\). To be more precise, the agent outputs \({\hat{a}}_1, {\hat{a}}_2, \ldots ,{\hat{a}}_{{\hat{m}}}\) as good arms (which are different from each other) at rounds \(\tau _1, \tau _2, \ldots ,\tau _{{\hat{m}}}\), respectively, where \({\hat{m}}\) is the number of arms that the agent outputs as good ones. The agent stops working after outputting \(\bot \) (NULL) at round \(\tau _{\mathrm {stop}}\) when the agent finds that there remain no good arms. If all arms are identified as good ones, then the agent stops after outputting \({\hat{a}}_K\) and \(\bot \) together at the same round. For \(\lambda >{\hat{m}}\) we define \(\tau _{\lambda }=\tau _{\mathrm {stop}}\). Now, we introduce the definitions of (\(\lambda \), \(\delta \))-PAC (Probably Approximately Correct) and \(\delta \)-PAC.

Definition 1

(\((\lambda , \delta )\)-PAC) An algorithm satisfying the following conditions is called (\(\lambda \), \(\delta \))-PAC: if there are at least \(\lambda \) good arms then \(\mathbb {P}[\{{\hat{m}}< \lambda \}\,\cup \,\bigcup _{i\in \{{\hat{a}}_1, {\hat{a}}_2, \ldots , {\hat{a}}_\lambda \} }\{\mu _i< \xi \}]\le \delta \), and if there are fewer than \(\lambda \) good arms then \(\mathbb {P}[ {\hat{m}}\ge \lambda ]\le \delta \).

Definition 2

(\(\delta \)-PAC) An algorithm is called \(\delta \)-PAC if the algorithm is \((\lambda , \delta )\)-PAC for all \(\lambda \in [K]\).
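Definitions 1 and 2 bound the probability of a failure event, and that event can be read as a predicate on a single run. The following is a minimal sketch of this predicate, not part of the original formulation: the function names and the representation of a run as a list `outputs` of arm indices in output order are hypothetical choices made for illustration.

```python
def lambda_pac_failure(outputs, means, xi, lam):
    """Return True if one run falls in the failure event of Definition 1.

    outputs: arms output as good, in output order (hypothetical representation)
    means:   true expected rewards, indexed by arm
    xi:      threshold; lam: the value of lambda being checked
    """
    num_good = sum(1 for mu in means if mu >= xi)
    if num_good >= lam:
        # Failure: fewer than lam outputs, or a bad arm among the first lam outputs.
        if len(outputs) < lam:
            return True
        return any(means[a] < xi for a in outputs[:lam])
    # Fewer than lam good arms exist: failure means outputting lam or more arms.
    return len(outputs) >= lam


def delta_pac_failures(outputs, means, xi):
    """List every lam in [K] for which this run is a failure (Definition 2)."""
    return [lam for lam in range(1, len(means) + 1)
            if lambda_pac_failure(outputs, means, xi, lam)]
```

An algorithm is \((\lambda , \delta )\)-PAC when this event has probability at most \(\delta \) over runs; the functions above only classify one realized run and say nothing about that probability.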

The agent aims to minimize \(\{ \tau _1, \tau _2, \ldots , \tau _{\mathrm {stop}} \}\) simultaneously by a \(\delta \)-PAC algorithm. On the other hand, the minimization of \(\tau _{\mathrm {stop}}\) corresponds to the thresholding bandit if we consider the fixed confidence setting.

As we can easily see from these definitions, the condition for a (\(\lambda ,\delta \))-PAC algorithm is weaker than that for a \(\delta \)-PAC algorithm. Thus, there is a possibility that we can construct a good algorithm to minimize \(\tau _{\lambda }\) by using a \((\lambda ,\delta )\)-PAC algorithm rather than a \(\delta \)-PAC algorithm if a specific value of \(\lambda \) is considered. Nevertheless, we will show that the lower bound on \(\tau _{\lambda }\) for \((\lambda ,\delta )\)-PAC algorithms can be achieved by a \(\delta \)-PAC algorithm without knowledge of \(\lambda \).

Lower bound on the sample complexity

We give a lower bound on the sample complexity for GAI. The proof is given in Sect. 5.

Theorem 1

Under any \((\lambda , \delta )\)-PAC algorithm, if there are \(m \ge \lambda \) good arms, then

$$\begin{aligned} \mathbb {E}[\tau _{\lambda }]&\ge \left( \sum _{i=1}^{\lambda }\frac{1}{d(\mu _i,\xi )}\log \frac{1}{2\delta } \right) - \frac{m}{d(\mu _{\lambda },\xi )}\,,\end{aligned}$$
(1)

where \(d(x,y) = x\log (x/y)+(1-x)\log ((1-x)/(1-y))\) is the binary relative entropy, with convention that \(d(0,0)=d(1,1)=0\).
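To make the bound in (1) concrete, the sketch below evaluates its right-hand side for given Bernoulli means. The function names are hypothetical, and the code assumes that the top-\(\lambda \) means are strictly above \(\xi \) so that every \(d(\mu _i,\xi )\) is positive.

```python
import math


def binary_kl(x, y):
    """Binary relative entropy d(x, y), with d(0, 0) = d(1, 1) = 0."""
    if x == y:
        return 0.0
    if y <= 0.0 or y >= 1.0:
        return math.inf
    total = 0.0
    if x > 0.0:
        total += x * math.log(x / y)
    if x < 1.0:
        total += (1.0 - x) * math.log((1.0 - x) / (1.0 - y))
    return total


def gai_lower_bound(means, xi, lam, delta):
    """Right-hand side of (1) for Bernoulli arms with the given means."""
    good = sorted((mu for mu in means if mu >= xi), reverse=True)
    m = len(good)
    if lam > m:
        raise ValueError("Theorem 1 requires at least lam good arms")
    top = good[:lam]
    if top[-1] == xi:
        raise ValueError("sketch assumes the top-lam means exceed xi strictly")
    main = sum(1.0 / binary_kl(mu, xi) for mu in top) * math.log(1.0 / (2.0 * delta))
    return main - m / binary_kl(top[-1], xi)
```

Because of the subtracted term \(m/d(\mu _{\lambda },\xi )\), the value can be negative, and hence vacuous, for moderate \(\delta \); it becomes informative as \(\delta \) decreases.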

This lower bound on the sample complexity for GAI is given in terms of top-\(\lambda \) expectations \(\{\mu _i\}_{i=1}^{\lambda }\). In the next section we confirm that this lower bound is tight up to the logarithmic factor \(\mathrm {O}(\log \frac{1}{\delta })\).

Algorithms

In this section, we first consider naive algorithms based on other pure-exploration problems. Next, we propose an algorithm for GAI and bound its sample complexity from above. Pseudocode for all the algorithms is given in Algorithm 1. These algorithms can be decomposed into two components: a sampling strategy and an identification criterion. A sampling strategy is a policy to decide which arm the agent pulls. An identification criterion is a policy for the agent to decide whether arms are good or bad. All the algorithms adopt the same identification criterion of Lines 5–11 in Algorithm 1, which is based on the Lower Confidence Bound (LCB) for the best arm identification (Kalyanakrishnan et al. 2012). See Remark 3 at the end of Sect. 3.2 for other choices of identification criteria.

Naive algorithms

We consider two naive algorithms: the Lower and Upper Confidence Bounds algorithm for GAI (LUCB-G), which is based on the LUCB algorithm for the best arm identification (Kalyanakrishnan et al. 2012) and the Anytime Parameter-free Thresholding algorithm for GAI (APT-G), which is based on the APT algorithm for the thresholding bandit problem (Locatelli et al. 2016). In both algorithms, the sampling strategy is the same as the original algorithms. These algorithms sample all arms at the same order \(\mathrm {O}\left( \log \frac{1}{\delta } \right) \).

Proposed algorithm

We propose a Hybrid algorithm for the Dilemma of Confidence (HDoC). The sampling strategy of HDoC is based on the UCB score of the cumulative regret minimization (Auer et al. 2002). As we will see later, the algorithm stops within \(t=\mathrm {O}(\log \frac{1}{\delta })\) rounds with high probability. Thus, the second term of the UCB score of HDoC in (2) is \(\mathrm {O}\left( \sqrt{ \frac{\log \log (1/\delta )}{N_i (t)} } \right) ,\) whereas that of LUCB-G in (3) is \(\mathrm {O}\left( \sqrt{ \frac{\log (1/\delta )}{N_i (t)} } \right) \). Therefore, the HDoC algorithm pulls the currently best arm more frequently than LUCB-G, which means that HDoC puts more emphasis on exploitation than exploration.

[Figure a: Algorithm 1, pseudocode of HDoC, LUCB-G and APT-G]

The correctness of the output of the HDoC algorithm can be verified by the following theorem, whose proof is given in Appendix A.

Theorem 2

The HDoC algorithm is \(\delta \)-PAC.

This theorem means that the HDoC algorithm outputs a bad arm with probability at most \(\delta \).

Next we give an upper bound on the sample complexity of HDoC. We bound the sample complexity in terms of \({\varDelta }_i=|\mu _i-\xi |\) and \({\varDelta }_{i,j}=\mu _i-\mu _j\).

Theorem 3

Assume that \({\varDelta }_{\lambda ,\lambda +1}>0\). Then, for any \(\lambda \le m\) and \(\epsilon <\min \{\min _{i\in [K]}{\varDelta }_i,\,{\varDelta }_{\lambda ,\lambda +1}/2\}\),

$$\begin{aligned} \mathbb {E}[\tau _{\lambda }]&\le \sum _{i\in [\lambda ]}n_i + \sum _{i\in [K]\setminus [\lambda ]} \left( \frac{\log (K\max _{j\in [K]}n_j )}{2({\varDelta }_{\lambda ,i} - 2\epsilon )^2} +\delta n_i\right) + \frac{K^{2-\frac{\epsilon ^2}{(\min _{i \in [K]} {\varDelta }_i-\epsilon )^2}}}{2\epsilon ^2}\\&\quad + \frac{K\left( 5+\log \frac{1}{2\epsilon ^2}\right) }{4\epsilon ^2}\,,\\ \mathbb {E}[\tau _{\mathrm {stop}}]&\le \sum _{i\in [K]}n_i+\frac{K}{2\epsilon ^2}\,,\end{aligned}$$

where

$$\begin{aligned} n_i=\frac{1}{({\varDelta }_i-\epsilon )^2} \log \left( \frac{4\sqrt{K/\delta }}{({\varDelta }_i-\epsilon )^2}\log \frac{5\sqrt{K/\delta }}{({\varDelta }_i-\epsilon )^2} \right) \,.\end{aligned}$$

We prove this theorem in Appendix B. The following corollary is straightforward from this theorem.

Corollary 1

Let \({\varDelta }=\min \{\min _{i\in [K]}{\varDelta }_i,\min _{\lambda \in [K-1]}{\varDelta }_{\lambda ,\lambda +1}/2\}\). Then, for any \(\lambda \le m\),

$$\begin{aligned} \limsup _{\delta \rightarrow 0}\frac{\mathbb {E}[\tau _{\lambda }]}{\log (1/\delta )}&\le \sum _{i\in [\lambda ]} \frac{1}{2{\varDelta }_i^2}\,, \end{aligned}$$
(4)
$$\begin{aligned} \limsup _{\delta \rightarrow 0}\frac{\mathbb {E}[\tau _{\mathrm {stop}}]}{\log (1/\delta )}&\le \sum _{i\in [K]} \frac{1}{2{\varDelta }_i^2}\,, \end{aligned}$$
(5)
$$\begin{aligned} \mathbb {E}[\tau _{\lambda }]&= \mathrm {O}\left( \frac{\lambda \log \frac{1}{\delta } +(K-\lambda )\log \log \frac{1}{\delta }+K\log \frac{K}{{\varDelta }}}{{\varDelta }^2} \right) \,, \end{aligned}$$
(6)
$$\begin{aligned} \mathbb {E}[\tau _{\mathrm {stop}}]&= \mathrm {O}\left( \frac{K \log (1/\delta )+K\log (K/{\varDelta }) }{{\varDelta }^2} \right) \,. \end{aligned}$$
(7)

Proof

Since

$$\begin{aligned} \limsup _{\delta \rightarrow 0}\frac{n_i}{\log (1/\delta )}=\frac{1}{2({\varDelta }_i-\epsilon )^2}\,,\end{aligned}$$

we obtain (4) and (5) by letting \(\epsilon \downarrow 0\). We obtain (6) and (7) by letting \(\epsilon ={\varDelta }/2\) in Theorem 3. \(\square \)
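The limit used in this proof can be checked numerically. The sketch below evaluates \(n_i\) from Theorem 3 and its ratio to \(\log (1/\delta )\); the function names are hypothetical, and the logarithms are expanded by hand only to avoid overflow for very small \(\delta \).

```python
import math


def n_i(gap, eps, num_arms, delta):
    """n_i from Theorem 3, with gap = Delta_i and 0 <= eps < gap."""
    c = (gap - eps) ** 2
    # log(sqrt(K / delta)), computed from logs so tiny delta does not overflow
    log_root = 0.5 * (math.log(num_arms) - math.log(delta))
    inner = math.log(5.0 / c) + log_root                   # log(5 sqrt(K/delta) / c)
    outer = math.log(4.0 / c) + log_root + math.log(inner)
    return outer / c


def ratio_to_log_inv_delta(gap, eps, num_arms, delta):
    """n_i / log(1/delta), which should tend to 1 / (2 (gap - eps)^2)."""
    return n_i(gap, eps, num_arms, delta) / math.log(1.0 / delta)
```

For \({\varDelta }_i-\epsilon =0.5\) the limit is 2. The ratio approaches it only slowly, since the remaining terms are of order \(\log \log (1/\delta )\) and \(\log K\).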

Note that \(d(\mu _i,\xi )\ge 2(\mu _i-\xi )^2=2{\varDelta }_i^2\) from Pinsker’s inequality and its coefficient two cannot be improved. Thus we see that the upper bound in (4) in Corollary 1 is almost optimal in view of the lower bound in Theorem 1 for sufficiently small \(\delta \). The authors believe that the coefficient \(2{\varDelta }_i^2\) can be improved to \(d(\mu _i,\xi )\) by the techniques in the KL-UCB (Kullback-Leibler-UCB) algorithm (Cappé et al. 2012) and the Thompson sampling algorithm (Agrawal and Goyal 2012), although we use the sampling strategy based on the UCB algorithm (Auer et al. 2002) for simplicity of the analysis. Eq. (6) means that the sample complexity of \(\mathbb {E}[\tau _{\lambda }]\) scales with \(\mathrm {O}(\lambda \log \frac{1}{\delta }+(K-\lambda ) \log \log \frac{1}{\delta })\) for moderately small \(\delta \), which is contrasted with the sample complexity \(\mathrm {O}(K \log \frac{1}{\delta })\) for the best arm identification (Kaufmann et al. 2016). Furthermore, we see from (5) and (7) that the HDoC algorithm reproduces the optimal sample complexity for the thresholding bandits (Locatelli et al. 2016).
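The comparison between the leading coefficients of the two bounds can be sketched numerically. The code below checks Pinsker's inequality on a grid and computes \(\sum _{i\le \lambda } 1/d(\mu _i,\xi )\) from Theorem 1 next to \(\sum _{i\le \lambda } 1/(2{\varDelta }_i^2)\) from (4). The function names are hypothetical and only arms strictly above \(\xi \) are considered.

```python
import math


def binary_kl(x, y):
    """Binary relative entropy d(x, y) for 0 <= x <= 1 and 0 < y < 1."""
    total = 0.0
    if x > 0.0:
        total += x * math.log(x / y)
    if x < 1.0:
        total += (1.0 - x) * math.log((1.0 - x) / (1.0 - y))
    return total


def leading_coefficients(means, xi, lam):
    """Return (lower, upper) coefficients of log(1/delta) for the top-lam arms."""
    top = sorted((mu for mu in means if mu > xi), reverse=True)[:lam]
    lower = sum(1.0 / binary_kl(mu, xi) for mu in top)
    upper = sum(1.0 / (2.0 * (mu - xi) ** 2) for mu in top)
    return lower, upper


def pinsker_holds_on_grid(step=0.05, tol=1e-12):
    """Check d(x, y) >= 2 (x - y)^2 on a grid of interior points."""
    n = int(round(1.0 / step))
    grid = [k * step for k in range(1, n)]
    return all(binary_kl(x, y) + tol >= 2.0 * (x - y) ** 2
               for x in grid for y in grid)
```

The two coefficients are close when the means are near \(\xi \) and \(\xi \) is near 1/2, and they separate when \(\mu _i\) or \(\xi \) is close to 0 or 1, which is where a KL-based score would be expected to help.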

Remark 1

We can easily extend GAI in a Bernoulli setting to GAI in a Gaussian setting with known variance \(\sigma ^2\). In the proofs of Theorems 2 and 3, we used the assumption of the Bernoulli reward only in Hoeffding’s inequality expressed as

$$\begin{aligned} \mathbb {P}[{\hat{\mu }}_{i,n} \le \mu _i - \epsilon ] \le \mathrm {e}^{-2n\epsilon ^2}\,,\end{aligned}$$

where \({\hat{\mu }}_{i,n}\) is the empirical mean of the rewards when arm i has been pulled n times. When each reward follows a Gaussian distribution with variance \(\sigma ^2\), the distribution of the empirical mean is evaluated as

$$\begin{aligned} \mathbb {P}[{\hat{\mu }}_{i,n} \le \mu _i - \epsilon ] \le \mathrm {e}^{-\frac{n\epsilon ^2}{2\sigma ^2}} \end{aligned}$$

by Cramér’s inequality. By this replacement the score of HDoC becomes \({\tilde{\mu }}_i (t) = {\hat{\mu }}_i (t) + \sqrt{\frac{2\sigma ^2 \log t}{N_i(t)}}\), the score of LUCB-G becomes \({\overline{\mu }}_i (t) = {\hat{\mu }}_i (t) + \sqrt{\frac{2\sigma ^2 \log (4KN_i^2(t)/\delta )}{N_i(t)}}\) and the score for identifying good arms becomes \({\underline{\mu }}_i (t) = {\hat{\mu }}_i (t) - \sqrt{\frac{2\sigma ^2 \log (4KN_i^2(t)/\delta )}{N_i(t)}}\) in a Gaussian setting given variance \(\sigma ^2\), while the score of APT-G in a Gaussian setting is the same as the score of APT-G in a Bernoulli setting.
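The scores in Remark 1 can be turned into a short simulation sketch. The score formulas below are the Gaussian ones stated in the remark; the control flow (pull each arm once, pull the arm with the largest score, output it if its lower score is at least \(\xi \), discard it if its upper score is below \(\xi \)) is an assumption about Lines 5–11 of Algorithm 1, which are not reproduced in the text. The names `run_hdoc`, `pull` and `max_rounds` are hypothetical. Choosing `sigma2 = 0.25` makes the Gaussian tail bound coincide with the Hoeffding bound above; whether that reproduces the Bernoulli scores (2) and (3) exactly is also an assumption.

```python
import math


def hdoc_score(mean_hat, n_pulls, t, sigma2):
    """HDoC sampling score from Remark 1: mean + sqrt(2 sigma^2 log t / N)."""
    return mean_hat + math.sqrt(2.0 * sigma2 * math.log(t) / n_pulls)


def confidence_radius(n_pulls, num_arms, delta, sigma2):
    """Half-width shared by the LUCB-G score and the identification score."""
    log_term = math.log(4.0 * num_arms * n_pulls ** 2 / delta)
    return math.sqrt(2.0 * sigma2 * log_term / n_pulls)


def run_hdoc(pull, num_arms, xi, delta, sigma2, max_rounds=100_000):
    """Return ([(arm, round) for each arm output as good], last round)."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    active = set(range(num_arms))
    good = []
    t = 0
    for i in range(num_arms):          # initial pull of every arm
        sums[i] += pull(i)
        counts[i] += 1
        t += 1
    while active and t < max_rounds:
        # Sampling strategy: largest HDoC score among the remaining arms.
        i = max(sorted(active),
                key=lambda a: hdoc_score(sums[a] / counts[a], counts[a], t, sigma2))
        sums[i] += pull(i)
        counts[i] += 1
        t += 1
        mean_hat = sums[i] / counts[i]
        radius = confidence_radius(counts[i], num_arms, delta, sigma2)
        if mean_hat - radius >= xi:    # assumed identification rule (LCB)
            good.append((i, t))
            active.remove(i)
        elif mean_hat + radius < xi:   # assumed elimination rule (UCB)
            active.remove(i)
    return good, t
```

For a stochastic run, `pull` could be something like `lambda i: random.gauss(means[i], math.sqrt(sigma2))` after `import random`. The `max_rounds` cap is there because an arm whose mean equals \(\xi \) may never be resolved.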

Remark 2

Theorem 2 and the evaluation of \(\tau _{\mathrm {stop}}\) in Theorem 3 do not depend on the sampling strategy and only use the fact that the identification criterion is given by Lines 5–11 in Algorithm 1. Thus, these results still hold even if we use the LUCB-G and APT-G algorithms.

Remark 3

The evaluation of the error probability is based on the union bound over all rounds \(t\in \mathbb {N}\), and the identification criterion in Lines 5–11 in Algorithm 1 is designed for this evaluation. The use of the union bound does not worsen the asymptotic analysis for \(\delta \rightarrow 0\) and we use this identification criterion to obtain a simple sample complexity bound. On the other hand, it is known that the empirical performance can be considerably improved by, for example, the bound based on the law of iterated logarithm in Jamieson et al. (2014) that can avoid the union bound. We can also use an identification criterion based on such a bound to improve empirical performance but this does not affect the result of relative comparison since we use the same identification criterion between algorithms with different sampling strategies.

Gap between lower and upper bounds

As we can see from Theorem 3 and its proof, an arm \(i>\lambda \) (that is, an arm other than the top-\(\lambda \) ones) is pulled roughly \(\mathrm {O}( \frac{\log \log (1/\delta )}{{\varDelta }_{\lambda ,i}^2})\) times until HDoC outputs \(\lambda \) good arms. On the other hand, the lower bound in Theorem 1 only considers the \(\mathrm {O}(\log \frac{1}{\delta })\) term and does not depend on arms \(i>\lambda \). Therefore, in the case where \((K-\lambda )\) is very large compared to \(\frac{1}{\delta }\) (more specifically, in the case of \(K-\lambda ={\Omega }\left( \frac{\log (1/\delta )}{\log \log (1/\delta )}\right) \)), there still exists a gap between the lower bound in (1) and the upper bound in (6). Furthermore, the bound in (6) becomes meaningless when \({\varDelta }_{\lambda ,\lambda +1}\approx 0\). In fact, the \(\mathrm {O}(\log \log \frac{1}{\delta })\) term for small \({\varDelta }_{\lambda ,\lambda +1}\) is not negligible in some cases, as we will see experimentally in Sect. 4.

To fill this gap, it is necessary to consider the following difference between the cumulative regret minimization and GAI. Let us consider the case of pulling two good arms with the same expected rewards. In the cumulative regret minimization, which of these two arms is pulled makes no difference in the reward and, for example, it suffices to pull these two arms alternately. On the other hand, in GAI the agent should output one of these good arms as fast as possible; hence, it is desirable to pull one of these equivalent arms with a biased frequency. However, the bias in the numbers of samples between seemingly equivalent arms increases the risk of missing an actually better arm, and this dilemma becomes a specific difficulty in GAI. The proposed algorithm, HDoC, is not designed to cope with this difficulty, and improving the \(\mathrm {O}(\log \log \frac{1}{\delta })\) term from this viewpoint is important future work.

Numerical experiments

In this section we experimentally compare the performance of HDoC with that of LUCB-G and APT-G. In all experiments, each arm is pulled five times as burn-in and the results are the averages over 1,000 independent runs.

Threshold settings

We consider three settings named Threshold 1–3, which are based on Experiments 1–2 in Locatelli et al. (2016) and Experiment 4 in Mukherjee et al. (2017).

Threshold 1 (Three group setting) Ten Bernoulli arms with mean \(\mu _{1:3} = 0.1\), \(\mu _{4:7} = 0.35 + 0.1 \cdot (0:3)\) and \(\mu _{8:10} = 0.9\), and threshold \(\xi = 0.5\), where (i : j) denotes \(\{i,i+1,i+2,\ldots ,j-1,j\}\).

Threshold 2 (Arithmetically progressive setting) Six Bernoulli arms with mean \(\mu _{1:6} = 0.1 \cdot (1:6)\) and threshold \(\xi = 0.35\).

Threshold 3 (Close-to-threshold setting) Ten Bernoulli arms with mean \(\mu _{1:3} = 0.55\) and \(\mu _{4:10} = 0.45\) and threshold \(\xi = 0.5\).
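The three settings above can be written out explicitly, which also makes the number of good arms in each setting easy to read off. This is a small sketch with hypothetical names; the arms are listed in the order given in the text, not sorted by mean.

```python
def threshold_settings():
    """Means and thresholds of Threshold 1-3 as described in the text."""
    return {
        "Threshold 1": {
            "means": [0.1] * 3 + [0.35 + 0.1 * k for k in range(4)] + [0.9] * 3,
            "xi": 0.5,
        },
        "Threshold 2": {
            "means": [0.1 * k for k in range(1, 7)],
            "xi": 0.35,
        },
        "Threshold 3": {
            "means": [0.55] * 3 + [0.45] * 7,
            "xi": 0.5,
        },
    }


def count_good(setting):
    """Number of arms whose mean is at least the threshold."""
    return sum(1 for mu in setting["means"] if mu >= setting["xi"])
```

Under these definitions Threshold 1 has five good arms, and Threshold 2 and Threshold 3 have three each.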

Table 2 Averages and standard deviations of arm-pulls over 1000 independent runs in Threshold 1–3 and Medical 1–2 for \(\delta =0.05\)
Table 3 Averages and standard deviations of arm-pulls over 1000 independent runs in Threshold 1–3 and Medical 1–2 for \(\delta =0.005\)
Fig. 1 Number-of-round plots of HDoC, LUCB-G and the lower bound for \(\log \frac{1}{\delta }=5,10,\ldots ,50\) in Medical 2

Medical settings

We also consider two medical settings of dose-finding in clinical trials as GAI. In general, the dose of a drug is quite important. Although high doses are usually more effective than low doses, low doses can be more effective than high doses because high doses often cause bad side effects. Therefore, it is desirable to list various doses of a drug with satisfactory effect, which can be formulated as GAI. We considered two instances of the dose-finding problem based on Genovese et al. (2013) and Liu et al. (2017) as Medical 1–2, respectively, specified as follows. In both settings, the threshold \(\xi \) corresponds to the satisfactory effect.

Medical 1 (Dose-finding of secukinumab for rheumatoid arthritis with satisfactory effect) Five Bernoulli arms with mean \(\mu _1=0.36\), \(\mu _2=0.34\), \(\mu _3=0.469\), \(\mu _4=0.465\), \(\mu _5=0.537\), and threshold \(\xi =0.5\).

Here, \(\mu _1, \mu _2,\ldots , \mu _5\) represent placebo, secukinumab 25mg, 75mg, 150mg and 300mg, respectively. The expected reward indicates American College of Rheumatology 20% Response (ACR20) at week 16 given in (Genovese et al. 2013, Table 2).

Medical 2 (Dose-finding of GSK654321 for rheumatoid arthritis with satisfactory effect) Seven Gaussian arms with mean \(\mu _1=0.5\), \(\mu _2=0.7\), \(\mu _3=1.6\), \(\mu _4=1.8\), \(\mu _5=1.2\), \(\mu _6=1.0\) and \(\mu _7=0.6\) with variance \(\sigma ^2=1.44\) and threshold \(\xi =1.2\).

Here, \(\mu _1, \mu _2,\ldots , \mu _7\) represent the positive effect of placebo and of GSK654321 at doses of 0.03, 0.3, 10, 20 and 30 mg/kg, respectively, where GSK654321 (Liu et al. 2017) is a drug under development with nonlinear dose-response, which is based on the real drug GSK315234 (Choy et al. 2013). The expected reward indicates change from the baseline in \({\varDelta }\) Disease Activity Score 28 (DAS28) given in (Liu et al. 2017, Profile 4). The threshold \(\xi = 1.2\) is based on Curtis et al. (2015).

Results

First we compare HDoC, LUCB-G and APT-G for acceptance error rates \(\delta = 0.05,\,0.005\). Tables 2 and 3 show the averages and standard deviations of \(\tau _1, \tau _2, \ldots , \tau _\lambda \) and \(\tau _{\mathrm {stop}}\) for these algorithms. In most settings, HDoC outperforms LUCB-G and APT-G. In particular, the number of samples required for APT-G is very large compared to those required for HDoC or LUCB-G, and the stopping times of all algorithms are close as discussed in Remark 2. The results verify that HDoC addresses GAI more efficiently than LUCB-G or APT-G.

In Medical 2, we can easily see \(\tau _3, \tau _{\mathrm {stop}} =+\infty \) with high probability since the expected reward \(\mu _5\) is equal to the threshold \(\xi \). Moreover, APT-G fails completely, since it prefers to pull an arm whose expected reward is closest to the threshold \(\xi \) and selects the arm with mean \(\mu _5\) almost all the time. In fact, Tables 2 and 3 show that APT-G cannot identify even one good arm within 100,000 arm-pulls, whereas HDoC and LUCB-G can identify some good arms reasonably even in such a case.

As shown in Tables 2 and 3, the performance of HDoC is almost the same as that of LUCB-G in Medical 2, where the expectations of the arms are very close to each other, taking the variance \(\sigma ^2\) into consideration. Figure 1 shows the result of an experiment to investigate the behavior of HDoC and LUCB-G for Medical 2 in more detail, where \(\tau _1,\tau _2\) are plotted for (possibly unrealistically) small \(\delta \). Here “Lower bound” in the figure is the asymptotic lower bound \(\sum _{i=1}^{\lambda }\frac{2\sigma ^2 \log (1/\delta )}{{\varDelta }_i^2}\) of \(\tau _{\lambda }\) for normal distributions (see Theorem 1 and Remark 1). Since the result of HDoC asymptotically approaches the lower bound, the \(\mathrm {O}(\log \frac{1}{\delta })\) term of the sample complexity of HDoC is almost optimal, and the results show that the effect of the \(\mathrm {O}(\log \log \frac{1}{\delta })\) term is not negligible for practical acceptance error rates such as \(\delta =0.05\) and 0.005.
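The asymptotic lower bound quoted above is simple enough to evaluate directly for Medical 2. The sketch below uses a hypothetical function name and only considers arms strictly above the threshold, so the arm with \(\mu _5=\xi \) is excluded because its gap is zero.

```python
def gaussian_asymptotic_lower_bound(means, xi, sigma2, lam, log_inv_delta):
    """Sum over the top-lam arms of 2 sigma^2 log(1/delta) / Delta_i^2."""
    gaps = sorted((mu - xi for mu in means if mu > xi), reverse=True)
    if lam > len(gaps):
        raise ValueError("fewer than lam arms lie strictly above the threshold")
    return sum(2.0 * sigma2 * log_inv_delta / g ** 2 for g in gaps[:lam])


# Medical 2 as specified in the text.
MEDICAL_2_MEANS = [0.5, 0.7, 1.6, 1.8, 1.2, 1.0, 0.6]
MEDICAL_2_XI = 1.2
MEDICAL_2_SIGMA2 = 1.44
```

With these values the two largest gaps are 0.6 and 0.4, so the bound is \(8\log (1/\delta )\) for \(\lambda =1\) and \(26\log (1/\delta )\) for \(\lambda =2\); at \(\log \frac{1}{\delta }=5\) this gives 40 and 130 rounds.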

Proof of Theorem 1

In this section, we prove Theorem 1 based on the following proposition on the expected number of samples to distinguish two sets of reward distributions.

Proposition 1

(Lemma 1 in Kaufmann et al. 2016) Let \(\nu \) and \(\nu '\) be two bandit models with K arms such that for all i, the distributions \(\nu _i\) and \(\nu _i'\) are mutually absolutely continuous. For any almost-surely finite stopping time \(\sigma \) and event \({\mathcal {E}}\),

$$\begin{aligned} \sum _{i=1}^K \mathbb {E}[N_i (\sigma )]\mathrm {KL}(\nu _i,\nu _i')\ge d(\mathbb {P}_\nu [{\mathcal {E}}],\mathbb {P}_{\nu '}[{\mathcal {E}}])\,,\end{aligned}$$

where \(\mathrm {KL}(\nu _i, \nu _j)\) is the Kullback-Leibler divergence between distributions \(\nu _i\) and \(\nu _j\), and \(d(x,y) = x\log (x/y)+(1-x)\log ((1-x)/(1-y))\) is the binary relative entropy, with convention that \(d(0,0)=d(1,1)=0\).

Standard proofs on the best arm identification problems set \({\mathcal {E}}\) as an event such that \(\mathbb {P}[{\mathcal {E}}]\ge 1-\delta \) under any \(\delta \)-PAC algorithm. On the other hand, we leave \(\mathbb {P}[{\mathcal {E}}]\) to range from 0 to 1 and establish a lower bound as a minimization problem over \(\mathbb {P}[{\mathcal {E}}]\).

Proof of Theorem 1

Fix \(j\in [m]\) and consider a set of Bernoulli distributions \(\{\nu _i'\}\) with expectations \(\{\mu _i'\}\) given by

$$\begin{aligned} \mu _i'= {\left\{ \begin{array}{ll} \xi -\epsilon \,,&{}\text{ if } i=j,\\ \mu _i \,,&{}\text{ if } i\in [K]\setminus \{j\}. \end{array}\right. } \end{aligned}$$

Let \({\mathcal {E}}_j=\{j\in \{ {\hat{a}}_i\}_{i=1}^{\min \{\lambda ,{\hat{m}}\} } \}\) and \(p_j=\mathbb {P}\left[ j\in \{{\hat{a}}_i\}_{i=1}^{\min \{\lambda ,{\hat{m}}\} } \right] \) under \(\{\nu _i\}\). Since j is not a good arm under \(\{\nu _i'\}\), we obtain from Proposition 1 that

$$\begin{aligned} \mathbb {E}[N_j]d_j&\ge d(p_j,\min \{\delta ,p_j\}) \\&=\max \biggl \{ p_j \log \frac{1}{\min \{\delta ,p_j\}}-h(p_j) +(1-p_j)\log \frac{1}{1-\min \{\delta ,p_j\}},0 \biggr \} \\&\ge \max \left\{ p_j\log \frac{1}{\min \{\delta ,p_j\}}-\log 2,\,0\right\} \\&\ge \max \left\{ p_j\log \frac{1}{\delta }-\log 2,\,0\right\} \,,\end{aligned}$$

where we set \(d_i=d(\mu _i,\xi -\epsilon )\) and \(h(p)=-p\log p-(1-p)\log (1-p)\le \log 2\) is the binary entropy function.

Here note that

$$\begin{aligned} \sum _{i=1}^m p_i&= \mathbb {E}_{\nu }\left[ | [m]\cap \{{\hat{a}}_i\}_{i=1}^{\min \{\lambda ,{\hat{m}}\}}|\right] \\&\ge \lambda \mathbb {P}_{\nu }[ \{\{{\hat{a}}_{i}\}_{i=1}^{\min \{\lambda ,{\hat{m}}\}}\subset [m]\},\,{\hat{m}}\ge \lambda ]\ge \lambda (1-\delta ) \end{aligned}$$

under any \((\lambda ,\delta )\)-PAC algorithm. Thus we have

$$\begin{aligned} \sum _{i=1}^K\mathbb {E}[N_i] \ge \sum _{i=1}^m\mathbb {E}[N_i]\ge C^*\,,\end{aligned}$$

where \(C^*\) is the optimal value of the optimization problem

$$\begin{aligned} (\mathrm {P_1}) \quad \mathrm {minimize\;}\sum _{i=1}^m \frac{1}{d_i}\max \left\{ p_i\log \frac{1}{\delta }-\log 2,\,0\right\} ,\, \text { subject to }&\sum _{i=1}^mp_i\ge \lambda (1-\delta )\,,\\&0\le p_i\le 1\,,\, \forall i\in [m], \end{aligned}$$

which is equivalent to the linear programming problem

$$\begin{aligned} (\mathrm {P_2})\quad \mathrm {minimize\;}\sum _{i=1}^m \frac{x_i}{d_i}, \quad \text { subject to }&\sum _{i=1}^mp_i\ge \lambda (1-\delta )\,,\\&x_i\ge p_i\log \frac{1}{\delta }-\log 2\,,\quad \forall i\in [m]\,,\\&0\le p_i\le 1\,,\quad x_i\ge 0\,,\quad \forall i\in [m]\,.\end{aligned}$$

The dual problem of \((\mathrm {P}_2)\) is given by

$$\begin{aligned} (\mathrm {P}_2')\quad \mathrm {maximize\;}&\lambda (1-\delta )\alpha -(\log 2)\sum _{i=1}^m\beta _i-\sum _{i=1}^m \gamma _i \\ \qquad \text { subject to }&\beta _i\le \frac{1}{d_i}\,,\quad \forall i\in [m]\,,\\&\alpha - \beta _i\log \frac{1}{\delta }-\gamma _i\le 0\,,\quad \forall i\in [m]\,,\\&\alpha ,\beta _i,\gamma _i\ge 0\,,\quad \forall i\in [m]\,.\end{aligned}$$

Here consider the feasible solution of \((\mathrm {P}_2')\) given by

$$\begin{aligned} \alpha&=\frac{1}{d_{\lambda }}\log \frac{1}{\delta }\,,\qquad \beta _i= {\left\{ \begin{array}{ll} \frac{1}{d_i},&{}i\le \lambda ,\\ \frac{1}{d_{\lambda }},&{}i> \lambda , \end{array}\right. } \qquad \gamma _i= {\left\{ \begin{array}{ll} \left( \frac{1}{d_{\lambda }}-\frac{1}{d_i}\right) \log \frac{1}{\delta },&{}i\le \lambda ,\\ 0,&{}i> \lambda , \end{array}\right. } \end{aligned}$$

which attains the objective function

$$\begin{aligned}&\frac{\lambda (1-\delta )}{d_{\lambda }}\log \frac{1}{\delta } -(\log 2)\left( \sum _{i\le \lambda }\frac{1}{d_i} +\frac{m-\lambda }{d_{\lambda }} \right) -\sum _{i\le \lambda } \left( \frac{1}{d_{\lambda }}-\frac{1}{d_i}\right) \log \frac{1}{\delta } \\&\quad = \sum _{i\le \lambda }\left( \frac{1}{d_i}\log \frac{1}{\delta }-\frac{\log 2}{d_i}\right) -\frac{\lambda \delta }{d_{\lambda }}\log \frac{1}{\delta } -\frac{(m-\lambda )\log 2}{d_{\lambda }} \\&\quad \ge \sum _{i\le \lambda }\frac{1}{d_i}\log \frac{1}{2\delta } -\frac{\lambda }{d_{\lambda }} -\frac{(m-\lambda )\log 2}{d_{\lambda }} \quad \text {by} \sup _{0<\delta \le 1} \delta \log (1/\delta ) = 1/\mathrm {e}<1 \\&\quad \ge \sum _{i\le \lambda }\frac{1}{d_i}\log \frac{1}{2\delta } -\frac{m}{d_{\lambda }} \,.\end{aligned}$$

Since the objective value of any feasible solution for the dual problem \((\mathrm {P}_2')\) of \((\mathrm {P}_2)\) never exceeds the optimal value \(C^*\) of \((\mathrm {P}_2)\) (weak duality), we have

$$\begin{aligned} \sum _{i=1}^m\mathbb {E}[N_i]&\ge C^* \\&\ge \sum _{i\le \lambda }\frac{1}{d_i}\log \frac{1}{2\delta } -\frac{m}{d_{\lambda }} \\&= \sum _{i\le \lambda }\frac{1}{d(\mu _i,\xi -\epsilon )}\log \frac{1}{2\delta } -\frac{m}{d(\mu _{\lambda },\xi -\epsilon )}\,.\end{aligned}$$

We complete the proof by letting \(\epsilon \downarrow 0\). \(\square \)
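The duality step of this proof can be sanity-checked numerically. The sketch below builds the dual solution given above, verifies its feasibility for \((\mathrm {P}_2')\), and compares its objective with the final bound and with the objective of one feasible primal point of \((\mathrm {P}_2)\) (namely \(p_i=1\) for \(i\le \lambda \) and \(p_i=0\) otherwise). All function names are hypothetical, and `d` is assumed to be sorted in the paper's indexing, so that it is non-increasing.

```python
import math


def dual_solution(d, lam, delta):
    """(alpha, beta, gamma) from the proof; d[i-1] plays the role of d_i."""
    big_l = math.log(1.0 / delta)
    d_lam = d[lam - 1]
    alpha = big_l / d_lam
    beta = [1.0 / d[i] if i < lam else 1.0 / d_lam for i in range(len(d))]
    gamma = [(1.0 / d_lam - 1.0 / d[i]) * big_l if i < lam else 0.0
             for i in range(len(d))]
    return alpha, beta, gamma


def is_dual_feasible(d, lam, delta, tol=1e-9):
    """Check the constraints of (P2') up to a floating-point tolerance."""
    alpha, beta, gamma = dual_solution(d, lam, delta)
    big_l = math.log(1.0 / delta)
    return all(b <= 1.0 / di + tol and alpha - b * big_l - g <= tol
               and b >= 0.0 and g >= -tol
               for di, b, g in zip(d, beta, gamma)) and alpha >= 0.0


def dual_objective(d, lam, delta):
    alpha, beta, gamma = dual_solution(d, lam, delta)
    return lam * (1.0 - delta) * alpha - math.log(2.0) * sum(beta) - sum(gamma)


def primal_objective_at_indicator(d, lam, delta):
    """Objective of (P2) at p_i = 1 for i <= lam, with the smallest feasible x_i."""
    x = max(math.log(1.0 / delta) - math.log(2.0), 0.0)
    return sum(x / d[i] for i in range(lam))


def final_bound(d, lam, delta):
    """Last line of the chain: sum_{i<=lam} log(1/(2 delta))/d_i - m/d_lam."""
    head = sum(1.0 / d[i] for i in range(lam)) * math.log(1.0 / (2.0 * delta))
    return head - len(d) / d[lam - 1]
```

For any valid input the ordering `final_bound <= dual_objective <= primal_objective_at_indicator` should hold, the second inequality being weak duality.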

Conclusion

In this paper, we considered and discussed a new multi-armed bandit problem called good arm identification (GAI). The objective of GAI is to minimize not only the total number of samples to identify all good arms but also the number of samples until identifying \(\lambda \) good arms for each \(\lambda =1,2,\ldots \), where a good arm is an arm whose expected reward is greater than or equal to threshold \(\xi \). Even though GAI, which is a pure-exploration problem, does not face the exploration-exploitation dilemma of reward, GAI encounters a new kind of dilemma: the exploration-exploitation dilemma of confidence. We derived a lower bound on the sample complexity of GAI, developed an efficient algorithm, HDoC, and then we theoretically showed the sample complexity of HDoC almost matches the lower bound. We also experimentally demonstrated that HDoC outperforms algorithms based on other pure-exploration problems in the three settings based on the thresholding bandit and two settings based on the dose-finding problem in the clinical trials.