Good Arm Identification via Bandit Feedback

We consider a novel stochastic multi-armed bandit problem called {\em good arm identification} (GAI), where a good arm is defined as an arm with expected reward greater than or equal to a given threshold. GAI is a pure-exploration problem in which a single agent repeatedly outputs an arm as soon as it is identified as a good one, before confirming that the remaining arms are actually not good. The objective of GAI is to minimize the number of samples for each such output. We find that GAI faces a new kind of dilemma, the {\em exploration-exploitation dilemma of confidence}, which is a difficulty distinct from that of the best arm identification. As a result, an efficient design of algorithms for GAI is quite different from that for the best arm identification. We derive a lower bound on the sample complexity of GAI that is tight up to the logarithmic factor $\mathrm{O}(\log \frac{1}{\delta})$ for acceptance error rate $\delta$, and we develop an algorithm whose sample complexity almost matches the lower bound. We also confirm experimentally that our proposed algorithm outperforms naive algorithms in synthetic settings based on a conventional bandit problem and on clinical trial research for rheumatoid arthritis.


Introduction
The stochastic multi-armed bandit (MAB) problem is one of the most fundamental problems for sequential decision-making under uncertainty (Sutton & Barto, 1998). It is regarded as a subfield of reinforcement learning in which an agent aims to acquire a policy to select the best-rewarding action via trial and error. In the stochastic MAB problem, a single agent repeatedly plays K slot machines called arms, where an arm generates a stochastic reward when pulled. At each round t, the agent pulls an arm i ∈ [K] = {1, 2, . . . , K} and receives a reward.
(Affiliations: 1 University of Tokyo, 2 RIKEN, 3 Johnson & Johnson, 4 Hokkaido University.)
One of the most classic MAB formulations is cumulative regret minimization (Lai & Robbins, 1985; Auer et al., 2002), where the agent tries to maximize the cumulative reward over a fixed number of trials. In this setting, the agent faces the exploration-exploitation dilemma of reward, where exploration means that the agent pulls an arm other than the currently best arm to find better arms, and exploitation means that the agent pulls the currently best arm to increase the cumulative reward. The related frameworks can be widely applied to various real-world problems such as clinical trials (Grieve & Krams, 2005; Genovese et al., 2013; Choy et al., 2013; Curtis et al., 2015; Liu et al., 2017) and personalized recommendations (Tang et al., 2015).
Another classic branch of the MAB problem is the best arm identification (Kaufmann et al., 2016; Kalyanakrishnan et al., 2012), a pure-exploration problem in which the agent tries to identify the best arm a * = arg max i∈{1,2,...,K} µ i . So far, the conceptual idea of the best arm identification has been successfully applied to many kinds of real-world problems (Koenig & Law, 1985; Schmidt et al., 2006; Zhou et al., 2014; Jun et al., 2016). Recently, the thresholding bandit problem was proposed (Locatelli et al., 2016) as a variant of pure-exploration MAB formulations. In the thresholding bandit problem, the agent tries to correctly partition all the K arms into good arms and bad arms, where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold ξ > 0, and a bad arm is defined as an arm whose expected reward is lower than the threshold ξ. In practice, however, neither correctly partitioning all the K arms nor exactly identifying the very best arm is always needed; rather, finding some reasonably good arms as fast as possible is often more useful.
Take a problem of personalized recommendations for example. The objective is to increase our profit by sending direct emails recommending personalized items. In this problem, timely recommendation is key, because the best sellers in the past are not necessarily the best sellers in the future. Three problems arise if this task is formulated as the best arm identification or the thresholding bandit problem. First, exploration costs can blow up when the purchase probabilities of the multiple best sellers are very close to each other. Although this problem can be partly relaxed by the ǫ-best arm identification (Even-Dar et al., 2006), in which an arm with expectation greater than or equal to max i∈[K] µ i − ǫ is also acceptable, the tolerance parameter ǫ has to be set very conservatively. Second, recommending even the best sellers is not a good idea if the "best" purchase probability is too small considering the advertising costs. Third, partitioning all items into good (or profitable) items and bad (or not profitable) items needlessly increases exploration costs if it is enough to find only some good items to increase our profit. For these reasons, formulating the personalized recommendation problem as the best arm identification or the thresholding bandit problem is not necessarily effective.
Similar troubles also occur in clinical trials for finding drugs (Kim et al., 2011) or for finding appropriate doses of a drug (Grieve & Krams, 2005;Genovese et al., 2013;Choy et al., 2013;Curtis et al., 2015;Liu et al., 2017), where the number of patients is extremely limited. In such a case, it is vitally important to find some drugs or doses with satisfactory effect as fast as possible rather than either to classify all drugs or doses into satisfactory ones and others or to identify the exactly best ones.
In this paper, we propose a new bandit framework named good arm identification (GAI), where a good arm is defined as an arm whose expected reward is greater than or equal to a given threshold. We formulate GAI as a pure-exploration problem in the fixed confidence setting, which is often considered in conventional pure-exploration problems. In the fixed confidence setting, an acceptance error rate δ is fixed in advance, and we minimize the number of arm-pulls needed to assure the correctness of the output with probability greater than or equal to 1 − δ. In GAI, a single agent repeats a process of outputting an arm as soon as the agent identifies it as a good one with error probability at most δ. If it is found that there remain no good arms, then the agent stops working. Although the agent does not face the exploration-exploitation dilemma of reward since GAI is a pure-exploration problem, the agent suffers from a new kind of dilemma, namely the exploration-exploitation dilemma of confidence, where exploration means that the agent pulls arms other than the currently best one, which may be easier to confirm to be good, and exploitation means that the agent pulls the currently best arm to increase the confidence in its goodness.
To address the dilemma of confidence, we propose a Hybrid algorithm for the Dilemma of Confidence (HDoC). The sampling strategy of HDoC is based on the upper confidence bound (UCB) algorithm for cumulative regret minimization (Auer et al., 2002), and the identification rule (that is, the criterion to output an arm as a good one) of HDoC is based on the lower confidence bound (LCB) for the best arm identification (Kalyanakrishnan et al., 2012). In addition, we show that a lower bound on the sample complexity for GAI is Ω(λ log(1/δ)), and HDoC can find λ good arms within O(λ log(1/δ) + (K − λ) log log(1/δ)) samples. This result suggests that HDoC is superior to naive algorithms based on conventional pure-exploration problems, because they require O(K log(1/δ)) samples. For the personalized recommendation problem, the GAI approach is more appropriate, because the agent can quickly identify good items: it focuses only on finding good items rather than identifying the best item (as in the best arm identification) and bad items (as in the thresholding bandit). Certainly, there exists a possibility that the recommended item does not possess the best purchase probability. However, that does not necessarily matter when customers' interests and item repositories undergo frequent changes, because identifying the exactly best item requires too many samples, and thus we cannot do that in practice. In addition, thanks to the absolute comparison rather than the relative comparison, exploration costs do not blow up even if the purchase probabilities are close to each other, and the agent can refrain from recommending items when the purchase probabilities are too small.
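To make the HDoC idea concrete, the following Python sketch combines a UCB-based sampling rule with an LCB-based identification rule. This is a minimal illustration, not the paper's exact specification: the confidence-interval constants, the single-pull burn-in, and the helper names (`pull`, `hdoc`) are assumptions for this sketch.

```python
import math

def hdoc(pull, K, xi, delta, max_rounds=100_000):
    """Sketch of the HDoC idea: UCB-based sampling + LCB-based identification.

    `pull(i)` returns a reward in [0, 1] for arm i. Constants are illustrative.
    Returns the arms output as good, in order of identification.
    """
    counts = [0] * K          # N_i(t): number of pulls of arm i
    sums = [0.0] * K          # running reward sums
    active = set(range(K))    # arms not yet identified as good or bad
    good = []                 # arms output as good

    # burn-in: pull each arm once so the empirical means are defined
    for i in range(K):
        sums[i] += pull(i)
        counts[i] += 1

    for t in range(K + 1, max_rounds + 1):
        if not active:
            break
        # exploitation-leaning UCB score: bonus shrinks like sqrt(log t / N)
        def ucb(i):
            return sums[i] / counts[i] + math.sqrt(math.log(t) / (2 * counts[i]))
        a = max(active, key=ucb)
        sums[a] += pull(a)
        counts[a] += 1
        # identification by a delta-dependent confidence bound (LCB/UCB test)
        n, mean = counts[a], sums[a] / counts[a]
        radius = math.sqrt(math.log(4 * K * n * n / delta) / (2 * n))
        if mean - radius >= xi:      # confidently good: output immediately
            good.append(a)
            active.discard(a)
        elif mean + radius < xi:     # confidently bad: discard silently
            active.discard(a)
    return good
```

With deterministic rewards 0.9 and 0.1 and threshold ξ = 0.5, the sketch outputs arm 0 as good after a few dozen pulls and silently discards arm 1, illustrating how an arm is output as soon as its LCB clears the threshold.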
Our contributions are fourfold. First, we formulate a novel pure-exploration problem called GAI and find that it involves a new kind of dilemma, the exploration-exploitation dilemma of confidence. Second, we derive a lower bound for GAI in the fixed confidence setting. Third, we propose the HDoC algorithm and show that an upper bound on the sample complexity of HDoC almost matches the lower bound. Fourth, we experimentally demonstrate that HDoC outperforms two naive algorithms derived from other pure-exploration problems in synthetic settings based on the thresholding bandit problem (Locatelli et al., 2016) and on clinical trial research for rheumatoid arthritis (Genovese et al., 2013; Choy et al., 2013; Curtis et al., 2015; Liu et al., 2017).

Good Arm Identification
In this section, we first formulate GAI as a pure exploration problem in the fixed confidence setting. Next, we derive a lower bound on the sample complexity for GAI. We give the notation list in Table 1.

Problem Formulation
Let K be the number of arms, ξ ∈ (0, 1) be a threshold and δ > 0 be an acceptance error rate. Each arm i ∈ [K] = {1, 2, . . . , K} is associated with a Bernoulli distribution ν i with mean µ i . The parameters {µ i } K i=1 are unknown to the agent. We define a good arm as an arm whose expected reward is greater than or equal to the threshold ξ. The number of good arms is denoted by m, which is unknown to the agent, and, without loss of generality, we assume an indexing of the arms such that µ 1 ≥ µ 2 ≥ · · · ≥ µ m ≥ ξ ≥ µ m+1 ≥ · · · ≥ µ K .
The agent is naturally unaware of this indexing. At each round t, the agent pulls an arm a(t) ∈ [K] and receives an i.i.d. reward drawn from distribution ν a(t) . The agent outputs an arm when it is identified as a good one, and repeats this process until there remain no good arms, where the stopping time is denoted by τ stop . To be more precise, the agent outputs â 1 , â 2 , . . . , â m̂ as good arms (which are different from each other) at rounds τ 1 , τ 2 , . . . , τ m̂ , respectively, where m̂ is the number of arms that the agent outputs as good ones. The agent stops working after outputting ⊥ (NULL) at round τ stop when the agent finds that there remain no good arms. If all arms are identified as good ones, then the agent stops after outputting â K and ⊥ together at the same round. For λ > m̂ we define τ λ = τ stop . Now, we introduce the definitions of (λ, δ)-PAC (Probably Approximately Correct) and δ-PAC.
Definition 1 ((λ, δ)-PAC). An algorithm satisfying the following conditions is called (λ, δ)-PAC: if there are at least λ good arms then P[{m̂ < λ} ∪ ⋃ i∈{â 1 ,â 2 ,...,â λ } {µ i < ξ}] ≤ δ, and if there are less than λ good arms then P[m̂ ≥ λ] ≤ δ.
Definition 2 (δ-PAC). An algorithm is called δ-PAC if it is (λ, δ)-PAC for all λ ∈ [K].
The agent aims to minimize {τ 1 , τ 2 , . . . , τ stop } simultaneously by a δ-PAC algorithm. On the other hand, the minimization of τ stop alone corresponds to the thresholding bandit problem if we consider the fixed confidence setting.
As we can easily see from these definitions, the condition for a (λ, δ)-PAC algorithm is weaker than that for a δ-PAC algorithm. Thus, there is a possibility that we can construct a good algorithm to minimize τ λ by using a (λ, δ)-PAC algorithm rather than a δ-PAC algorithm if a specific value of λ is considered. Nevertheless, we will show that the lower bound on τ λ for (λ, δ)-PAC algorithms can be achieved by a δ-PAC algorithm without knowledge of λ.

Lower Bound on the Sample Complexity
We give a lower bound on the sample complexity for GAI; the proof is given in Section 5.
Theorem 1. Under any (λ, δ)-PAC algorithm, if there are m ≥ λ good arms, then
E[τ λ ] ≥ ( ∑ λ i=1 1/d(µ i , ξ) ) log (1/(2.4δ)) ,   (1)
where d(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)) is the binary relative entropy, with the convention that d(0, 0) = d(1, 1) = 0.
Table 1. Notation.
µ i : Expected reward of arm i (unknown).
µ̂ i (t) : Empirical mean of the rewards of arm i by the end of round t.
µ̂ i,n : Empirical mean of the rewards when arm i has been pulled n times.
N i (t) : Number of times arm i has been pulled by the end of round t.
τ λ : Round at which the agent identifies λ good arms.
τ stop : Round at which the agent outputs ⊥ (NULL).
This lower bound on the sample complexity for GAI is given in terms of top-λ expectations {µ i } λ i=1 . In the next section we confirm that this lower bound is tight up to the logarithmic factor O(log 1 δ ).
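As a numerical companion to this discussion, the following Python sketch evaluates the binary relative entropy d(x, y) and a sample-complexity lower bound of the form (∑ over the top-λ arms of 1/d(µ i , ξ)) · log(1/(2.4δ)). The constant 2.4 inside the logarithm follows the change-of-measure bound of Kaufmann et al. (2016) and, like the function names, should be treated as an assumption of this sketch.

```python
import math

def binary_kl(x, y):
    """Binary relative entropy d(x, y), with d(0,0) = d(1,1) = 0."""
    if x in (0.0, 1.0) and x == y:
        return 0.0
    eps = 1e-12  # guard against log(0) at the boundary
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def gai_lower_bound(mus, xi, delta, lam):
    """Evaluate sum over the top-lam means of log(1/(2.4*delta)) / d(mu_i, xi)."""
    top = sorted(mus, reverse=True)[:lam]
    return sum(math.log(1 / (2.4 * delta)) / binary_kl(m, xi) for m in top)
```

Note that the bound grows with λ and shrinks as the top means move away from the threshold ξ, matching the intuition that arms far above the threshold are cheap to certify.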

Algorithms
In this section, we first consider naive algorithms based on other pure-exploration problems. Next, we propose an algorithm for GAI and bound its sample complexity from above. Pseudocode for all the algorithms is given in Algorithm 1. These algorithms can be decomposed into two components: a sampling strategy and an identification criterion. A sampling strategy is a policy to decide which arm the agent pulls. An identification criterion is a policy for the agent to decide whether arms are good or bad. All the algorithms adopt the same identification criterion of Lines 5-11 in Algorithm 1, which is based on the Lower Confidence Bound (LCB) for the best arm identification (Kalyanakrishnan et al., 2012). See Remark 3 at the end of Section 3.2 for other choices of identification criteria.

Naive Algorithms
We consider two naive algorithms: the Lower and Upper Confidence Bounds algorithm for GAI (LUCB-G), which is based on the LUCB algorithm for the best arm identification (Kalyanakrishnan et al., 2012), and the Anytime Parameter-free Thresholding algorithm for GAI (APT-G), which is based on the APT algorithm for the thresholding bandit problem (Locatelli et al., 2016). In both algorithms, the sampling strategy is the same as in the original algorithm. These algorithms sample all arms on the same order, O(log(1/δ)).
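The two sampling scores behind these naive algorithms can be sketched as follows in Python. The confidence-width constants are assumptions patterned after the delta-dependent bound used in the identification criterion, and the APT-G score follows the APT rule of pulling the arm whose distance to the threshold is least statistically resolved.

```python
import math

def lucb_g_bonus(n, K, delta):
    """LUCB-G exploration bonus: its width depends on delta, so every arm
    ends up sampled on the order of log(1/delta) times (illustrative constants)."""
    return math.sqrt(math.log(4 * K * n * n / delta) / (2 * n))

def apt_g_score(mean, n, xi):
    """APT-G pulls arg min of sqrt(n) * |mean - xi|: the arm whose
    good/bad status is least resolved given its samples so far."""
    return math.sqrt(n) * abs(mean - xi)
```

The bonus shrinks with the number of pulls n and grows as delta decreases, while the APT-G score is smallest for arms whose empirical means hug the threshold, which is exactly why APT-G struggles when an arm's mean equals ξ (see Section 4).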

Proposed Algorithm
We propose a Hybrid algorithm for the Dilemma of Confidence (HDoC). The sampling strategy of HDoC is based on the UCB score for cumulative regret minimization (Auer et al., 2002). As we will see later, the algorithm stops within t = O(log(1/δ)) rounds with high probability. Thus, the second term of the UCB score of HDoC in (2) is O(√(log log(1/δ)/N i (t))), whereas that of LUCB-G in (3) is O(√(log(1/δ)/N i (t))). Therefore, the HDoC algorithm pulls the currently best arm more frequently than LUCB-G, which means that HDoC puts more emphasis on exploitation than exploration.
The correctness of the output of the HDoC algorithm can be verified by the following theorem, whose proof is given in Appendix A.

Theorem 2. The HDoC algorithm is δ-PAC.
This theorem means that the HDoC algorithm outputs a bad arm with probability at most δ.
Next we give an upper bound on the sample complexity of HDoC in Theorem 3, whose proof is given in Appendix B; the bound is stated in terms of the gaps ∆ i = |µ i − ξ| and ∆ i,j = µ i − µ j .

Algorithm 1 HDoC / LUCB-G / APT-G
Input: threshold ξ, acceptance error rate δ. Initialization: pull each arm once and set A ← [K].
Sampling strategy (Lines 1-4):
HDoC: Pull arm â * = arg max i∈A µ̃ i (t) for µ̃ i (t) = µ̂ i (t) + √(log t / (2N i (t))).   (2)
LUCB-G: Pull arm â * = arg max i∈A µ̃ i (t) for µ̃ i (t) = µ̂ i (t) + √(log(4KN i (t)²/δ) / (2N i (t))).   (3)
APT-G: Pull arm â * = arg min i∈A β i (t) for β i (t) = √(N i (t)) |µ̂ i (t) − ξ|.
Identification criterion (Lines 5-11): if µ̂ â* (t) − √(log(4KN â* (t)²/δ) / (2N â* (t))) ≥ ξ, then output â * as a good arm and remove it from A; if µ̂ â* (t) + √(log(4KN â* (t)²/δ) / (2N â* (t))) < ξ, then remove â * from A as a bad arm; if A = ∅, then output ⊥ (NULL) and stop.

The following corollary is straightforward from this theorem.
Note that d(µ i , ξ) ≥ 2∆ i ² holds from Pinsker's inequality, and its coefficient two cannot be improved. Thus we see that the upper bound in (4) in Corollary 1 is almost optimal in view of the lower bound in Theorem 1 for sufficiently small δ. The authors believe that the coefficient 2∆ i ² can be improved to d(µ i , ξ) by the techniques in the KL-UCB algorithm (Kullback-Leibler UCB, Cappé et al., 2012) and the Thompson sampling algorithm (Agrawal & Goyal, 2012), although we use the sampling strategy based on the UCB algorithm (Auer et al., 2002) for simplicity of the analysis. Eq. (6) means that the sample complexity E[τ λ ] scales as O(λ log(1/δ) + (K − λ) log log(1/δ)) for moderately small δ, which is contrasted with the sample complexity O(K log(1/δ)) for the best arm identification (Kaufmann et al., 2016). Furthermore, we see from (5) and (7) that the HDoC algorithm reproduces the optimal sample complexity for the thresholding bandit (Locatelli et al., 2016).
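The relation between the coefficient 2∆ i ² and d(µ i , ξ) can be checked numerically. This small Python sketch verifies Pinsker's inequality d(µ, ξ) ≥ 2(µ − ξ)² and that the ratio of the two sides approaches one for means near ξ = 1/2, which is why the coefficient two cannot be improved.

```python
import math

def d(x, y):
    # binary relative entropy, valid for x, y strictly inside (0, 1)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

xi = 0.5
# Pinsker's inequality: d(mu, xi) >= 2 * (mu - xi)^2 for every gap
gaps = [0.2, 0.1, 0.01, 0.001]
ok = all(d(xi + g, xi) >= 2 * g ** 2 for g in gaps)
# Near xi = 1/2 the ratio d / (2 * gap^2) tends to 1, so the constant 2 is tight
ratio = d(xi + 1e-4, xi) / (2 * 1e-8)
```

The same computation with ξ away from 1/2 gives a ratio strictly above one, showing how the KL-based coefficient d(µ i , ξ) can beat 2∆ i ² there.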
Remark 1. We can easily extend GAI in the Bernoulli setting to GAI in a Gaussian setting with known variance σ². In the proofs of Theorems 2 and 3, we used the assumption of Bernoulli rewards only through Hoeffding's inequality, expressed as P[µ̂ i,n ≥ µ i + ǫ] ≤ e^{−2nǫ²}, where µ̂ i,n is the empirical mean of the rewards when arm i has been pulled n times. When each reward follows a Gaussian distribution with variance σ², the tail of the empirical mean is evaluated as P[µ̂ i,n ≥ µ i + ǫ] ≤ e^{−nǫ²/(2σ²)} by Cramér's inequality. By this replacement, the UCB score of HDoC becomes µ̃ i (t) = µ̂ i (t) + √(2σ² log t / N i (t)) and the score for identifying good arms becomes µ̂ i (t) − √(2σ² log(4KN i (t)²/δ) / N i (t)) ≥ ξ in the Gaussian setting with variance σ², while the score of APT-G in the Gaussian setting is the same as in the Bernoulli setting.
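The replacement in Remark 1 amounts to swapping the sub-Gaussian constant inside the confidence width. The following Python sketch contrasts the two UCB bonuses (the constants are illustrative assumptions); note that with σ² = 1/4, the sub-Gaussian parameter of any [0, 1]-valued reward, the two widths coincide exactly.

```python
import math

def ucb_bonus_bernoulli(n, t):
    # Hoeffding-based width for rewards bounded in [0, 1]
    return math.sqrt(math.log(t) / (2 * n))

def ucb_bonus_gaussian(n, t, sigma2):
    # Cramer/Chernoff-based width for Gaussian rewards with variance sigma2
    return math.sqrt(2 * sigma2 * math.log(t) / n)
```

For σ² > 1/4 the Gaussian width is strictly larger, reflecting the heavier tail of the empirical mean relative to a bounded reward.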
Remark 2. Theorem 2 and the evaluation of τ stop in Theorem 3 do not depend on the sampling strategy and only use the fact that the identification criterion is given by Lines 5-11 in Algorithm 1. Thus, these results still hold even if we use the LUCB-G and APT-G algorithms.
Remark 3. The evaluation of the error probability is based on the union bound over all rounds t ∈ N, and the identification criterion in Lines 5-11 in Algorithm 1 is designed for this evaluation. The use of the union bound does not worsen the asymptotic analysis for δ → 0 and we use this identification criterion to obtain a simple sample complexity bound. On the other hand, it is known that the empirical performance can be considerably improved by, for example, the bound based on the law of iterated logarithm in Jamieson et al. (2014) that can avoid the union bound. We can also use an identification criterion based on such a bound to improve empirical performance but this does not affect the result of relative comparison since we use the same identification criterion between algorithms with different sampling strategies.

Gap between Lower and Upper Bounds
As we can see from Theorem 3 and its proof, an arm i > λ (that is, an arm other than the top-λ ones) is pulled roughly O(log log(1/δ)/∆ λ,i ²) times until HDoC outputs λ good arms. On the other hand, the lower bound in Theorem 1 only considers the O(log(1/δ)) term and does not depend on arms i > λ. Therefore, when K − λ is very large, more specifically when K − λ = Ω(log(1/δ)/ log log(1/δ)), there still exists a gap between the lower bound in (1) and the upper bound in (6). Furthermore, the bound in (6) becomes meaningless when ∆ λ,λ+1 ≈ 0. In fact, the O(log log(1/δ)) term for small ∆ λ,λ+1 is not negligible in some cases, as we will see experimentally in Section 4.
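To see why the O(log log(1/δ)) term matters at practical error rates, a quick computation (just arithmetic on the orders appearing in the bounds, ignoring constants and gaps) compares log(1/δ) with log log(1/δ):

```python
import math

# For practical delta, log log(1/delta) is only a few times smaller than
# log(1/delta), so a (K - lambda) * log log(1/delta) term is far from
# negligible unless delta is astronomically small.
def order_ratio(delta):
    return math.log(1 / delta) / math.log(math.log(1 / delta))
```

At δ = 0.05 the ratio is below 3, while even at δ = 10⁻¹⁰⁰ it is only around 40, so the log log term shrinks very slowly relative to the leading term.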
To fill this gap, it is necessary to consider the following difference between cumulative regret minimization and GAI. Consider the case of pulling two good arms with the same expected rewards. In cumulative regret minimization, which of these two arms is pulled makes no difference in the reward, and, for example, it suffices to pull these two arms alternately. In GAI, on the other hand, the agent should output one of these good arms as fast as possible; hence, it is desirable to pull one of these equivalent arms with a biased frequency. However, biasing the numbers of samples between seemingly equivalent arms increases the risk of missing an actually better arm, and this dilemma is a difficulty specific to GAI. The proposed algorithm, HDoC, is not designed to cope with this difficulty, and improving the O(log log(1/δ)) term from this viewpoint is important future work.

Numerical Experiments
In this section we experimentally compare the performance of HDoC with that of LUCB-G and APT-G. In all experiments, each arm is pulled five times as burn-in and the results are the averages over 1,000 independent runs.

Threshold Settings
We consider three settings named Threshold 1-3, which are based on Experiments 1 and 2 in Locatelli et al. (2016) and Experiment 4 in Mukherjee et al. (2017).

Medical Settings
We also consider two medical settings of dose-finding in clinical trials as GAI. In general, the dose of a drug is quite important. Although high doses are usually more effective than low doses, low doses can be more effective than high doses because high doses often cause bad side effects. Therefore, it is desirable to list various doses of a drug with a satisfactory effect, which can be formulated as GAI. We consider two instances of the dose-finding problem based on Genovese et al. (2013) and Liu et al. (2017), named Medical 1 and 2, respectively, specified as follows. In both settings, the threshold ξ corresponds to the satisfactory effect.
Here, µ 1 , µ 2 , . . . , µ 7 represent the positive effect 1 of the placebo and the doses of GSK654321 at 0.03, 0.3, 10, 20 and 30 mg/kg, respectively, where GSK654321 (Liu et al., 2017) is a drug under development with a nonlinear dose-response, based on the real drug GSK315234 (Choy et al., 2013). The expected reward indicates the change from the baseline.
1 The original values (smaller than zero) in Liu et al. (2017) represent the negative effect and we inverted the sign to denote the positive effect.

Results
First we compare HDoC, LUCB-G and APT-G for acceptance error rates δ = 0.05, 0.005. Tables 2 and 3 show the averages and standard deviations of τ 1 , τ 2 , . . . , τ λ and τ stop for these algorithms. In most settings, HDoC outperforms LUCB-G and APT-G. In particular, the number of samples required by APT-G is very large compared to those required by HDoC or LUCB-G, while the stopping times of all algorithms are close, as discussed in Remark 2. The results verify that HDoC addresses GAI more efficiently than LUCB-G or APT-G.
In Medical 2, we can easily see that τ 3 , τ stop = +∞ with high probability, since the expected reward µ 5 is equal to the threshold ξ. Moreover, APT-G fails to work completely, since it prefers to pull the arm whose expected reward is closest to the threshold ξ and thus selects the arm with mean µ 5 almost all the time. In fact, Tables 2-3 show that APT-G cannot identify even one good arm within 100,000 arm-pulls, whereas HDoC and LUCB-G identify some good arms reasonably quickly even in such a case.
As shown in Tables 2-3, the performance of HDoC is almost the same as that of LUCB-G in Medical 2, where the expectations of the arms are very close to each other, taking the variance σ² into consideration. Figure 1 shows the result of an experiment to investigate the behavior of HDoC and LUCB-G for Medical 2 in more detail, where τ 1 , τ 2 are plotted for (possibly unrealistically) small δ. Here "Lower bound" in the figure is the asymptotic lower bound ∑ λ i=1 (2σ²/∆ i ²) log(1/δ) on τ λ for normal distributions (see Theorem 1 and Remark 1). Since the result of HDoC asymptotically approaches the lower bound, the O(log(1/δ)) term of the sample complexity of HDoC is almost optimal, and the results show that the effect of the O(log log(1/δ)) term is not negligible for practical acceptance error rates such as δ = 0.05 and 0.005.

Proof of Theorem 1
In this section, we prove Theorem 1 based on the following proposition on the expected number of samples needed to distinguish two sets of reward distributions.
Proposition 1 (Lemma 1 in Kaufmann et al., 2016). Let ν and ν′ be two bandit models with K arms such that for all i, the distributions ν i and ν′ i are mutually absolutely continuous. For any almost-surely finite stopping time σ and any event E,
∑ K i=1 E ν [N i (σ)] KL(ν i , ν′ i ) ≥ d(P ν [E], P ν′ [E]),
where KL(ν i , ν′ i ) is the Kullback-Leibler divergence between distributions ν i and ν′ i , and d(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)) is the binary relative entropy, with the convention that d(0, 0) = d(1, 1) = 0.
[Table 3. Averages and standard deviations of arm-pulls over 1000 independent runs in Threshold 1-3 and Medical 1-2 for δ = 0.005.]
Standard proofs on the best arm identification problems set E as an event such that P[E] ≥ 1 − δ under any δ-PAC algorithm. On the other hand, we leave P[E] to range from 0 to 1 and establish a lower bound as a minimization problem over P[E].
Here note that under any (λ, δ)-PAC algorithm. Thus we have where C * is the optimal value of the optimization problem

Good Arm Identification via Bandit Feedback
which is equivalent to a linear programming problem (P 2 ). The dual problem of (P 2 ) is denoted by (P′ 2 ). Here, consider the feasible solution of (P′ 2 ) given below, which attains the stated objective value. Since the objective value of a feasible solution for the dual problem (P′ 2 ) of (P 2 ) is always smaller than the optimal value C * of (P 2 ), we obtain the desired bound, and we complete the proof by letting ǫ ↓ 0.

Conclusion
In this paper, we considered and discussed a new multi-armed bandit problem called good arm identification (GAI). The objective of GAI is to minimize not only the total number of samples to identify all good arms but also the number of samples until identifying λ good arms for each λ = 1, 2, . . ., where a good arm is an arm whose expected reward is greater than or equal to the threshold ξ. Even though GAI is a pure-exploration problem and does not face the exploration-exploitation dilemma of reward, it encounters a new kind of dilemma: the exploration-exploitation dilemma of confidence. We derived a lower bound on the sample complexity of GAI, developed an efficient algorithm, HDoC, and theoretically showed that the sample complexity of HDoC almost matches the lower bound. We also experimentally demonstrated that HDoC outperforms algorithms based on other pure-exploration problems in three settings based on the thresholding bandit and two settings based on the dose-finding problem in clinical trials.

A. Proof of Theorem 2
In this appendix we prove Theorem 2 based on the following lemma. Lemma 1.
Next we consider the case where the number of good arms m is less than λ, and show (11). Since there are at most m < λ good arms, the event {m̂ ≥ λ} implies that some output arm j is not good. Thus, in the same way as (10), we have the stated inequality, which proves (11).

B. Proof of Theorem 3
In this appendix, we prove Theorem 3 based on the following lemmas, and we define T = K max i∈[K] ⌊n i ⌋.
Lemma 2. If n ≥ n i then (12) holds.
Proof. We only show (12). We write c = (∆ i − ǫ)² ≤ 1 in the following for notational simplicity. Then we can express n ≥ n i as n = (1/c) log(4t√(K/δ)/c) for some t > log(5√(K/δ)/c). We obtain the lemma since t > log(5√(K/δ)/c) satisfies (14).
Proof. For the first term of (15) we have (16). Since the event {a(t) = i, N i (t) = n} occurs for at most one t ∈ N, we have (17). By combining (16) and (17), we obtain the bound n i + 1/(2ǫ²).