
Multi-armed bandits with dependent arms

  • Published in: Machine Learning

Abstract

We study a variant of the multi-armed bandit problem (MABP) which we call MABs with dependent arms. Multiple arms are grouped together to form a cluster, and the reward distributions of arms in the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm i reveals information not only about its own reward distribution, but also about all arms belonging to the same cluster. This “correlation” among the arms complicates the exploration–exploitation trade-off encountered in the MABP, because the observation dependencies allow us to simultaneously test multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the principle of optimism in the face of uncertainty (Lattimore and Szepesvári in Bandit algorithms, Cambridge University Press, 2020), which know the clusters and hence utilize these additional side observations appropriately while performing the exploration–exploitation trade-off. We show that the regret of our algorithms grows as \(O(K\log T)\), where K is the number of clusters. In contrast, for an algorithm such as vanilla UCB that does not utilize these dependencies, the regret scales as \(O(M\log T)\), where M is the number of arms. When \(K\ll M\), i.e., when there is a high degree of dependence among the arms, our proposed algorithm drastically reduces the dependence of the regret on the number of arms.


Code Availability

The code is available at the following link: https://github.com/fangliu0302/ClusterBandit
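Independently of the repository above, the core idea of the paper—pooling observations within a cluster so that every pull refines a single optimistic estimate of the cluster parameter—can be sketched in a few lines. The following is our own minimal illustration for one cluster of Gaussian arms with linear mean maps \(\mu_i(\theta) = a_i\theta\); the function names, the least-squares estimator, and the confidence radius are simplifying assumptions, not the authors' UCB-D implementation.

```python
import math
import random

def cluster_ucb(slopes, theta_star, horizon, sigma=0.1, seed=0):
    """Sketch of a cluster-aware UCB for one cluster: arm i has mean reward
    slopes[i] * theta for a shared unknown theta in [0, 1], so every pull
    (of any arm) refines a single pooled estimate of theta."""
    rng = random.Random(seed)
    sa2, sar, n = 0.0, 0.0, 0          # pooled sufficient statistics for theta
    pulls = [0] * len(slopes)
    for t in range(1, horizon + 1):
        if n == 0:
            i = 0                       # initialise with an arbitrary pull
        else:
            theta_hat = sar / sa2       # pooled least-squares estimate
            radius = math.sqrt(2 * sigma ** 2 * math.log(t + 1) / n)
            lo = max(theta_hat - radius, 0.0)
            hi = min(theta_hat + radius, 1.0)
            # optimism in the face of uncertainty: evaluate each arm at its
            # most favourable plausible theta in the confidence interval
            ucbs = [max(a * lo, a * hi) for a in slopes]
            i = ucbs.index(max(ucbs))
        r = rng.gauss(slopes[i] * theta_star, sigma)
        sa2 += slopes[i] ** 2
        sar += slopes[i] * r
        n += 1
        pulls[i] += 1
    return pulls
```

On a single cluster with slopes `[0.2, 0.5, 1.0]` and \(\theta^{\star}=0.8\), the pooled estimate concentrates after a handful of pulls, so the best arm receives almost every play; a vanilla UCB would instead have to explore each arm separately, which is the source of the \(O(M\log T)\) versus \(O(K\log T)\) gap discussed in the abstract.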

Notes

  1. instance-dependent regret.

  2. The relative gap between the lower bound and regret of UCB-D vanishes as \(K\rightarrow \infty\).

  3. See Appendix 2 for more details.

References

  • Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In: Advances in Neural Information Processing Systems, (pp. 2312–2320)

  • Akshay D. Kamath, S.G. (2016). CS 395T: Sublinear algorithms, lecture notes. https://www.cs.utexas.edu/~ecprice/courses/sublinear/notes/lec12.pdf

  • Atan, O., Tekin, C., & Schaar, M. (2015). Global multi-armed bandits with Hölder continuity. In: Artificial Intelligence and Statistics, (pp. 28–36)

  • Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397–422.


  • Awerbuch, B., & Kleinberg, R. (2008). Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1), 97–114.


  • Ayoub, R. (1974). Euler and the zeta function. The American Mathematical Monthly, 81(10), 1067–1086.


  • Berry, D.A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments (monographs on statistics and applied probability). (vol. 5(71-87), pp. 7–7). Chapman and Hall.

  • Binette, O. (2019). A note on reverse pinsker inequalities. IEEE Transactions on Information Theory, 65(7), 4094–4096. https://doi.org/10.1109/TIT.2019.2896192


  • Bouneffouf, D., Parthasarathy, S., Samulowitz, H., & Wistuba, M. (2019). Optimal exploitation of clustering and history information in multi-armed bandit. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, (pp. 2016–2022). https://doi.org/10.24963/ijcai.2019/279

  • Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.


  • Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721

  • Buccapatnam, S., Eryilmaz, A., & Shroff, N.B. (2014). Stochastic bandits with side observations on networks. In: The 2014 ACM international conference on Measurement and modeling of computer systems, (pp. 289–300)

  • Carlsson, E., Dubhashi, D., & Johansson, F.D. (2021). Thompson sampling for bandits with clustered arms. In: Zhou, Z.H. (ed.) Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI-21, main track, International Joint Conferences on Artificial Intelligence Organization, (pp. 2212–2218). https://doi.org/10.24963/ijcai.2021/305

  • Caron, S., Kveton, B., Lelarge, M., & Bhagat, S. (2012). Leveraging side observations in stochastic bandits. arXiv preprint arXiv:1210.4839

  • Cesa-Bianchi, N., Gentile, C., & Zappella, G. (2013). A gang of bandits. Advances in Neural Information Processing Systems 26

  • Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, (pp. 208–214).

  • Combes, R., Magureanu, S., & Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In: Advances in Neural Information Processing Systems, (pp. 1763–1771)

  • Cover, T. M. (1999). Elements of information theory. John Wiley & Sons.


  • Gai, Y., Krishnamachari, B., & Jain, R. (2012). Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5), 1466–1478.


  • Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Proceedings of the 24th annual conference on learning theory, (pp. 359–376).

  • Gentile, C., Li, S., & Zappella, G. (2014). Online clustering of bandits. In: International conference on machine learning, PMLR, pp 757–765

  • Gentile, C., Li, S., Kar, P., Karatzoglou, A., Zappella, G., & Etrue, E. (2017). On context-dependent clustering of bandits. In: International conference on machine learning, PMLR, (pp. 1253–1262).

  • Gittins, J., Glazebrook, K., & Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.


  • Götze, F., Sambale, H., & Sinulis, A. (2019). Higher order concentration for functions of weakly dependent random variables

  • Gupta, S., Joshi, G., & Yagan, O. (2018). Exploiting correlation in finite-armed structured bandits. arXiv preprint arXiv:1810.08164

  • Gupta, S., Joshi, G., & Yağan, O. (2020). Correlated multi-armed bandits with a latent random source. In: ICASSP 2020 - IEEE international conference on acoustics, speech and signal processing (ICASSP), (pp. 3572–3576). IEEE.

  • Kakade, S., & Tewari, A. (2008). Cmsc 35900 (spring 2008) learning theory, lecture notes: Massart’s finite class lemma and growth function. https://ttic.uchicago.edu/~tewari/lectures/lecture10.pdf

  • Kontorovich, A. (2014). Concentration in unbounded metric spaces and algorithmic stability. In: International conference on machine learning, (pp. 28–36)

  • Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.


  • Langford, J., & Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In: Advances in neural information processing systems, (pp. 817–824).

  • Lattimore, T., & Munos, R. (2014). Bounded regret for finite-armed structured bandits. In: Advances in neural information processing systems, (pp. 550–558).

  • Lattimore, T., & Szepesvari, C. (2017). The end of optimism? an asymptotic analysis of finite-armed linear bandits. In: Artificial intelligence and statistics, PMLR, (pp. 728–737)

  • Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.


  • Ledoux, M., & Talagrand, M. (2013). Probability in Banach spaces: Isoperimetry and processes. Springer Science & Business Media.

  • Li, L., Chu, W., Langford, J., & Schapire, R.E. (2010). A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on World Wide Web, (pp. 661–670).

  • Mannor, S., & Shamir, O. (2011). From bandits to experts: On the value of side-observations. In: Advances in neural information processing systems, (pp. 684–692)

  • Miao, Y. (2010). Concentration inequality of maximum likelihood estimator. Applied Mathematics Letters, 23(10), 1305–1309.


  • Pandey, S., Chakrabarti, D., & Agarwal, D. (2007). Multi-armed bandit problems with dependent arms. In: Proceedings of the 24th international conference on machine learning, (pp. 721–728).

  • Resnick, S. (2019). A probability path. Springer.


  • Rudin, W. (2006). Real and complex analysis. Tata McGraw-Hill Education.

  • Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2), 395–411.


  • Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.


  • Vaswani, S., Schmidt, M., & Lakshmanan, L. (2017). Horde of Bandits using Gaussian Markov Random Fields. In: Singh A, Zhu J (eds) Proceedings of the 20th international conference on artificial intelligence and statistics, PMLR, proceedings of machine learning research, (vol 54, pp. 690–699). https://proceedings.mlr.press/v54/vaswani17a.html

  • Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press.


  • Wang, Z., Zhou, R., & Shen, C. (2018a). Regional multi-armed bandits. In: International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, PMLR, Proceedings of Machine Learning Research, (vol. 84, pp. 510–518)

  • Wang, Z., Zhou, R., & Shen, C. (2018b). Regional multi-armed bandits with partial informativeness. IEEE Transactions on Signal Processing, 66(21), 5705–5717.

  • Yang, X., Liu, X., & Wei, H. (2022). Concentration inequalities of MLE and robust MLE. arXiv preprint arXiv:2210.09398

  • Yang, Y. (2016). ECE 598: Information-theoretic methods in high-dimensional statistics, lecture notes. http://www.stat.yale.edu/~yw562/teaching/598/lec14.pdf


Funding

Rahul Singh’s research was partially funded by the Science and Engineering Research Board under the project SRG/2021/002308. Ness Shroff was partially funded by the National Science Foundation under the projects CNS-1901057, CNS-2007231, CNS-1618520, and CNS-1409336.

Author information

Authors and Affiliations

Authors

Contributions

RS and FL contributed to the theoretical analysis, algorithmic formulation, and simulations. YS and NS supervised the development of the research and provided feedback at all stages of the process, through the final draft of the manuscript.

Corresponding author

Correspondence to Rahul Singh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Consent for publication

The authors of this manuscript consent to its publication.

Additional information

Editor: Hendrik Blockeel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Proof of Theorem 2 (concentration of \({\hat{\theta }}(n)\))

Throughout this proof, we drop the subscript \({\mathcal {C}}\), since the discussion concerns a single fixed cluster \({\mathcal {C}}\). Let \({\mathcal {S}}_1:= \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\) denote the set of rewards obtained by n pulls of arms in \({\mathcal {C}}\). Consider the function \(\xi\) defined as follows,

$$\begin{aligned} \xi ( \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}} ) := \sup _{\theta \in \varTheta } \Bigg | L(\theta ) -\frac{D( \theta ^{\star }|| \theta ) }{n} \Bigg |. \end{aligned}$$
(45)

We begin by deriving a few preliminary results that will be utilized while proving the main result.

Lemma 2

The function \(\xi\) is a Lipschitz continuous function of the rewards obtained, i.e., for two sample-paths \(\omega _1,\omega _2\) we have that,

$$\begin{aligned} | \xi (\omega _1) - \xi (\omega _2) | \le L_p \Vert {\mathcal {S}}_1(\omega _1) - {\mathcal {S}}_1(\omega _2) \Vert , \end{aligned}$$
(46)

where \(L_p>0\).

Proof

From Assumption 2 we have that the log-likelihood ratio \(\frac{f_i(r,\theta ^{\star }) }{f_i(r,\theta )}\) is a Lipschitz continuous function of \(\theta\). The proof then follows since Lipschitz continuity is preserved upon averaging, and also when two Lipschitz continuous functions are composed. \(\square\)
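For completeness, the two preservation facts invoked here amount to the following standard check (added by us; not spelled out in the original): if \(g\) is \(L_g\)-Lipschitz and \(h\) is \(L_h\)-Lipschitz, then

$$\begin{aligned} | g(h(x)) - g(h(y)) | \le L_g | h(x) - h(y) | \le L_g L_h \Vert x - y \Vert , \end{aligned}$$

and if each \(g_i\) is \(L\)-Lipschitz, then \(\frac{1}{n}\sum _{i} g_i\) is again \(L\)-Lipschitz by the triangle inequality. Since a pointwise supremum of \(L\)-Lipschitz functions is also \(L\)-Lipschitz, \(\xi\) inherits a finite Lipschitz constant \(L_p\).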

We now derive an upper-bound on the expectation of \(\xi\).

Lemma 3

We have

$$\begin{aligned} {\mathbb {E}}(\xi ) \le \frac{L_f \cdot \text{ diam }(\varTheta ) \sqrt{\pi }}{\sqrt{n}}, \end{aligned}$$

\(L_f\) is as in (9).

Proof

Let \({\mathcal {S}}_2:= \{{\tilde{r}}_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\) be an independent copy of \({\mathcal {S}}_1= \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\). We then have that

$$\begin{aligned} {\mathbb {E}}(\xi )&= {\mathbb {E}}_{{\mathcal {S}}_1} \sup _{\theta \in \varTheta } \Bigg | {\mathbb {E}}_{{\mathcal {S}}_2}\left( \frac{1}{n}\sum _{i\in {\mathcal {C}}}\sum _{t=1}^{n_i}\log \frac{f_i(r_{i,t},\theta ^{\star })}{f_i(r_{i,t},\theta )} - \frac{1}{n}\sum _{i\in {\mathcal {C}}}\sum _{t=1}^{n_i}\log \frac{f_i({\tilde{r}}_{i,t},\theta ^{\star })}{f_i({\tilde{r}}_{i,t},\theta )} \Big | {\mathcal {S}}_1 \right) \Bigg | \nonumber \\&\le {\mathbb {E}}\sup _{\theta \in \varTheta } \Bigg | \frac{1}{n}\sum _{i\in {\mathcal {C}}}\sum _{t=1}^{n_i}\log \frac{f_i(r_{i,t},\theta ^{\star }) }{f_i(r_{i,t},\theta ) } - \frac{1}{n}\sum _{i\in {\mathcal {C}}}\sum _{t=1}^{n_i}\log \frac{f_i({\tilde{r}}_{i,t},\theta ^{\star }) }{f_i({\tilde{r}}_{i,t},\theta ) } \Bigg |, \end{aligned}$$
(47)

where the inequality follows from Jensen’s inequality (Rudin, 2006). Let \(\{\epsilon _{i,t}:t\in [1,n_i]\}_{i\in {\mathcal {C}}}\) be a sequence of i.i.d. random variables that assume the binary values \(\{1,-1\}\) with probability 1/2 each.

Let \({\mathcal {N}}( L_f \text {diam}(\varTheta ),\alpha )\) denote the \(\alpha\)-covering number of a set of diameter \(L_f \text {diam}(\varTheta )\). The inequality (47) then yields

$$\begin{aligned} {\mathbb {E}}(\xi )&\le 2{\mathbb {E}}\sup _{\theta \in \varTheta } \Bigg | \frac{1}{n}\sum _{i\in {\mathcal {C}}}\sum _{t=1}^{n_i} \epsilon _{i,t}\log \frac{f_i(r_{i,t},\theta ^{\star })}{f_i(r_{i,t},\theta )} \Bigg |\nonumber \\&\le 8 \int \limits _{0}^{L_f \text {diam}(\varTheta ) } \sqrt{\frac{\log {\mathcal {N}}( L_f diam(\varTheta ),\alpha ) }{n}}\textrm{d}\alpha \nonumber \\&\le L_f \text {diam}(\varTheta ) \sqrt{\frac{\pi }{n}}, \end{aligned}$$
(48)

where the first inequality follows from a symmetrization argument similar to (Wainwright, 2019, p. 107), the second inequality follows from Lemma 6, and the third inequality follows by bounding the covering number via a volume bound (Akshay, 2016; Yang, 2016; Wainwright, 2019). \(\square\)

We now derive a concentration result for \(\xi\) around its mean.

Lemma 4

We have the following concentration result for \(\xi\),

$$\begin{aligned} {\mathbb {P}}\left( | \xi - {\mathbb {E}}(\xi ) | > x \right) \le \exp \left( -\frac{n x^{2}}{2 L^2_p \sigma ^2} \right) , \end{aligned}$$
(49)

where \(\xi\) is as in (45), \(L_p\) is the Lipschitz constant associated with \(\xi\) as in (46), \(\sigma\) is the sub-Gaussianity parameter associated with the rewards as in (8) and n is the number of times arms from \({\mathcal {C}}\) are sampled.

Proof

It was shown in Lemma 2 that \(\xi\) is a \(L_p\) Lipschitz function of \(\{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\). Under Assumption 2 the rewards \(r_{i,t}\) are sub-Gaussian and hence satisfy (8). The relation (49) then follows from (Kontorovich, 2014, Theorem 1). \(\square\)

After having derived preliminary results, we are now in a position to prove the main result, i.e., Theorem 2.

Proof (Theorem 2)

Consider the normalized and shifted likelihood function \(L_{{\mathcal {C}}}(\cdot )\) as given in (32). Within this proof we let \(x>0\).

We obtain the following after using the results of Lemmas 3 and 4,

$$\begin{aligned} {\mathbb {P}}\left( \sup _{\theta \in \varTheta } \Bigg | L_{{\mathcal {C}}}(\theta ) -\frac{D( \theta ^{\star }_{{\mathcal {C}}} || \theta ) }{n} \Bigg | \ge \frac{B_1}{\sqrt{n}} + x \right) \le \exp \left( -\frac{n x^{2}}{2 L^2_p \sigma ^2} \right) , \end{aligned}$$
(50)

where \(B_1 = L_f \cdot \text {diam}(\varTheta ) \sqrt{\pi }\), \(x>0\), and \(L_f\) is as in (9). Thus, on an event of probability at least \(1-\exp \left( -\frac{n x^{2}}{2\,L^2_p \sigma ^2} \right)\), we have

$$\begin{aligned} \Bigg | L(\theta ^{\star }) -\frac{D( \theta ^{\star }|| \theta ^{\star }) }{n} \Bigg |&\le \frac{B_1}{\sqrt{n}} + x,\end{aligned}$$
(51)
$$\begin{aligned} \Bigg | L({\hat{\theta }}(n)) -\frac{D( \theta ^{\star }|| {\hat{\theta }}(n)) }{n} \Bigg |&\le \frac{B_1}{\sqrt{n}} + x. \end{aligned}$$
(52)

The above yields us

$$\begin{aligned} L(\theta ^{\star })&\le \frac{B_1}{\sqrt{n}} + x, \end{aligned}$$
(53)
$$\begin{aligned} \text { and } L({\hat{\theta }}(n))&\ge \frac{D( \theta ^{\star }|| {\hat{\theta }}(n)) }{n} - \left( \frac{B_1}{\sqrt{n}} + x\right) . \end{aligned}$$
(54)

Moreover, since \({\hat{\theta }}(n)\) minimizes the loss function, we also have

$$\begin{aligned} L({\hat{\theta }}(n)) \le L(\theta ^{\star }). \end{aligned}$$

After substituting (53) and (54) into the above inequality, we obtain the following,

$$\begin{aligned} \frac{D( \theta ^{\star }|| {\hat{\theta }}(n)) }{n} \le 2\left( \frac{B_1}{\sqrt{n}} + x\right) . \end{aligned}$$

This proves that the estimate \({\hat{\theta }}_{{\mathcal {C}}}(n)\) satisfies the following

$$\begin{aligned} {\mathbb {P}}\left( \frac{D( \theta ^{\star }_{{\mathcal {C}}} || {\hat{\theta }}_{{\mathcal {C}}}(n) )}{n} > 2\left( \frac{B_1}{\sqrt{n}} + x \right) \right) \le \exp \left( -\frac{n x^{2}}{2 L^2_p \sigma ^2} \right) , \end{aligned}$$
(55)

where \(x>0\). To see (34), note that under Assumption 1 we have

$$\begin{aligned} D(\theta ^{\star }|| {\hat{\theta }})\ge \left( \min _{j\in {\mathcal {C}}} \ell b_{(j,i)} \right) KL_i(\theta ^{\star }|| {\hat{\theta }}). \end{aligned}$$

Inequality (34) then follows by substituting this bound into (55).

To see (35), note that the vector describing the number of plays of each arm in \({\mathcal {C}}\) can assume at most \(N_{{\mathcal {C}}}(t)^{|{\mathcal {C}}|}\) values; this follows since the number of plays of each arm takes values in the set \([0,N_{{\mathcal {C}}}(t)]\). The result then follows by combining the result (34) for non-adaptive plays with a union bound. \(\square\)
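Theorem 2 states that \(D(\theta ^{\star }||{\hat{\theta }}(n))/n\) is, with high probability, of order \(1/\sqrt{n}\). As a purely illustrative numerical check (our own construction, not the authors' code), consider a cluster whose arms all share a single Bernoulli parameter \(\theta ^{\star }\), so that the pooled MLE is simply the sample mean across all pulls in the cluster; the average per-sample KL divergence between \(\theta ^{\star }\) and the MLE then shrinks with the number of pulls n:

```python
import math
import random

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with q clipped
    away from {0, 1} so the logarithms stay finite."""
    q = min(max(q, 1e-12), 1 - 1e-12)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def avg_kl_of_mle(theta_star, n, trials=2000, seed=0):
    """Average KL(theta_star || theta_hat) when theta_hat is the pooled MLE
    (sample mean) computed from n Bernoulli(theta_star) pulls in a cluster."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta_hat = sum(rng.random() < theta_star for _ in range(n)) / n
        total += kl_bernoulli(theta_star, theta_hat)
    return total / trials
```

With \(\theta ^{\star }=0.3\), the average KL at \(n=400\) is roughly an order of magnitude smaller than at \(n=25\): in this simple case the per-sample divergence is in fact \(O(1/n)\) in expectation, comfortably inside the theorem's \(O(1/\sqrt{n})\) high-probability bound.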

Appendix 2: Some auxiliary results

The following result is utilized while analyzing the regret of UCB-D.

Lemma 5

Consider the confidence balls \({\mathcal {O}}_{{\mathcal {C}}}(t)\) (24) computed by the UCB-D algorithm at time t. Suppose that all the confidence balls hold true at time t, i.e., \(\theta ^{\star }_{{\mathcal {C}}}\in {\mathcal {O}}_{{\mathcal {C}}}(t),~\forall {\mathcal {C}}\). Consider a cluster \({\mathcal {C}}\), and let \(i\in {\mathcal {C}}\) be a sub-optimal arm. Then, the UCB-D algorithm plays arm i only if

$$\begin{aligned} N_{{\mathcal {C}}_i}(t) \le \frac{\kappa \log t}{ \left( \varSigma _i \psi ^{-1}_i\left( \frac{\varDelta _i}{2}\right) \right) ^{2} }, \end{aligned}$$

where \(\psi ^{-1}_i,\varSigma _i\) are as in (5) and (26) respectively.

Proof

Since \(\theta ^{\star }_{{\mathcal {C}}} \in {\mathcal {O}}_{{\mathcal {C}}}(t)\), it follows from (24) that

$$\begin{aligned} \frac{1}{N_{{\mathcal {C}}}(t) }\sum _{j\in {\mathcal {C}}} N_j(t) KL_j({\hat{\theta }}_{{\mathcal {C}}}(t) || \theta ^{\star }_{{\mathcal {C}}} )\le d_{{\mathcal {C}}}(t),~\forall {\mathcal {C}}. \end{aligned}$$
(56)

It follows from Assumption 1 that \(\forall \theta _{1},\theta _2\in \varTheta\) and arms \(i,j\in {\mathcal {C}}\), we have the following

$$\begin{aligned} KL_j(\theta _1||\theta _2) \ge \ell b_{(j,i)} KL_i(\theta _1||\theta _2). \end{aligned}$$
(57)

Upon substituting the above inequality into (56), and letting the cluster of interest be \({\mathcal {C}}_i\), we obtain the following

$$\begin{aligned} KL_i({\hat{\theta }}_{{\mathcal {C}}_i}(t) || \theta ^{\star }_{{\mathcal {C}}_i} )&\le \varSigma ^{-1}_i d_{{\mathcal {C}}_{i} }(t), \end{aligned}$$
(58)

from which it follows that

$$\begin{aligned} \mu _i({\hat{\theta }}_{{\mathcal {C}}_i}(t)) \le \mu _i + {\overline{\psi }}_i\left( \frac{d_{{\mathcal {C}}_i}(t)}{\varSigma _i}\right) . \end{aligned}$$
(59)

Similarly, it follows from the definition of confidence ball \({\mathcal {O}}_{{\mathcal {C}}_i}(t)\) that

$$\begin{aligned} uc_i(t) \le \mu _i({\hat{\theta }}_{{\mathcal {C}}_i}(t)) + {\overline{\psi }}_i\left( \frac{d_{{\mathcal {C}}_i}(t)}{\varSigma _i } \right) . \end{aligned}$$
(60)

The above two inequalities yield,

$$\begin{aligned} {\overline{\psi }}_i\left( \frac{d_{{\mathcal {C}}_i}(t)}{\varSigma _i } \right) \ge \frac{uc_i(t) - \mu _i}{2}, \text { or},~ d_{{\mathcal {C}}_i}(t) \ge \varSigma _i ~\psi ^{-1}_i\left( \frac{uc_i(t) - \mu _i}{2} \right) . \end{aligned}$$
(61)

Since, by assumption, the UCB-D algorithm plays arm i at time t, we have

$$\begin{aligned} uc_i(t) \ge uc_{i^{\star }}(t) \ge \mu _{i^{\star }}, \end{aligned}$$

which gives,

$$\begin{aligned} uc_i(t) - \mu _i \ge \varDelta _i. \end{aligned}$$

Substituting the above into (61), we obtain the following,

$$\begin{aligned} d_{{\mathcal {C}}_i}(t) \ge \varSigma _i \psi ^{-1}_i\left( \frac{\varDelta _i}{2} \right) . \end{aligned}$$
(62)

Since \(d_{{\mathcal {C}}_i}(t)= \sqrt{\kappa \frac{\log t}{N_{{\mathcal {C}}_i}(t)}}\), the above reduces to

$$\begin{aligned} \sqrt{\kappa \frac{\log t}{N_{{\mathcal {C}}_i}(t)}} \ge \varSigma _i \psi ^{-1}_i\left( \frac{\varDelta _i}{2} \right) , \text{ or } N_{{\mathcal {C}}_i}(t) \le \frac{\kappa \log t}{ \left( \varSigma _i \psi ^{-1}_i\left( \frac{\varDelta _i}{2}\right) \right) ^{2} }. \end{aligned}$$
(63)

This completes the proof. \(\square\)
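To make the cap of Lemma 5 concrete, consider the Gaussian special case in which \({\overline{\psi }}_i(d) = \sigma \sqrt{2d}\), so that \(\psi ^{-1}_i(x) = x^{2}/(2\sigma ^{2})\). This specialization is our own illustrative assumption (the paper's \(\psi _i\) is general); under it the cap can be evaluated directly and grows only logarithmically in t:

```python
import math

def pull_cap(kappa, t, sigma_i, delta_i, noise_sd=1.0):
    """Upper bound from Lemma 5 on N_{C_i}(t) for a sub-optimal arm i,
    specialised to the Gaussian case psi_bar(d) = noise_sd * sqrt(2 * d),
    i.e. psi_inv(x) = x**2 / (2 * noise_sd**2)."""
    psi_inv = (delta_i / 2.0) ** 2 / (2.0 * noise_sd ** 2)
    return kappa * math.log(t) / (sigma_i * psi_inv) ** 2
```

For example, with \(\kappa = 2\), \(\sigma = \varSigma _i = 1\) and \(\varDelta _i = 1/2\), doubling \(\log t\) exactly doubles the cap, while a larger \(\varSigma _i\) (more informative cross-arm observations) shrinks it quadratically, which is precisely the mechanism behind the improved regret dependence.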

Lemma 6

Consider a set \(A\subset {\mathbb {R}}^{n}\) that satisfies \(\Vert a\Vert \le D, \forall a\in A\). Let \(\{\epsilon _i\}_{i=1}^{n}\) be i.i.d. and assume values \(1,-1\) with probability .5 each. We then have that

$$\begin{aligned} {\mathbb {E}}\left( \sup _{a\in A} \Big |\frac{1}{n} \sum _{i=1}^{n} \epsilon _i a_i \Big | \right) \le \frac{1}{\sqrt{n}}\int _{0}^{D} \sqrt{\log {\mathcal {N}}(\alpha ,A) }~\textrm{d}\alpha , \end{aligned}$$

where \({\mathcal {N}}(\alpha ,A)\) denotes the minimum number of balls of radius \(\alpha\) that are required to cover the set A.

Proof

Within this proof, we let D denote the diameter of the set A. Consider the decreasing sequence of numbers \(\alpha _n = 2^{-n} D,~n=1,2,\ldots\). Let \({\bar{A}}\) be the closure of A. Let \(Cov_{n}\subset {\bar{A}}\) be an \(\alpha _n\)-cover of the set A, and moreover let the cover formed by \(Cov_{n+1}\) be a refinement of \(Cov_n\). Fix \(a\in A\), and consider the sequence \({\hat{a}}_n\), where \({\hat{a}}_n\) is the point in the set \(Cov_n\) that is closest to a. Clearly, \(\Vert a-{\hat{a}}_n\Vert \le \alpha _n\), and also \(\Vert {\hat{a}}_{n}-{\hat{a}}_{n+1}\Vert \le \alpha _{n+1}\). Let \(\epsilon\) denote the vector \(\left( \epsilon _1,\epsilon _2,\ldots ,\epsilon _N\right)\). Since \(a ={\hat{a}}_0 + \left( \sum _{n=1}^{N} {\hat{a}}_{n} - {\hat{a}}_{n-1} \right) + a - {\hat{a}}_N\), we obtain the following,

$$\begin{aligned} {\mathbb {E}}\sup _{a\in A} \Big |\frac{1}{n} \sum _{i=1}^{n} \epsilon _i a_i \Big |&= {\mathbb {E}}\sup _{a\in {\bar{A}} } \frac{1}{n} \epsilon \cdot \left( {\hat{a}}_0 + \left( \sum _{n=1}^{N} {\hat{a}}_{n} - {\hat{a}}_{n-1} \right) + a - {\hat{a}}_N \right) \\&\le {\mathbb {E}}\sup _{a_n \in Cov_n, a_{n-1} \in Cov_{n-1}} \epsilon \cdot (a_{n}-a_{n-1}) + {\mathbb {E}}\sup _{a\in {\bar{A}} } \epsilon \cdot (a - {\hat{a}}_N)\\&\le \frac{1}{N}\sum _{n=1}^{N} \alpha _n \sqrt{\frac{2}{n}\log | Cov_n| | Cov_{n-1}| } +\alpha _N \\&\le \frac{1}{N}\sum _{n=1}^{N} \alpha _n \sqrt{\frac{2}{n}\log {\mathcal {N}}({\bar{A}},\alpha _n) } +\alpha _N \\&= \frac{1}{N}\sum _{n=1}^{N} 2(\alpha _n - \alpha _{n+1}) \sqrt{\frac{2}{n}\log {\mathcal {N}}({\bar{A}},\alpha _n) } +\alpha _N \\&\le 4 \int \limits _{\alpha _N}^{\alpha _0} \sqrt{\frac{2}{n}\log {\mathcal {N}}({\bar{A}},\alpha ) }~\textrm{d}\alpha +\alpha _N \\&\rightarrow 4\int \limits _{0}^{D} \sqrt{\frac{2}{n}\log {\mathcal {N}}({\bar{A}},\alpha ) }~\textrm{d}\alpha \text{ as } \alpha _N \rightarrow 0, \end{aligned}$$

where the first inequality follows from Massart’s Finite Class Lemma (Kakade & Tewari, 2008).\(\square\)
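When A is the Euclidean ball of radius D, the left-hand side of Lemma 6 can be computed in closed form, which makes the \(1/\sqrt{n}\) rate visible without any covering argument (this is our illustration, not part of the proof): the supremum is attained at \(a = D\epsilon /\Vert \epsilon \Vert\), giving the value \(D\Vert \epsilon \Vert /n\), which equals \(D/\sqrt{n}\) exactly since every \(\epsilon _i = \pm 1\).

```python
import math
import random

def sup_rademacher_ball(eps, D):
    """sup over the Euclidean ball of radius D of |(1/n) * sum(eps_i * a_i)|.
    By Cauchy-Schwarz this is attained at a = D * eps / ||eps||, so it equals
    D * ||eps|| / n."""
    n = len(eps)
    norm = math.sqrt(sum(e * e for e in eps))
    return D * norm / n

# each eps_i is +/-1, so ||eps|| = sqrt(n) and the supremum is exactly
# D / sqrt(n), matching the 1/sqrt(n) rate of Lemma 6 up to the entropy factor
```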


Cite this article

Singh, R., Liu, F., Sun, Y. et al. Multi-armed bandits with dependent arms. Mach Learn 113, 45–71 (2024). https://doi.org/10.1007/s10994-023-06457-z

