Abstract
We study a variant of the multi-armed bandit problem (MABP) which we call MABs with dependent arms. Multiple arms are grouped together to form a cluster, and the reward distributions of arms in the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm i not only reveals information about its own reward distribution, but also about all arms belonging to the same cluster. This “correlation” among the arms complicates the exploration–exploitation trade-off encountered in the MABP, because the observation dependencies allow us to simultaneously test multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the principle of optimism in the face of uncertainty (Lattimore and Szepesvári in Bandit algorithms, Cambridge University Press, 2020), which know the clusters and hence utilize these additional side observations appropriately while performing the exploration–exploitation trade-off. We show that the regret of our algorithms grows as \(O(K\log T)\), where K is the number of clusters. In contrast, for an algorithm such as vanilla UCB that does not utilize these dependencies, the regret scales as \(O(M\log T)\), where M is the number of arms. When \(K\ll M\), i.e., when there are many dependencies among the arms, our proposed algorithm drastically reduces the dependence of the regret on the number of arms.
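The cluster-aware optimism described above can be sketched in a few lines. The following toy implementation is an illustration only, not the paper's UCB-D algorithm: it assumes hypothetical linear link functions \(\psi_i(\theta) = a_i\theta\) with known scales \(a_i\), pools every pull within a cluster into a single estimate of that cluster's parameter, and attaches a cluster-level confidence radius to each arm's index, so the radius shrinks with the cluster pull count \(N_{\mathcal{C}}(t)\) rather than the per-arm count.

```python
import math
import random

def cluster_ucb(T, clusters, theta_star, seed=0):
    """Toy cluster-aware UCB (a sketch, not the paper's UCB-D).

    clusters: list of lists of known scales a_i; arm i in cluster C has
    Bernoulli mean a_i * theta_star[C] (a hypothetical choice of the known
    link functions psi_i, used only for illustration).
    """
    rng = random.Random(seed)
    arms = [(c, a) for c, grp in enumerate(clusters) for a in grp]
    n_pulls = [0] * len(arms)          # per-arm pull counts
    cluster_n = [0] * len(clusters)    # per-cluster pull counts N_C(t)
    theta_hat = [0.5] * len(clusters)  # pooled estimate of each cluster's theta
    reward_total = 0.0
    for t in range(1, T + 1):
        # optimistic index: estimated mean + cluster-level confidence radius
        def index(i):
            c, a = arms[i]
            if cluster_n[c] == 0:
                return float("inf")
            return a * theta_hat[c] + math.sqrt(2 * math.log(t) / cluster_n[c])
        i = max(range(len(arms)), key=index)
        c, a = arms[i]
        r = 1.0 if rng.random() < a * theta_star[c] else 0.0
        reward_total += r
        n_pulls[i] += 1
        # pooled update: each pull of arm i yields the unbiased estimate r/a of theta_C
        theta_hat[c] = (theta_hat[c] * cluster_n[c] + r / a) / (cluster_n[c] + 1)
        cluster_n[c] += 1
    best_mean = max(a * theta_star[c] for c, a in arms)
    regret = best_mean * T - reward_total
    return regret, n_pulls
```

Because the confidence radius is driven by the cluster pull count, only K radii have to shrink rather than M, which is the mechanism behind the \(O(K\log T)\) versus \(O(M\log T)\) separation.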
Code Availability
The code is available at the following link: https://github.com/fangliu0302/ClusterBandit
Notes
instance-dependent regret.
The relative gap between the lower bound and regret of UCB-D vanishes as \(K\rightarrow \infty\).
See Appendix 2 for more details.
References
Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In: Advances in Neural Information Processing Systems, (pp. 2312–2320)
Akshay D Kamath, S.G. (2016). CS 395T: Sublinear algorithms, lecture notes. https://www.cs.utexas.edu/~ecprice/courses/sublinear/notes/lec12.pdf
Atan, O., Tekin, C., & Schaar, M. (2015). Global multi-armed bandits with Hölder continuity. In: Artificial Intelligence and Statistics, (pp. 28–36)
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397–422.
Awerbuch, B., & Kleinberg, R. (2008). Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1), 97–114.
Ayoub, R. (1974). Euler and the zeta function. The American Mathematical Monthly, 81(10), 1067–1086.
Berry, D.A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments (monographs on statistics and applied probability). (vol. 5(71-87), pp. 7–7). Chapman and Hall.
Binette, O. (2019). A note on reverse Pinsker inequalities. IEEE Transactions on Information Theory, 65(7), 4094–4096. https://doi.org/10.1109/TIT.2019.2896192
Bouneffouf, D., Parthasarathy, S., Samulowitz, H., & Wistuba, M. (2019). Optimal exploitation of clustering and history information in multi-armed bandit. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, (pp. 2016–2022). https://doi.org/10.24963/ijcai.2019/279
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721
Buccapatnam, S., Eryilmaz, A., & Shroff, N.B. (2014). Stochastic bandits with side observations on networks. In: The 2014 ACM international conference on Measurement and modeling of computer systems, (pp. 289–300)
Carlsson, E., Dubhashi, D., & Johansson, F.D. (2021). Thompson sampling for bandits with clustered arms. In: Zhou, Z.H. (ed) Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, (pp. 2212–2218). Main track. https://doi.org/10.24963/ijcai.2021/305
Caron, S., Kveton, B., Lelarge, M., & Bhagat, S. (2012). Leveraging side observations in stochastic bandits. arXiv preprint arXiv:1210.4839
Cesa-Bianchi, N., Gentile, C., & Zappella, G. (2013). A gang of bandits. Advances in Neural Information Processing Systems 26
Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, (pp. 208–214).
Combes, R., Magureanu, S., & Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In: Advances in Neural Information Processing Systems, (pp. 1763–1771)
Cover, T. M. (1999). Elements of information theory. John Wiley & Sons.
Gai, Y., Krishnamachari, B., & Jain, R. (2012). Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5), 1466–1478.
Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Proceedings of the 24th annual conference on learning theory, (pp. 359–376).
Gentile, C., Li, S., & Zappella, G. (2014). Online clustering of bandits. In: International conference on machine learning, PMLR, pp 757–765
Gentile, C., Li, S., Kar, P., Karatzoglou, A., Zappella, G., & Etrue, E. (2017). On context-dependent clustering of bandits. In: International conference on machine learning, PMLR, (pp. 1253–1262).
Gittins, J., Glazebrook, K., & Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.
Götze, F., Sambale, H., & Sinulis, A. (2019). Higher order concentration for functions of weakly dependent random variables.
Gupta, S., Joshi, G., & Yagan, O. (2018). Exploiting correlation in finite-armed structured bandits. arXiv preprint arXiv:1810.08164
Gupta, S., Joshi, G., & Yağan, O. (2020). Correlated multi-armed bandits with a latent random source. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (pp. 3572–3576). IEEE.
Kakade, S., & Tewari, A. (2008). CMSC 35900 (Spring 2008) learning theory, lecture notes: Massart’s finite class lemma and growth function. https://ttic.uchicago.edu/~tewari/lectures/lecture10.pdf
Kontorovich, A. (2014). Concentration in unbounded metric spaces and algorithmic stability. In: International conference on machine learning, (pp. 28–36)
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.
Langford, J., & Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In: Advances in neural information processing systems, (pp. 817–824).
Lattimore, T., & Munos, R. (2014). Bounded regret for finite-armed structured bandits. In: Advances in neural information processing systems, (pp. 550–558).
Lattimore, T., & Szepesvari, C. (2017). The end of optimism? An asymptotic analysis of finite-armed linear bandits. In: Artificial intelligence and statistics, PMLR, (pp. 728–737).
Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
Ledoux, M., & Talagrand, M. (2013). Probability in Banach spaces: Isoperimetry and processes. Springer Science & Business Media.
Li, L., Chu, W., Langford, J., & Schapire, R.E. (2010). A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on World Wide Web, (pp. 661–670).
Mannor, S., & Shamir, O. (2011). From bandits to experts: On the value of side-observations. In: Advances in neural information processing systems, (pp. 684–692)
Miao, Y. (2010). Concentration inequality of maximum likelihood estimator. Applied Mathematics Letters, 23(10), 1305–1309.
Pandey, S., Chakrabarti, D., & Agarwal, D. (2007). Multi-armed bandit problems with dependent arms. In: Proceedings of the 24th international conference on machine learning, (pp. 721–728).
Resnick, S. (2019). A probability path. Springer.
Rudin, W. (2006). Real and complex analysis. Tata McGraw-Hill Education.
Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2), 395–411.
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.
Vaswani, S., Schmidt, M., & Lakshmanan, L. (2017). Horde of Bandits using Gaussian Markov Random Fields. In: Singh A, Zhu J (eds) Proceedings of the 20th international conference on artificial intelligence and statistics, PMLR, proceedings of machine learning research, (vol 54, pp. 690–699). https://proceedings.mlr.press/v54/vaswani17a.html
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press.
Wang, Z., Zhou, R., & Shen, C. (2018a). Regional multi-armed bandits. In: International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, PMLR, Proceedings of Machine Learning Research, (vol. 84, pp. 510–518)
Wang, Z., Zhou, R., & Shen, C. (2018b). Regional multi-armed bandits with partial informativeness. IEEE Transactions on Signal Processing, 66(21), 5705–5717.
Yang, X., Liu, X., & Wei, H. (2022). Concentration inequalities of MLE and robust MLE. arXiv preprint arXiv:2210.09398
Yang, Y. (2016). ECE 598: Information-theoretic methods in high-dimensional statistics, lecture notes. http://www.stat.yale.edu/~yw562/teaching/598/lec14.pdf
Funding
Rahul Singh’s research was partially funded by the Science and Engineering Research Board under the project SRG/2021/002308. Ness Shroff was partially funded by the National Science Foundation under the projects CNS-1901057, CNS-2007231, CNS-1618520, and CNS-1409336.
Author information
Authors and Affiliations
Contributions
RS and FL contributed to the theoretical analysis, algorithmic formulation, and simulations. YS and NS supervised the development of the research and provided feedback at all stages of the process, up to the final draft of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent for publication
The authors of this manuscript consent to its publication.
Additional information
Editor: Hendrik Blockeel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Proof of Theorem 2 (concentration of \({\hat{\theta }}(n)\))
Throughout this proof, we drop the subscript \({\mathcal {C}}\), since the discussion concerns a single fixed cluster \({\mathcal {C}}\). Let \({\mathcal {S}}_1:= \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\) denote the set of rewards obtained from n pulls of arms in \({\mathcal {C}}\). Consider the function \(\xi\) defined as follows,
We begin by deriving a few preliminary results that will be utilized while proving the main result.
Lemma 2
The function \(\xi\) is a Lipschitz continuous function of the rewards obtained, i.e., for two sample-paths \(\omega _1,\omega _2\) we have that,
where \(L_p>0\).
Proof
From Assumption 2 we have that the log-likelihood ratio \(\frac{f_i(r,\theta ^{\star }) }{f_i(r,\theta )}\) is a Lipschitz continuous function of \(\theta\). The proof then follows since Lipschitz continuity is preserved upon averaging, and also when two Lipschitz continuous functions are composed. \(\square\)
We now derive an upper-bound on the expectation of \(\xi\).
Lemma 3
We have
\(L_f\) is as in (9).
Proof
Let \({\mathcal {S}}_2:= \{{\tilde{r}}_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\) be an independent copy of \({\mathcal {S}}_1= \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\). We then have that
where the inequality follows from Jensen’s inequality (Rudin, 2006). Let \(\{\epsilon _{i,t}:t\in [1,n_i]\}_{i\in {\mathcal {C}}}\) be a sequence of i.i.d. random variables that assume the binary values \(\{1,-1\}\) with probability 1/2 each.
Let \({\mathcal {N}}( L_f \text{diam}(\varTheta ),\alpha )\) denote an \(\alpha\)-covering. The inequality (47) then yields
where the first inequality follows by using a symmetrization argument similar to (Wainwright, 2019, p. 107), the second inequality follows from Lemma 6, and the third inequality follows by bounding the covering number using a volume bound (Akshay, 2016; Yang, 2016; Wainwright, 2019). \(\square\)
We now derive a concentration result for \(\xi\) around its mean.
Lemma 4
We have the following concentration result for \(\xi\),
where \(\xi\) is as in (45), \(L_p\) is the Lipschitz constant associated with \(\xi\) as in (46), \(\sigma\) is the sub-Gaussianity parameter associated with the rewards as in (8) and n is the number of times arms from \({\mathcal {C}}\) are sampled.
Proof
It was shown in Lemma 2 that \(\xi\) is an \(L_p\)-Lipschitz function of \(\{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\). Under Assumption 2 the rewards \(r_{i,t}\) are sub-Gaussian and hence satisfy (8). The relation (49) then follows from (Kontorovich, 2014, Theorem 1). \(\square\)
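As a numerical sanity check on tail bounds of this form, the following Monte Carlo sketch compares the empirical tail of the simplest Lipschitz statistic, the empirical mean of n i.i.d. Gaussian rewards, against a two-sided version of the exponential bound \(\exp(-nx^2/(2L_p^2\sigma^2))\). The choice of statistic and the normalization \(L_p = 1\) are illustrative only; the function \(\xi\) in the proof is more complicated.

```python
import math
import random

def tail_vs_bound(n=50, x=0.3, trials=20000, sigma=1.0, seed=1):
    """Monte Carlo check of a Lipschitz-concentration tail bound of the
    shape appearing in Lemma 4, for the empirical mean of n i.i.d.
    N(0, sigma^2) rewards (a stand-in for xi; illustrative only)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        m = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        if abs(m) > x:
            exceed += 1
    empirical = exceed / trials
    # two-sided version of exp(-n x^2 / (2 L_p^2 sigma^2)) with L_p = 1
    bound = 2 * math.exp(-n * x * x / (2 * sigma * sigma))
    return empirical, bound
```

With the defaults, the empirical exceedance frequency sits well below the exponential bound, as the lemma predicts.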
After having derived preliminary results, we are now in a position to prove the main result, i.e., Theorem 2.
Proof (Theorem 2)
Consider the normalized and shifted likelihood function \(L_{{\mathcal {C}}}(\cdot )\) as given in (32). Within this proof we let \(x>0\).
We obtain the following after using the results of Lemmas 3 and 4,
where \(B_1 = L_f \cdot \text {diam}(\varTheta ) \sqrt{\pi }\), \(x>0\), and \(L_f\) is as in (9). Thus, we have the following on a set that has a probability greater than \(\exp \left( -\frac{n x^{2}}{2\,L^2_p \sigma ^2} \right)\),
The above yields
Moreover, since \({\hat{\theta }}(n)\) minimizes the loss function, we also have
After substituting (53) and (54) into the above inequality, we obtain the following,
This proves that the estimate \({\hat{\theta }}_{{\mathcal {C}}}(n)\) satisfies the following
where \(x>0\). To see (34), note that under Assumption 1 we have
(34) then follows by substituting this inequality into (55).
To see (35), we note that the vector describing the number of plays of each arm in \({\mathcal {C}}\) can assume at most \(N_{{\mathcal {C}}}(t)^{|{\mathcal {C}}|}\) values; this follows since the number of plays of each arm takes values in the set \([0,N_{{\mathcal {C}}}(t)]\). The result then follows by combining the result (34) for non-adaptive plays with the union bound. \(\square\)
Appendix 2: Some auxiliary results
The following result is utilized while analyzing the regret of UCB-D.
Lemma 5
Consider the confidence balls \({\mathcal {O}}_{{\mathcal {C}}}(t)\) (24) computed by the UCB-D algorithm at time t. Suppose all the confidence balls hold at time t, i.e., \(\theta ^{\star }_{{\mathcal {C}}}\in {\mathcal {O}}_{{\mathcal {C}}}(t),~\forall {\mathcal {C}}\). Consider a cluster \({\mathcal {C}}\), and let \(i\in {\mathcal {C}}\) be a sub-optimal arm. Then, the UCB-D algorithm plays it only if
where \(\psi ^{-1}_i,\varSigma _i\) are as in (5) and (26) respectively.
Proof
Since \(\theta ^{\star }_{{\mathcal {C}}} \in {\mathcal {O}}_{{\mathcal {C}}}(t)\), it follows from (24) that
It follows from Assumption 1 that \(\forall \theta _{1},\theta _2\in \varTheta\) and arms \(i,j\in {\mathcal {C}}\), we have the following
Upon substituting the above inequality into (56), and letting the cluster of interest be \({\mathcal {C}}_i\), we obtain the following
from which it follows that
Similarly, it follows from the definition of confidence ball \({\mathcal {O}}_{{\mathcal {C}}_i}(t)\) that
The above two inequalities yield,
Under our assumption, the UCB-D algorithm plays arm i at time t, so that we have
which gives,
Substituting the above into (61), we obtain the following,
Since \(d_{{\mathcal {C}}_i}(t)= \sqrt{\kappa \frac{\log t}{N_{{\mathcal {C}}_i}(t)}}\), the above reduces to
This completes the proof. \(\square\)
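A condition of this shape can be inverted to bound how long a suboptimal arm survives. The helper below is a sketch, not the paper's exact bound: it assumes (hypothetically) that the play condition reduces to \(d_{{\mathcal {C}}_i}(t) \ge \Delta _i / c\) for some constant c absorbing the Lipschitz factors of \(\psi ^{-1}_i\) and \(\varSigma _i\), and solves \(d_{{\mathcal {C}}_i}(t)= \sqrt{\kappa \log t / N_{{\mathcal {C}}_i}(t)}\) for the largest cluster pull count at which arm i can still be played.

```python
import math

def max_suboptimal_pulls(gap, kappa, t, c=1.0):
    """Largest N_C(t) for which sqrt(kappa * log t / N_C(t)) >= gap / c,
    i.e. the last cluster pull count at which a suboptimal arm with
    suboptimality gap `gap` can still be played.  The constant c is a
    hypothetical stand-in for the problem-dependent Lipschitz factors."""
    return math.floor(c * c * kappa * math.log(t) / (gap * gap))
```

Since the threshold scales as \(\log t / \Delta _i^2\) per cluster rather than per arm, summing over the K clusters gives the \(O(K \log T)\) regret dependence.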
Lemma 6
Consider a set \(A\subset {\mathbb {R}}^{n}\) that satisfies \(\Vert a\Vert \le D, \forall a\in A\). Let \(\{\epsilon _i\}_{i=1}^{n}\) be i.i.d. random variables assuming the values \(1\) and \(-1\) with probability 1/2 each. We then have that
where \({\mathcal {N}}(\alpha ,A)\) denotes the minimum number of balls of radius \(\alpha\) that are required to cover the set A.
Proof
Within this proof, we let D denote the diameter of the set A. Consider a decreasing sequence of numbers \(\alpha _n = 2^{-n} D,~n=1,2,\ldots\). Let \({\bar{A}}\) be the closure of A. Let \(Cov_{n}\subset {\bar{A}}\) be an \(\alpha _n\)-cover of the set A, and moreover let the cover formed by \(Cov_{n+1}\) be a refinement of \(Cov_n\). Fix an \(a\in A\), and consider the sequence \(\hat{a}_n\), where \({\hat{a}}_n\) is the point in the set \(Cov_n\) that is closest to a. Clearly, \(\Vert a-{\hat{a}}_n\Vert \le \alpha _n\), and also \(\Vert {\hat{a}}_{n}-{\hat{a}}_{n+1}\Vert \le \alpha _{n+1}\). Let \(\epsilon\) be the vector \(\left( \epsilon _1,\epsilon _2,\ldots ,\epsilon _N\right)\). Since \(a ={\hat{a}}_0 + \left( \sum _{n=1}^{N} {\hat{a}}_{n} - {\hat{a}}_{n-1} \right) + a - {\hat{a}}_N\), we obtain the following,
where the first inequality follows from Massart’s Finite Class Lemma (Kakade & Tewari, 2008).\(\square\)
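Massart's finite class lemma, the ingredient invoked above, states that \({\mathbb {E}}[\max _{a\in A}\langle \epsilon ,a\rangle ] \le D\sqrt{2\log |A|}\) for a finite set A contained in the ball of radius D. The following Monte Carlo sketch checks this numerically for a randomly generated finite set (illustrative only; the set A and its size are arbitrary choices).

```python
import math
import random

def massart_check(n=20, m=8, D=1.0, trials=5000, seed=2):
    """Monte Carlo check of Massart's finite class lemma:
    E[max_{a in A} <eps, a>] <= D * sqrt(2 * log |A|),
    for a random finite set A of m vectors on the sphere of radius D."""
    rng = random.Random(seed)
    # build a finite set A of m vectors with Euclidean norm exactly D
    A = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        A.append([D * x / norm for x in v])
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(e * a for e, a in zip(eps, av)) for av in A)
    empirical = total / trials
    bound = D * math.sqrt(2 * math.log(m))
    return empirical, bound
```

The empirical expectation of the maximum stays below the \(D\sqrt{2\log m}\) bound, which is what the chaining argument in the proof sums across scales.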
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, R., Liu, F., Sun, Y. et al. Multi-armed bandits with dependent arms. Mach Learn 113, 45–71 (2024). https://doi.org/10.1007/s10994-023-06457-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-023-06457-z