Abstract
We study a variant of the multi-armed bandit problem (MABP) which we call MABs with dependent arms. Multiple arms are grouped together to form a cluster, and the reward distributions of arms in the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm i not only reveals information about its own reward distribution, but also about all arms belonging to the same cluster. This “correlation” among the arms complicates the exploration–exploitation trade-off encountered in the MABP, because the observation dependencies allow us to simultaneously test multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the principle of optimism in the face of uncertainty (Lattimore and Szepesvári in Bandit algorithms, Cambridge University Press, 2020), which know the clusters and hence utilize these additional side observations appropriately while performing the exploration–exploitation trade-off. We show that the regret of our algorithms grows as \(O(K\log T)\), where K is the number of clusters. In contrast, for an algorithm such as vanilla UCB that does not utilize these dependencies, the regret scales as \(O(M\log T)\), where M is the number of arms. When \(K\ll M\), i.e., when there are many dependencies among the arms, our proposed algorithm drastically reduces the dependence of the regret on the number of arms.
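The cluster-aware optimism described above can be sketched in a few lines. The following toy implementation is an illustration only, not the paper's UCB-D algorithm: it assumes hypothetical linear link functions \(\psi_i(\theta) = a_i\theta\) with known scales \(a_i\), pools every pull within a cluster into a single estimate of that cluster's parameter, and attaches a cluster-level confidence radius to each arm's index, so the radius shrinks with the cluster pull count \(N_{\mathcal{C}}(t)\) rather than the per-arm count.

```python
import math
import random

def cluster_ucb(T, clusters, theta_star, seed=0):
    """Toy cluster-aware UCB (a sketch, not the paper's UCB-D).

    clusters: list of lists of known scales a_i; arm i in cluster C has
    Bernoulli mean a_i * theta_star[C] (a hypothetical choice of the known
    link functions psi_i, used only for illustration).
    """
    rng = random.Random(seed)
    arms = [(c, a) for c, grp in enumerate(clusters) for a in grp]
    n_pulls = [0] * len(arms)          # per-arm pull counts
    cluster_n = [0] * len(clusters)    # per-cluster pull counts N_C(t)
    theta_hat = [0.5] * len(clusters)  # pooled estimate of each cluster's theta
    reward_total = 0.0
    for t in range(1, T + 1):
        # optimistic index: estimated mean + cluster-level confidence radius
        def index(i):
            c, a = arms[i]
            if cluster_n[c] == 0:
                return float("inf")
            return a * theta_hat[c] + math.sqrt(2 * math.log(t) / cluster_n[c])
        i = max(range(len(arms)), key=index)
        c, a = arms[i]
        r = 1.0 if rng.random() < a * theta_star[c] else 0.0
        reward_total += r
        n_pulls[i] += 1
        # pooled update: each pull of arm i yields the unbiased estimate r/a of theta_C
        theta_hat[c] = (theta_hat[c] * cluster_n[c] + r / a) / (cluster_n[c] + 1)
        cluster_n[c] += 1
    best_mean = max(a * theta_star[c] for c, a in arms)
    regret = best_mean * T - reward_total
    return regret, n_pulls
```

Because the confidence radius is driven by the cluster pull count, only K radii have to shrink rather than M, which is the mechanism behind the \(O(K\log T)\) versus \(O(M\log T)\) separation.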
Code Availability
The code is available at the following link: https://github.com/fangliu0302/ClusterBandit
Notes
instance-dependent regret.
The relative gap between the lower bound and regret of UCB-D vanishes as \(K\rightarrow \infty\).
See Appendix 2 for more details.
References
Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In: Advances in Neural Information Processing Systems, (pp. 2312–2320)
Akshay D Kamath, S.G. (2016). CS 395T: Sublinear algorithms, lecture notes. https://www.cs.utexas.edu/~ecprice/courses/sublinear/notes/lec12.pdf
Atan, O., Tekin, C., & Schaar, M. (2015). Global multi-armed bandits with Hölder continuity. In: Artificial Intelligence and Statistics, (pp. 28–36)
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397–422.
Awerbuch, B., & Kleinberg, R. (2008). Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1), 97–114.
Ayoub, R. (1974). Euler and the zeta function. The American Mathematical Monthly, 81(10), 1067–1086.
Berry, D.A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments (monographs on statistics and applied probability). (vol. 5(71-87), pp. 7–7). Chapman and Hall.
Binette, O. (2019). A note on reverse Pinsker inequalities. IEEE Transactions on Information Theory, 65(7), 4094–4096. https://doi.org/10.1109/TIT.2019.2896192
Bouneffouf, D., Parthasarathy, S., Samulowitz, H., & Wistuba, M. (2019). Optimal exploitation of clustering and history information in multi-armed bandit. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, (pp. 2016–2022). https://doi.org/10.24963/ijcai.2019/279
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721
Buccapatnam, S., Eryilmaz, A., & Shroff, N.B. (2014). Stochastic bandits with side observations on networks. In: The 2014 ACM international conference on Measurement and modeling of computer systems, (pp. 289–300)
Carlsson, E., Dubhashi, D., & Johansson, F.D. (2021). Thompson sampling for bandits with clustered arms. In: Zhou, Z.H. (ed) Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, (pp. 2212–2218). Main track. https://doi.org/10.24963/ijcai.2021/305
Caron, S., Kveton, B., Lelarge, M., & Bhagat, S. (2012). Leveraging side observations in stochastic bandits. arXiv preprint arXiv:1210.4839
Cesa-Bianchi, N., Gentile, C., & Zappella, G. (2013). A gang of bandits. Advances in Neural Information Processing Systems 26
Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, (pp. 208–214).
Combes, R., Magureanu, S., & Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In: Advances in Neural Information Processing Systems, (pp. 1763–1771)
Cover, T. M. (1999). Elements of information theory. John Wiley & Sons.
Gai, Y., Krishnamachari, B., & Jain, R. (2012). Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5), 1466–1478.
Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Proceedings of the 24th annual conference on learning theory, (pp. 359–376).
Gentile, C., Li, S., & Zappella, G. (2014). Online clustering of bandits. In: International conference on machine learning, PMLR, pp 757–765
Gentile, C., Li, S., Kar, P., Karatzoglou, A., Zappella, G., & Etrue, E. (2017). On context-dependent clustering of bandits. In: International conference on machine learning, PMLR, (pp. 1253–1262).
Gittins, J., Glazebrook, K., & Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.
Götze, F., Sambale, H., & Sinulis, A. (2019). Higher order concentration for functions of weakly dependent random variables.
Gupta, S., Joshi, G., & Yagan, O. (2018). Exploiting correlation in finite-armed structured bandits. arXiv preprint arXiv:1810.08164
Gupta, S., Joshi, G., & Yağan, O. (2020). Correlated multi-armed bandits with a latent random source. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (pp. 3572–3576). IEEE.
Kakade, S., & Tewari, A. (2008). CMSC 35900 (Spring 2008) learning theory, lecture notes: Massart’s finite class lemma and growth function. https://ttic.uchicago.edu/~tewari/lectures/lecture10.pdf
Kontorovich, A. (2014). Concentration in unbounded metric spaces and algorithmic stability. In: International conference on machine learning, (pp. 28–36)
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.
Langford, J., & Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In: Advances in neural information processing systems, (pp. 817–824).
Lattimore, T., & Munos, R. (2014). Bounded regret for finite-armed structured bandits. In: Advances in neural information processing systems, (pp. 550–558).
Lattimore, T., & Szepesvari, C. (2017). The end of optimism? An asymptotic analysis of finite-armed linear bandits. In: Artificial intelligence and statistics, PMLR, (pp. 728–737).
Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
Ledoux, M., & Talagrand, M. (2013). Probability in Banach spaces: Isoperimetry and processes. Springer Science & Business Media.
Li, L., Chu, W., Langford, J., & Schapire, R.E. (2010). A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on World Wide Web, (pp. 661–670).
Mannor, S., & Shamir, O. (2011). From bandits to experts: On the value of side-observations. In: Advances in neural information processing systems, (pp. 684–692)
Miao, Y. (2010). Concentration inequality of maximum likelihood estimator. Applied Mathematics Letters, 23(10), 1305–1309.
Pandey, S., Chakrabarti, D., & Agarwal, D. (2007). Multi-armed bandit problems with dependent arms. In: Proceedings of the 24th international conference on machine learning, (pp. 721–728).
Resnick, S. (2019). A probability path. Springer.
Rudin, W. (2006). Real and complex analysis. Tata McGraw-Hill Education.
Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2), 395–411.
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.
Vaswani, S., Schmidt, M., & Lakshmanan, L. (2017). Horde of Bandits using Gaussian Markov Random Fields. In: Singh A, Zhu J (eds) Proceedings of the 20th international conference on artificial intelligence and statistics, PMLR, proceedings of machine learning research, (vol 54, pp. 690–699). https://proceedings.mlr.press/v54/vaswani17a.html
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press.
Wang, Z., Zhou, R., & Shen, C. (2018a). Regional multi-armed bandits. In: International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, PMLR, Proceedings of Machine Learning Research, (vol. 84, pp. 510–518)
Wang, Z., Zhou, R., & Shen, C. (2018b). Regional multi-armed bandits with partial informativeness. IEEE Transactions on Signal Processing, 66(21), 5705–5717.
Yang, X., Liu, X., & Wei, H. (2022). Concentration inequalities of MLE and robust MLE. arXiv preprint arXiv:2210.09398
Yang, Y. (2016). ECE 598: Information-theoretic methods in high-dimensional statistics, lecture notes. http://www.stat.yale.edu/~yw562/teaching/598/lec14.pdf
Funding
Rahul Singh’s research was partially funded by the Science and Engineering Research Board under the project SRG/2021/002308. Ness Shroff was partially funded by the National Science Foundation under the projects CNS-1901057, CNS-2007231, CNS-1618520, and CNS-1409336.
Author information
Authors and Affiliations
Contributions
RS and FL contributed to the theoretical analysis, algorithmic formulation, and simulations. YS and NS supervised the development of the research and provided feedback at all stages of the process, up to the final draft of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent for publication
The authors of this manuscript consent to its publication.
Additional information
Editor: Hendrik Blockeel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Proof of Theorem 2 (concentration of \({\hat{\theta }}(n)\))
Throughout this proof, we drop the subscript \({\mathcal {C}}\), since the discussion concerns a single fixed cluster \({\mathcal {C}}\). Let \({\mathcal {S}}_1:= \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\) denote the set of rewards obtained from n pulls of arms in \({\mathcal {C}}\). Consider the function \(\xi\) defined as follows,
We begin by deriving a few preliminary results that will be utilized while proving the main result.
Lemma 2
The function \(\xi\) is a Lipschitz continuous function of the rewards obtained, i.e., for two sample-paths \(\omega _1,\omega _2\) we have that,
where \(L_p>0\).
Proof
From Assumption 2 we have that the log-likelihood ratio \(\frac{f_i(r,\theta ^{\star }) }{f_i(r,\theta )}\) is a Lipschitz continuous function of \(\theta\). The proof then follows since Lipschitz continuity is preserved upon averaging, and also when two Lipschitz continuous functions are composed. \(\square\)
We now derive an upper-bound on the expectation of \(\xi\).
Lemma 3
We have
\(L_f\) is as in (9).
Proof
Let \({\mathcal {S}}_2:= \{{\tilde{r}}_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\) be an independent copy of \({\mathcal {S}}_1= \{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\). We then have that
where the inequality follows from Jensen’s inequality (Rudin, 2006). Let \(\{\epsilon _{i,t}:t\in [1,n_i]\}_{i\in {\mathcal {C}}}\) be a sequence of i.i.d. random variables that assume the binary values \(\{1,-1\}\) with probability 1/2 each.
Let \({\mathcal {N}}( L_f \text{diam}(\varTheta ),\alpha )\) denote an \(\alpha\)-covering. The inequality (47) then yields
where the first inequality follows by using a symmetrization argument similar to (Wainwright, 2019, p. 107), the second inequality follows from Lemma 6, and the third inequality follows by bounding the covering number using a volume bound (Akshay, 2016; Yang, 2016; Wainwright, 2019). \(\square\)
We now derive a concentration result for \(\xi\) around its mean.
Lemma 4
We have the following concentration result for \(\xi\),
where \(\xi\) is as in (45), \(L_p\) is the Lipschitz constant associated with \(\xi\) as in (46), \(\sigma\) is the sub-Gaussianity parameter associated with the rewards as in (8) and n is the number of times arms from \({\mathcal {C}}\) are sampled.
Proof
It was shown in Lemma 2 that \(\xi\) is an \(L_p\)-Lipschitz function of \(\{r_{i,t}: t\in [1,n_i] \}_{i\in {\mathcal {C}}}\). Under Assumption 2 the rewards \(r_{i,t}\) are sub-Gaussian and hence satisfy (8). The relation (49) then follows from (Kontorovich, 2014, Theorem 1). \(\square\)
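As a numerical sanity check on tail bounds of this form, the following Monte Carlo sketch compares the empirical tail of the simplest Lipschitz statistic, the empirical mean of n i.i.d. Gaussian rewards, against a two-sided version of the exponential bound \(\exp(-nx^2/(2L_p^2\sigma^2))\). The choice of statistic and the normalization \(L_p = 1\) are illustrative only; the function \(\xi\) in the proof is more complicated.

```python
import math
import random

def tail_vs_bound(n=50, x=0.3, trials=20000, sigma=1.0, seed=1):
    """Monte Carlo check of a Lipschitz-concentration tail bound of the
    shape appearing in Lemma 4, for the empirical mean of n i.i.d.
    N(0, sigma^2) rewards (a stand-in for xi; illustrative only)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        m = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        if abs(m) > x:
            exceed += 1
    empirical = exceed / trials
    # two-sided version of exp(-n x^2 / (2 L_p^2 sigma^2)) with L_p = 1
    bound = 2 * math.exp(-n * x * x / (2 * sigma * sigma))
    return empirical, bound
```

With the defaults, the empirical exceedance frequency sits well below the exponential bound, as the lemma predicts.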
After having derived preliminary results, we are now in a position to prove the main result, i.e., Theorem 2.
Proof (Theorem 2)
Consider the normalized and shifted likelihood function \(L_{{\mathcal {C}}}(\cdot )\) as given in (32). Within this proof we let \(x>0\).
We obtain the following after using the results of Lemmas 3 and 4,
where \(B_1 = L_f \cdot \text {diam}(\varTheta ) \sqrt{\pi }\), \(x>0\), and \(L_f\) is as in (9). Thus, we have the following on a set that has a probability greater than \(\exp \left( -\frac{n x^{2}}{2\,L^2_p \sigma ^2} \right)\),
The above yields
Moreover, since \({\hat{\theta }}(n)\) minimizes the loss function, we also have
After substituting (53) and (54) into the above inequality, we obtain the following,
This proves that the estimate \({\hat{\theta }}_{{\mathcal {C}}}(n)\) satisfies the following
where \(x>0\). To see (34), note that under Assumption 1 we have
(34) then follows by substituting this inequality into (55).
To see (35), we note that the vector describing the number of plays of each arm in \({\mathcal {C}}\) can assume at most \(N_{{\mathcal {C}}}(t)^{|{\mathcal {C}}|}\) values; this follows since the number of plays of each arm takes values in the set \([0,N_{{\mathcal {C}}}(t)]\). The result then follows by combining the result (34) for non-adaptive plays with the union bound. \(\square\)
Appendix 2: Some auxiliary results
The following result is utilized while analyzing the regret of UCB-D.
Lemma 5
Consider the confidence balls \({\mathcal {O}}_{{\mathcal {C}}}(t)\) (24) computed by the UCB-D algorithm at time t. Suppose all the confidence balls hold at time t, i.e., \(\theta ^{\star }_{{\mathcal {C}}}\in {\mathcal {O}}_{{\mathcal {C}}}(t),~\forall {\mathcal {C}}\). Consider a cluster \({\mathcal {C}}\), and let \(i\in {\mathcal {C}}\) be a sub-optimal arm. Then, the UCB-D algorithm plays it only if
where \(\psi ^{-1}_i,\varSigma _i\) are as in (5) and (26) respectively.
Proof
Since \(\theta ^{\star }_{{\mathcal {C}}} \in {\mathcal {O}}_{{\mathcal {C}}}(t)\), it follows from (24) that
It follows from Assumption 1 that \(\forall \theta _{1},\theta _2\in \varTheta\) and arms \(i,j\in {\mathcal {C}}\), we have the following
Upon substituting the above inequality into (56), and letting the cluster of interest be \({\mathcal {C}}_i\), we obtain the following
from which it follows that
Similarly, it follows from the definition of confidence ball \({\mathcal {O}}_{{\mathcal {C}}_i}(t)\) that
The above two inequalities yield,
Under our assumption, the UCB-D algorithm plays arm i at time t, so that we have
which gives,
Substituting the above into (61), we obtain the following,
Since \(d_{{\mathcal {C}}_i}(t)= \sqrt{\kappa \frac{\log t}{N_{{\mathcal {C}}_i}(t)}}\), the above reduces to
This completes the proof. \(\square\)
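A condition of this shape can be inverted to bound how long a suboptimal arm survives. The helper below is a sketch, not the paper's exact bound: it assumes (hypothetically) that the play condition reduces to \(d_{{\mathcal {C}}_i}(t) \ge \Delta _i / c\) for some constant c absorbing the Lipschitz factors of \(\psi ^{-1}_i\) and \(\varSigma _i\), and solves \(d_{{\mathcal {C}}_i}(t)= \sqrt{\kappa \log t / N_{{\mathcal {C}}_i}(t)}\) for the largest cluster pull count at which arm i can still be played.

```python
import math

def max_suboptimal_pulls(gap, kappa, t, c=1.0):
    """Largest N_C(t) for which sqrt(kappa * log t / N_C(t)) >= gap / c,
    i.e. the last cluster pull count at which a suboptimal arm with
    suboptimality gap `gap` can still be played.  The constant c is a
    hypothetical stand-in for the problem-dependent Lipschitz factors."""
    return math.floor(c * c * kappa * math.log(t) / (gap * gap))
```

Since the threshold scales as \(\log t / \Delta _i^2\) per cluster rather than per arm, summing over the K clusters gives the \(O(K \log T)\) regret dependence.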
Lemma 6
Consider a set \(A\subset {\mathbb {R}}^{n}\) that satisfies \(\Vert a\Vert \le D, \forall a\in A\). Let \(\{\epsilon _i\}_{i=1}^{n}\) be i.i.d. random variables assuming the values \(1\) and \(-1\) with probability 1/2 each. We then have that
where \({\mathcal {N}}(\alpha ,A)\) denotes the minimum number of balls of radius \(\alpha\) that are required to cover the set A.
Proof
Within this proof, we let D denote the diameter of the set A. Consider a decreasing sequence of numbers \(\alpha _n = 2^{-n} D,~n=1,2,\ldots\). Let \({\bar{A}}\) be the closure of A. Let \(Cov_{n}\subset {\bar{A}}\) be an \(\alpha _n\)-cover of the set A, and moreover let the cover formed by \(Cov_{n+1}\) be a refinement of \(Cov_n\). Fix an \(a\in A\), and consider the sequence \(\hat{a}_n\), where \({\hat{a}}_n\) is the point in the set \(Cov_n\) that is closest to a. Clearly, \(\Vert a-{\hat{a}}_n\Vert \le \alpha _n\), and also \(\Vert {\hat{a}}_{n}-{\hat{a}}_{n+1}\Vert \le \alpha _{n+1}\). Let \(\epsilon\) be the vector \(\left( \epsilon _1,\epsilon _2,\ldots ,\epsilon _N\right)\). Since \(a ={\hat{a}}_0 + \left( \sum _{n=1}^{N} {\hat{a}}_{n} - {\hat{a}}_{n-1} \right) + a - {\hat{a}}_N\), we obtain the following,
where the first inequality follows from Massart’s Finite Class Lemma (Kakade & Tewari, 2008).\(\square\)
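Massart's finite class lemma, the ingredient invoked above, states that \({\mathbb {E}}[\max _{a\in A}\langle \epsilon ,a\rangle ] \le D\sqrt{2\log |A|}\) for a finite set A contained in the ball of radius D. The following Monte Carlo sketch checks this numerically for a randomly generated finite set (illustrative only; the set A and its size are arbitrary choices).

```python
import math
import random

def massart_check(n=20, m=8, D=1.0, trials=5000, seed=2):
    """Monte Carlo check of Massart's finite class lemma:
    E[max_{a in A} <eps, a>] <= D * sqrt(2 * log |A|),
    for a random finite set A of m vectors on the sphere of radius D."""
    rng = random.Random(seed)
    # build a finite set A of m vectors with Euclidean norm exactly D
    A = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        A.append([D * x / norm for x in v])
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(e * a for e, a in zip(eps, av)) for av in A)
    empirical = total / trials
    bound = D * math.sqrt(2 * math.log(m))
    return empirical, bound
```

The empirical expectation of the maximum stays below the \(D\sqrt{2\log m}\) bound, which is what the chaining argument in the proof sums across scales.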
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, R., Liu, F., Sun, Y. et al. Multi-armed bandits with dependent arms. Mach Learn 113, 45–71 (2024). https://doi.org/10.1007/s10994-023-06457-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-023-06457-z