
Online Learning in Budget-Constrained Dynamic Colonel Blotto Games

Dynamic Games and Applications (2023)

Abstract

In this paper, we study the strategic allocation of limited resources using a Colonel Blotto game (CBG) in a dynamic setting and analyze the problem through an online learning approach. In this model, one player is a learner with a limited number of troops to allocate over a finite time horizon, and the other player is an adversary. In each round, the learner plays a one-shot Colonel Blotto game against the adversary and strategically determines the allocation of troops among the battlefields based on past observations. The adversary chooses its allocation randomly from a fixed distribution that is unknown to the learner. The learner’s objective is to minimize its regret, defined as the difference between the cumulative reward of the best mixed strategy and the cumulative reward realized by following a learning algorithm, while not violating the budget constraint. Learning in the dynamic CBG is analyzed within the frameworks of combinatorial bandits and bandits with knapsacks. We first convert the budget-constrained dynamic CBG into a path planning problem on a directed graph. We then devise an efficient algorithm that combines a special combinatorial bandit algorithm for the path planning problem with a bandits-with-knapsacks algorithm to cope with the budget constraint. The theoretical analysis shows that the learner’s regret is bounded by a term sublinear in the time horizon and polynomial in the other parameters. Finally, we justify our theoretical results by carrying out simulations for various scenarios.


Availability of Data and Materials

This declaration is not applicable. This article does not use any datasets.

Notes

  1. A static game is a one-shot game where all players act simultaneously. A game is dynamic if players are allowed to act multiple times based on the history of their strategies and observations.

  2. Here, \(\mathcal {L}_t(a_t, i_t)=r_t(a_t)+1- \frac{T}{B_0}w_{t,i_t}(a_t)\) is the Lagrangian function at round t, where the subscript t again suppresses the dependency of the parameters on the adversary’s action.
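As a rough illustration of footnote 2 above (not part of the paper; the function and argument names are placeholders), the following sketch computes the Lagrangian reward that folds the per-round troop expenditure into the reward.

```python
# A minimal sketch, assuming the form of the Lagrangian in footnote 2:
# L_t = r_t + 1 - (T / B0) * w_t, where r_t is the round reward, w_t the troops
# spent in the round, and B0 the scaled budget. All names are placeholders.
def lagrangian_reward(r_t: float, w_t: float, T: int, B0: float) -> float:
    return r_t + 1.0 - (T / B0) * w_t
```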

References

  1. Agrawal S, Devanur NR, Li L (2016) An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In: Feldman V, Rakhlin A, Shamir O (eds) 29th Annual conference on learning theory, vol 49. Proceedings of machine learning research. Columbia University, New York, New York, USA, pp 4–18

  2. Agrawal S, Devanur NR (2014) Bandits with concave rewards and convex knapsacks. In: Proceedings of the fifteenth ACM conference on economics and computation. EC ’14. Association for Computing Machinery, New York, NY, USA, pp 989–1006. https://doi.org/10.1145/2600057.2602844

  3. Agrawal S, Devanur NR (2016) Linear contextual bandits with knapsacks. In: Proceedings of the 30th international conference on neural information processing systems. NIPS’16. Curran Associates Inc., Red Hook, NY, USA, pp 3458–3467

  4. Ahmadinejad A, Dehghani S, Hajiaghayi M, Lucier B, Mahini H, Seddighin S (2019) From duels to battlefields: computing equilibria of Blotto and other games. Math Oper Res 44(4):1304–1325. https://doi.org/10.1287/moor.2018.0971


  5. Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J Comput 32(1):48–77. https://doi.org/10.1137/S0097539701398375


  6. Badanidiyuru A, Kleinberg R, Slivkins A (2013) Bandits with knapsacks. In: 2013 IEEE 54th annual symposium on foundations of computer science, Berkeley, California, USA, pp 207–216. IEEE

  7. Bartlett P, Dani V, Hayes T, Kakade S, Rakhlin A, Tewari A (2008) High-probability regret bounds for bandit online linear optimization. In: Proceedings of the 21st annual conference on learning theory-COLT 2008. Omnipress, pp 335–342

  8. Behnezhad S, Dehghani S, Derakhshan M, Hajiaghayi M, Seddighin S (2023) Fast and simple solutions of Blotto games. Oper Res 71(2):506–516. https://doi.org/10.1287/opre.2022.2261


  9. Behnezhad S, Dehghani S, Derakhshan M, Hajiaghayi MT, Seddighin S (2017) Faster and simpler algorithm for optimal strategies of Blotto game. In: Proceedings of the thirty-first AAAI conference on artificial intelligence. AAAI’17. AAAI Press, San Francisco, California, USA, pp 369–375

  10. Borel E (1921) La théorie du jeu et les équations intégrales à noyau symétrique. Comptes rendus de l’Acad des Sci 173(1304–1308):58


  11. Borel E (1953) The theory of play and integral equations with skew symmetric kernels. Econometrica 21(1):97–100


  12. Cesa-Bianchi N, Lugosi G (2012) Combinatorial bandits. J Comput Syst Sci 78(5):1404–1422. https://doi.org/10.1016/j.jcss.2012.01.001


  13. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge, England. https://doi.org/10.1017/CBO9780511546921

  14. Combes R, Talebi Mazraeh Shahi MS, Proutiere A, Lelarge M (2015) Combinatorial bandits revisited. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (eds) Advances in neural information processing systems, vol 28. Curran Associates Inc, Montreal, Quebec, Canada

  15. Dani V, Hayes TP, Kakade SM (2007) The price of bandit information for online optimization. In: Proceedings of the 20th international conference on neural information processing systems. NIPS’07. Curran Associates Inc., Red Hook, NY, USA, pp 345–352

  16. Etesami SR (2021) Open-loop equilibrium strategies for dynamic influence maximization game over social networks. IEEE Control Syst Lett 6:1496–1500. https://doi.org/10.1109/LCSYS.2021.3116030


  17. Etesami SR, Başar T (2019) Dynamic games in cyber-physical security: an overview. Dyn Games Appl 9(4):884–913. https://doi.org/10.1007/s13235-018-00291-y


  18. Ferdowsi A, Sanjab A, Saad W, Basar T (2018) Generalized Colonel Blotto game. In: 2018 American Control Conference (ACC), Milwaukee, Wisconsin, USA, pp 5744–5749. https://doi.org/10.23919/ACC.2018.8431701

  19. Fréchet M (1953) Emile Borel, initiator of the theory of psychological games and its application. Econometrica 21(1):95


  20. Fréchet M (1953) Commentary on the three notes of Emile Borel. Econometrica 21(1):118–124


  21. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139. https://doi.org/10.1006/jcss.1997.1504


  22. Gross OA, Wagner RA (1950) A continuous Colonel Blotto game. RAND Corporation, Santa Monica, CA


  23. Guan S, Wang J, Yao H, Jiang C, Han Z, Ren Y (2020) Colonel Blotto games in network systems: Models, strategies, and applications. IEEE Trans Netw Sci Eng 7(2):637–649. https://doi.org/10.1109/TNSE.2019.2904530


  24. Gupta A, Schwartz G, Langbort C, Sastry SS, Başar T (2014) A three-stage Colonel Blotto game with applications to cyberphysical security. In: 2014 American Control Conference (ACC), Portland, Oregon, USA, pp 3820–3825. https://doi.org/10.1109/ACC.2014.6859164

  25. Hajimirsaadeghi M, Mandayam NB (2017) A dynamic Colonel Blotto game model for spectrum sharing in wireless networks. In: 2017 55th Annual Allerton conference on communication, control, and computing (Allerton), pp 287–294. https://doi.org/10.1109/ALLERTON.2017.8262750

  26. Hajimirsadeghi M, Sridharan G, Saad W, Mandayam NB (2016) Inter-network dynamic spectrum allocation via a Colonel Blotto game. In: 2016 Annual conference on information science and systems (CISS), pp 252–257. https://doi.org/10.1109/CISS.2016.7460510

  27. Hortala-Vallve R, Llorente-Saguer A (2012) Pure strategy Nash equilibria in non-zero sum Colonel Blotto games. Int J Game Theory 41(2):331–343


  28. Immorlica N, Sankararaman KA, Schapire R, Slivkins A (2019) Adversarial bandits with knapsacks. In: 2019 IEEE 60th annual symposium on foundations of computer science (FOCS), pp 202–219. https://doi.org/10.1109/FOCS.2019.00022

  29. Immorlica N, Sankararaman KA, Schapire R, Slivkins A (2020) Adversarial bandits with knapsacks. arXiv:1811.11881

  30. Kovenock D, Roberson B (2012) Coalitional Colonel Blotto games with application to the economics of alliances. J Public Econ Theory 14(4):653–676. https://doi.org/10.1111/j.1467-9779.2012.01556.x


  31. Kovenock D, Roberson B (2021) Generalizations of the General Lotto and Colonel Blotto games. Econ Theor 71(3):997–1032


  32. Labib M, Ha S, Saad W, Reed JH (2015) A Colonel Blotto game for anti-jamming in the Internet of Things. In: 2015 IEEE global communications conference (GLOBECOM), pp 1–6. https://doi.org/10.1109/GLOCOM.2015.7417437

  33. Laslier J-F (2002) How two-party competition treats minorities. Rev Econ Design 7:297–307


  34. Laslier J-F, Picard N (2002) Distributive politics and electoral competition. J Econ Theory 103(1):106–130. https://doi.org/10.1006/jeth.2000.2775


  35. Lattimore T, Szepesvári C (2020) Bandit algorithms. Cambridge University Press, Cambridge. https://doi.org/10.1017/9781108571401

  36. Li X, Sun C, Ye Y (2021) The symmetry between arms and knapsacks: a primal-dual approach for bandits with knapsacks. In: Meila M, Zhang T (eds) Proceedings of the 38th international conference on machine learning. Proceedings of machine learning research, vol 139, pp 6483–6492

  37. Min M, Xiao L, Xie C, Hajimirsadeghi M, Mandayam NB (2018) Defense against advanced persistent threats in dynamic cloud storage: a Colonel Blotto game approach. IEEE Internet Things J 5(6):4250–4261. https://doi.org/10.1109/JIOT.2018.2844878


  38. Roberson B (2006) The Colonel Blotto game. Econ Theor 29(1):1–24


  39. Thomas C (2018) N-dimensional Blotto game with heterogeneous battlefield values. Econ Theor 65(3):509–544


  40. von Neumann J, Fréchet M (1953) Communication on the Borel notes. Econometrica 21(1):124–127

  41. Vu DQ, Loiseau P, Silva A (2018) Efficient computation of approximate equilibria in discrete Colonel Blotto games. In: Proceedings of the twenty-seventh international joint conference on artificial intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, pp 519–526. https://doi.org/10.24963/ijcai.2018/72

  42. Vu DQ, Loiseau P, Silva A (2019) Combinatorial bandits for sequential learning in Colonel Blotto games. arXiv:1909.04912

  43. Vu DQ, Loiseau P, Silva A (2019) Combinatorial bandits for sequential learning in Colonel Blotto games. In: 2019 IEEE 58th conference on decision and control (CDC), pp 867–872. https://doi.org/10.1109/CDC40024.2019.9029186

  44. Zhang L, Wang Y, Han Z (2022) Safeguarding UAV-enabled wireless power transfer against aerial eavesdropper: a Colonel Blotto game. IEEE Wirel Commun Lett 11(3):503–507. https://doi.org/10.1109/LWC.2021.3133891



Funding

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-23-1-0107 and the NSF CAREER Award under grant number EPCN-1944403.

Author information


Contributions

Vincent Leon wrote the main manuscript with support from S. Rasoul Etesami. All authors reviewed the manuscript.

Corresponding author

Correspondence to Vincent Leon.

Ethics declarations

Competing interests

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethical Approval

This declaration is not applicable. This article does not contain any human or animal studies.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Existing Algorithms

For the sake of completeness, in this appendix, we provide a detailed description of the existing algorithms (Algorithms 4–6) that we use as subroutines in our main algorithm.

Algorithm 4: Weight-Pushing Algorithm, [42, Algorithm 2]
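Since the pseudocode itself appears only as a figure, the following is a hedged sketch of generic weight pushing on a layered directed acyclic graph, written in the spirit of [42, Algorithm 2] rather than as a transcription of it; the interface (succ, weight, source, sink) is assumed for illustration.

```python
import random

# Sketch of weight pushing: H[v] sums the product of edge weights over all v-to-sink
# paths; a path is then sampled edge by edge with probability proportional to
# weight(u, v) * H[v], so each complete path is drawn with probability proportional
# to the product of its edge weights. Assumes integer node labels in topological order.
def sample_path(succ, weight, source, sink):
    """succ[u]: successors of u; weight[(u, v)] > 0; every node reaches the sink."""
    H = {sink: 1.0}
    for u in sorted(succ, reverse=True):          # backward pass toward the source
        if u != sink:
            H[u] = sum(weight[(u, v)] * H[v] for v in succ[u])
    path, u = [source], source
    while u != sink:                              # forward pass: sample one edge at a time
        probs = [weight[(u, v)] * H[v] / H[u] for v in succ[u]]
        u = random.choices(succ[u], weights=probs)[0]
        path.append(u)
    return path

# Example: two source-to-sink paths with weights 2 and 1; [0, 1, 3] is drawn w.p. 2/3.
succ = {0: [1, 2], 1: [3], 2: [3]}
weight = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0}
print(sample_path(succ, weight, source=0, sink=3))
```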

Algorithm 5: Co-occurrence Matrix Computation Algorithm, [42, Algorithm 3]
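As a stand-in for the exact recursive computation in [42, Algorithm 3] (which is not reproduced here), the following Monte Carlo sketch approximates the co-occurrence matrix \(M(\mu )=\mathbb {E}_{\varvec{u}\sim \mu }[\varvec{u}\varvec{u}^{\intercal }]\) of the edge-indicator vectors; the sampler sample_path_indicator is a hypothetical placeholder.

```python
import numpy as np

# Monte Carlo approximation of M(mu) = E_{u ~ mu}[u u^T]; 'sample_path_indicator'
# is assumed to return a {0,1}^E indicator vector of a path drawn from mu.
def cooccurrence_mc(sample_path_indicator, E, num_samples=10_000, rng=None):
    rng = rng or np.random.default_rng()
    M = np.zeros((E, E))
    for _ in range(num_samples):
        u = sample_path_indicator(rng)
        M += np.outer(u, u)
    return M / num_samples
```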

Algorithm 6: Hedge, [21, Fig. 1]
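A minimal sketch of the exponential-weights (Hedge) update of [21] is shown below; it is not a transcription of the figure referenced above, and it uses rewards rather than losses.

```python
import numpy as np

# Hedge / exponential weights: maintain one weight per action, play proportionally
# to the weights, and multiply each weight by exp(eta * reward) after each round.
class Hedge:
    def __init__(self, num_actions: int, eta: float):
        self.eta = eta
        self.w = np.ones(num_actions)

    def distribution(self) -> np.ndarray:
        return self.w / self.w.sum()

    def update(self, rewards: np.ndarray) -> None:
        self.w *= np.exp(self.eta * rewards)
```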

Appendix B: Concentration Inequalities

Lemma 3

[Bernstein’s inequality for martingales, Lemma A.8 in [13]] Let \(Y_1, Y_2, \ldots \) be a martingale difference sequence (i.e., \({\mathbb {E}}[Y_t\vert Y_{t-1},\ldots ,Y_1]=0 \, \forall t \in {\mathbb {Z}}_+\)). Suppose that \(|Y_t |\le c\) and \({\mathbb {E}}[Y_t^2 \vert Y_{t-1}, \ldots , Y_1] \le v\) almost surely for all \(t \in {\mathbb {Z}}_+\). For any \(\delta > 0\),

$$\begin{aligned} \text {P }\left( \sum _{t=1}^T Y_t > \sqrt{2 Tv \ln (1/\delta )} + \frac{2}{3} c\ln (1/\delta ) \right) \le \delta . \end{aligned}$$

Lemma 4

[Azuma-Hoeffding’s inequality, Lemma A.7 in [13]] Let \(Y_1, Y_2, \ldots \) be a martingale difference sequence and \(\vert Y_t\vert \le c\) almost surely for all \(t \in {\mathbb {Z}}_+\). For any \(\delta > 0\),

$$\begin{aligned} \text {P }\left( \sum _{t=1}^T Y_t > \sqrt{2Tc^2 \ln (1/\delta )}\right) \le \delta . \end{aligned}$$
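The following Monte Carlo check (not part of the paper) illustrates Lemma 4 on an i.i.d. bounded sequence, which is a special case of a martingale difference sequence; the constants are arbitrary, and the bound is typically quite loose in this regime.

```python
import numpy as np

# Empirical tail probability vs. the Azuma-Hoeffding threshold of Lemma 4.
rng = np.random.default_rng(0)
T, c, delta, runs = 500, 1.0, 0.05, 10_000
threshold = np.sqrt(2 * T * c**2 * np.log(1 / delta))

Y = rng.uniform(-c, c, size=(runs, T))          # |Y_t| <= c, E[Y_t] = 0
exceed = (Y.sum(axis=1) > threshold).mean()     # should not exceed delta
print(f"empirical P(sum > threshold) = {exceed:.4f}  (bound: {delta})")
```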

Appendix C: Proof of Theorem 1

The proof of Theorem 1 is built upon several results from combinatorial bandits [7, 12, 15]. In the following, we first state some useful lemmas from [12, 43], and then use them to prove Theorem 1.

Throughout the appendix, let \(r_t(\varvec{u}) = \varvec{l}_t^{\intercal } \varvec{u}\) denote the reward obtained by playing arm \(\varvec{u}\) in round t, and let \({\hat{r}}_t(\varvec{u}) = \hat{\varvec{l}}_t^{\intercal } \varvec{u}\) denote the estimated reward, where \(\hat{\varvec{l}}_t\) is the unbiased cost estimator computed in Algorithm 2.
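For concreteness, a hedged sketch of the standard ComBand-style unbiased estimator from [12] is given below; it is only assumed here, not quoted from the paper, that Algorithm 2 uses an estimator of this form.

```python
import numpy as np

# ComBand-style loss estimator: hat_l_t = r_t(u_t) * C_t^+ u_t, where
# C_t = E_{u ~ p_t}[u u^T] is the co-occurrence matrix of the sampling distribution.
# On the span of the arms, E[hat_l_t] recovers the true loss vector l_t.
def estimate_loss(arms: np.ndarray, p: np.ndarray, played_idx: int, reward: float) -> np.ndarray:
    """arms: (S, E) binary matrix of arms; p: sampling distribution over the S arms."""
    C = (arms * p[:, None]).T @ arms             # C_t = sum_u p(u) u u^T
    u = arms[played_idx]
    return reward * (np.linalg.pinv(C) @ u)      # pseudo-inverse handles rank deficiency
```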

Lemma 5

Let \(\lambda ^*\) be the smallest nonzero eigenvalue of the co-occurrence matrix \(M(\mu )=\mathbb {E}_{\varvec{u}\sim \mu }[\varvec{u}\varvec{u}^{\intercal }]\) for the exploration distribution \(\mu \) in algorithm Edge. For all \(\varvec{u} \in {\mathcal {S}} \subseteq \{0,1\}^E\) and all \(t\in [T]\), the following relations hold:

  • \(\vert \vert \varvec{u}\vert \vert ^2 = \sum _{i=1}^{E} u_i^2 = n\)

  • \(\vert {\hat{r}}_t(\varvec{u})\vert = \vert \hat{\varvec{l}}_t^{\intercal } \varvec{u}\vert \le \frac{n}{\gamma \lambda ^*}\)

  • \(\varvec{u}^{\intercal } C_t^{-1} \varvec{u} \le \frac{n}{\gamma \lambda ^*}\)

  • \(\sum _{\varvec{u} \in \mathcal {S}} p_t(\varvec{u}) \varvec{u}^{\intercal } C_t^{-1} \varvec{u} \le E\)

  • \(\mathbb {E}_t[(\hat{\varvec{l}}_t^{\intercal } \varvec{u})^2] \le \varvec{u}^{\intercal } C_t^{-1} \varvec{u}\)

where in the last expression \(\mathbb {E}_t[\cdot ] := \mathbb {E}[\cdot \vert \varvec{u}_{t-1},\ldots ,\varvec{u}_1]\).

Lemma 6

[12, Appendix A] By choosing \(\eta = \frac{\gamma \lambda ^*}{n}\) such that \(\eta \vert {\hat{r}}_t(\varvec{u})\vert \le 1\), for all \(\varvec{u}^* \in \mathcal {S}\),

$$\begin{aligned} \sum _{t=1}^T {\hat{r}}_t(\varvec{u}^*) - \frac{1}{\eta }\ln S&\le \frac{1}{1-\gamma } \sum _{t=1}^T \sum _{\varvec{u}\in \mathcal {S}} p_t(\varvec{u}) {\hat{r}}_t(\varvec{u}) \\&\quad + \frac{\eta }{1-\gamma } \sum _{t=1}^T \sum _{\varvec{u}\in \mathcal {S}} p_t(\varvec{u}){\hat{r}}_t(\varvec{u})^2\\&\quad - \frac{\gamma }{1-\gamma } \sum _{t=1}^T \sum _{\varvec{u}\in \mathcal {S}} {\hat{r}}_t(\varvec{u}) \mu (\varvec{u}). \end{aligned}$$
(C1)

Proof

This is a straightforward result of Eqs. (A.1)–(A.3) in [12] by flipping the sign of \(\eta \). \(\square \)

Lemma 6 provides a baseline for bounding the regret. We will proceed to bound each summation in (C1). The following lemma, derived from Bernstein’s inequality (Lemma 3), provides a high-probability bound on the left side of (C1).

Lemma 7

With probability at least \(1 - \delta \), for all \(\varvec{u} \in \mathcal {S}\), it holds that

$$\begin{aligned} \sum _{t=1}^T r_t(\varvec{u}) - \sum _{t=1}^T {\hat{r}}_t(\varvec{u}) \le \sqrt{2T \left( \frac{n}{\gamma \lambda ^*}\right) \ln (S/\delta )} + \frac{2}{3}\left( \frac{n}{\gamma \lambda ^*}+1\right) \ln (S/\delta ). \end{aligned}$$

Proof

Fix \(\varvec{u} \in \mathcal {S}\). Define \(Y_t = r_t(\varvec{u}) - {\hat{r}}_t(\varvec{u})\). Then \(\{Y_t\}_{t=1}^T\) is a martingale difference sequence. From Lemma 5, we know that \(\vert Y_t\vert \le \frac{n}{\gamma \lambda ^*}+1\). Let \(\mathbb {E}_t[Y_t^2] = \mathbb {E}[Y_t^2\vert Y_{t-1}, \ldots , Y_1]\). Then,

$$\begin{aligned} \mathbb {E}_t[Y_t^2] \le \mathbb {E}_t[(\hat{\varvec{l}}_t^{\intercal } \varvec{u})^2] \le \varvec{u}^{\intercal } C_t^{-1} \varvec{u} \le \frac{n}{\gamma \lambda ^*}. \end{aligned}$$

Using Bernstein’s inequality, with probability at least \(1-\delta /S\),

$$\begin{aligned} \sum _{t=1}^T Y_t \le \sqrt{2T \left( \frac{n}{\gamma \lambda ^*}\right) \ln (S/\delta )} + \frac{2}{3}\left( \frac{n}{\gamma \lambda ^*}+1\right) \ln (S/\delta ) \end{aligned}$$

The lemma now follows by using the above inequality and taking a union bound over all \(\varvec{u} \in \mathcal {S}\). \(\square \)

The following two lemmas, obtained in [7], provide high-probability bounds on the first and second summands on the right side of (C1). The proofs of these lemmas, which are omitted here due to space limitations, rely on a direct application of Bernstein’s inequality and the Azuma-Hoeffding inequality.

Lemma 8

[7, Lemma 6] With probability at least \(1-\delta \),

$$\begin{aligned} \sum _{t=1}^T \sum _{\varvec{u}\in \mathcal {S}} p_t(\varvec{u}) {\hat{r}}_t(\varvec{u}) -\sum _{t=1}^T r_t(\varvec{u}_t) \le \left( \sqrt{E} + 1\right) \sqrt{2T\ln (1/\delta )}+\frac{4}{3}\ln (1/\delta )\left( \frac{n}{\gamma \lambda ^*}+1\right) . \end{aligned}$$

Lemma 9

[7, Lemma 8] With probability at least \(1 - \delta \),

$$\begin{aligned} \sum _{t=1}^T \sum _{\varvec{u}\in \mathcal {S}} p_t(\varvec{u}) {\hat{r}}_t(\varvec{u})^2 \le ET + \frac{n}{\gamma \lambda ^*}\sqrt{2T \ln (1/\delta )}. \end{aligned}$$

Now we are ready to complete the proof of Theorem 1. Using Lemma 7 and the fact that \(\sum _{t=1}^T r_t(\varvec{u}) \ge 0\) for every \(\varvec{u}\in \mathcal {S}\), we can bound the last term on the right side of (C1) with probability at least \(1-\delta \) as

$$\begin{aligned} - \gamma \sum _{t=1}^T \sum _{\varvec{u}\in \mathcal {S}} {\hat{r}}_t(\varvec{u}) \mu (\varvec{u}) \le \gamma \sqrt{2T\left( \frac{n}{\gamma \lambda ^*}\right) \ln (S/\delta )}+ \frac{2}{3}\gamma \left( \frac{n}{\gamma \lambda ^*}+1\right) \ln (S/\delta ) \end{aligned}$$

Using Lemmas 7 to 9 and the above inequality in Lemma 6, with probability at least \(1 - 4\delta \), for all \(\varvec{u} \in \mathcal {S}\) we have,

$$\begin{aligned}&\sum _{t=1}^T r_t(\varvec{u}) - \sum _{t=1}^T r_t(\varvec{u}_t) \le \sqrt{2T \left( \frac{n}{\gamma \lambda ^*}\right) \ln (S/\delta )} + 3\left( \frac{n}{\gamma \lambda ^*}+1\right) \ln (S/\delta ) \\&\quad + (\sqrt{E} + 1) \sqrt{2T \ln (1/\delta )} + \frac{\gamma \lambda ^*}{n} ET + \sqrt{2T \ln (S/\delta )} + \gamma T. \end{aligned}$$

Finally, if we set

$$\begin{aligned} \gamma = \frac{n}{\lambda ^*}\sqrt{\frac{\ln S}{\left( \frac{n}{E\lambda ^*}+1\right) ET^{2/3}}}, \end{aligned}$$

the following regret bound can be obtained:

$$\begin{aligned} \sum _{t=1}^T r_t(\varvec{u})&- \sum _{t=1}^T r_t(\varvec{u}_t) \le \sqrt{2 \left( \frac{n}{E\lambda ^*}+1\right) ^{1/2} T^{4/3} E^{1/2} (\ln (S/\delta ))^{1/2}} \\&+ 3 \sqrt{\left( \frac{n}{E\lambda ^*}+1\right) E T^{2/3} \ln (S/\delta )} + 3 \ln (S/\delta ) \\&+ \sqrt{\left( \frac{n}{E\lambda ^*}+1\right) E T^{4/3} \ln S}+ (\sqrt{E}+1)\sqrt{2T \ln (1/\delta )} + \sqrt{2T \ln (S/\delta )} \\&= O\left( T^{2/3}\sqrt{\left( \frac{n}{E\lambda ^*}+1\right) E \ln (S/\delta )}\right) . \end{aligned}$$
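Purely as a numerical illustration (not from the paper), the following sketch evaluates the above choice of \(\gamma \) and the order of the resulting regret bound for hypothetical values of n, E, S, and \(\lambda ^*\); the ratio of the bound to T shrinks as T grows, reflecting the \(O(T^{2/3})\) scaling.

```python
import numpy as np

# Hypothetical parameter values; only the scaling in T is meant to be illustrative.
def gamma_and_bound(T, n, E, S, lam_star, delta=0.05):
    gamma = (n / lam_star) * np.sqrt(np.log(S) / ((n / (E * lam_star) + 1) * E * T ** (2 / 3)))
    bound = T ** (2 / 3) * np.sqrt((n / (E * lam_star) + 1) * E * np.log(S / delta))
    return gamma, bound

for T in (10**3, 10**4, 10**5):
    g, b = gamma_and_bound(T, n=5, E=50, S=200, lam_star=0.5)
    print(f"T={T:>6}  gamma={g:.4f}  regret bound ~ {b:.0f}  bound/T = {b / T:.3f}")
```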

Appendix D: Proof of Theorem 2

It has been shown in [21, 28] that the algorithm Hedge achieves the high-probability regret bound of

$$\begin{aligned} R_{\delta }(T) = O\left( \sqrt{T \ln (\vert A\vert /\delta )}\right) , \end{aligned}$$

where \(\vert A\vert \) denotes the cardinality of the action set, which in our setting is the number of resources. Since we only have two types of resources, namely time and troops, we have \(\vert A\vert =2=O(1)\).

From Eq. (11), we have \(\vert \mathcal {L}_t^{\text {troop}}\vert \le \max \{2, \vert 1-c \vert \} \le 1+c\) for \(c \ge 1\). As a result, the actual reward \(r_t(\varvec{u})\), the estimated reward \({\hat{r}}_t(\varvec{u})\), and hence the regret bound of Theorem 1 are all scaled up by at most a constant factor \(1+c\). Hence, in order to satisfy the assumption \(\eta \vert {\hat{r}}_t(\varvec{u})\vert \le 1\) required by Theorem 1 and [12], we set

$$\begin{aligned} \eta = \frac{\gamma \lambda ^*}{(1+c)n} = \frac{1}{1+c} \sqrt{\frac{\ln S}{\left( \frac{n}{E\lambda ^*}+1\right) ET^{2/3}}}. \end{aligned}$$

On the other hand, from Theorem 1, the high-probability regret bound of algorithm Lagrange-Edge is at most

$$\begin{aligned} R_{\delta }(T) = O\left( T^{2/3}\sqrt{\left( \frac{n}{E\lambda ^*}+1\right) E \ln (S/\delta )}\right) . \end{aligned}$$

Now, using Lemma 1, and noting that \(B_0\) is scaled to \(\frac{B}{(cB/T)}=\frac{T}{c}\) such that \(O\left( \frac{T}{B_0}\right) =O(1)\), we obtain that with probability at least \(1-O(\delta T)\), it holds that

$$\begin{aligned} R(T)&\le O(1) \cdot \Bigg ( O\left( T^{2/3}\sqrt{\left( \frac{n}{E\lambda ^*}+1\right) E \ln (ST/\delta )}\right) + O\left( \sqrt{T \ln (T/\delta )}\right) \Bigg ) \\&= O\left( T^{2/3}\sqrt{\left( \frac{n}{E\lambda ^*}+1\right) E \ln (ST/\delta )}\right) . \end{aligned}$$

Finally, we recall that \(E = O(nm^2)\), \(m = O(B/T)\), and \(S = O\left( 2^{\min \{n-1,m\}}\right) \), so that \(\ln S \le O(m) = O(B/T)\). Substituting these relations into the above inequality, and noting that the dominant term under the square root simplifies as \(\frac{T^2}{B^2 \lambda ^*} \cdot n \left( \frac{B}{T}\right) ^3 = \frac{nB}{\lambda ^* T}\), we get

$$\begin{aligned} R(T)&\le O\left( T^{2/3}\sqrt{\left( \frac{T^2}{B^2 \lambda ^*}+1\right) n \left( \frac{B}{T}\right) ^3 \ln (T/\delta )}\right) \\&= O \left( T^{1/6}\sqrt{\frac{nB}{\lambda ^*} \ln \left( T/\delta \right) }\right) . \end{aligned}$$

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Leon, V., Etesami, S.R. Online Learning in Budget-Constrained Dynamic Colonel Blotto Games. Dyn Games Appl (2023). https://doi.org/10.1007/s13235-023-00518-7
