Abstract
The multi-armed restless bandit framework allows one to model a wide variety of decision-making problems in areas as diverse as industrial engineering, computer communication, operations research, financial engineering and communication networks. In a seminal work, Whittle developed a methodology to derive well-performing (Whittle’s) index policies that are obtained by solving a relaxed version of the original problem. However, the computation of Whittle’s index itself is a difficult problem, and researchers have therefore focused on calculating it numerically or through problem-dependent approaches. As our main contribution, we derive an analytical expression for Whittle’s index for any Markovian bandit with both finite and infinite transition rates. We derive sufficient conditions for the optimal solution of the relaxed problem to be of threshold type, and obtain conditions for the bandit to be indexable, a property assuring the existence of Whittle’s index. Our solution approach provides a unifying expression for Whittle’s index, which we highlight by retrieving known indices from the literature as particular cases. The applicability of finite rates is illustrated with the machine repairman problem, and that of infinite rates with an example from communication networks in which transmission rates react instantaneously to packet losses.
Notes
This can be shown by introducing so-called dummy bandits with zero cost and fixed state, see Verloop (2016).
References
Abbou A, Makis V (2019) Group maintenance: a restless bandits approach. INFORMS J Comput. https://doi.org/10.1287/ijoc.2018.0863
Altman E, Avrachenkov K, Garnaev A (2008) Generalized \(\alpha \)-fair resource allocation in wireless networks. In: 2008 47th IEEE conference on decision and control. IEEE, pp 2414–2419
Ansell PS, Glazebrook KD, Niño-Mora J, O’Keeffe M (2003) Whittle’s index policy for a multi-class queueing system with convex holding costs. Math Methods Oper Res 57(1):21–39
Argon NT, Ding L, Glazebrook KD, Ziya S (2009) Dynamic routing of customers with general delay costs in a multiserver queuing system. Probab Eng Inf Sci 23(2):175–203
Avrachenkov K, Ayesta U, Doncel J, Jacko P (2013) Congestion control of TCP flows in internet routers by means of index policy. Comput Netw 57(17):3463–3478
Avrachenkov KE, Piunovskiy A, Zhang Y (2018) Impulsive control for G-AIMD dynamics with relaxed and hard constraints. In: Proceedings of IEEE CDC, pp 880–887
Ayer T, Zhang C, Bonifonte A, Spaulding AC, Chhatwal J (2019) Prioritizing hepatitis C treatment in US prisons. Oper Res 67:853–873
Bolch G, Greiner S, de Meer H, Trivedi KS (2006) Queueing networks and Markov chains. Wiley, New York
Borkar VS, Pattathil S (2017) Whittle indexability in egalitarian processor sharing systems. Ann Oper Res 1–21. https://doi.org/10.1007/s10479-017-2622-0
Borkar VS, Kasbekar GS, Pattathil S, Shetty P (2017a) Opportunistic scheduling as restless bandits. IEEE Trans Control Netw Syst 5:1952–1961
Borkar VS, Ravikumar K, Saboo K (2017b) An index policy for dynamic pricing in cloud computing under price commitments. Appl Math 44:215–245
Dufour F, Piunovskiy AB (2015) Impulsive control for continuous-time Markov decision processes. Adv Appl Probab 47(1):106–127
Dufour F, Piunovskiy AB (2016) Impulsive control for continuous-time Markov decision processes: a linear programming approach. Appl Math Optim 74(1):129–161. https://doi.org/10.1007/s00245-015-9310-8
Ford S, Atkinson MP, Glazebrook K, Jacko P (2019) On the dynamic allocation of assets subject to failure. Eur J Oper Res. https://doi.org/10.1016/j.ejor.2019.12.018
Gittins J, Glazebrook K, Weber R (2011) Multi-armed bandit allocation indices. Wiley, New York
Glazebrook KD, Mitchell HM, Ansell PS (2005) Index policies for the maintenance of a collection of machines by a set of repairmen. Eur J Oper Res 165(1):267–284
Glazebrook KD, Kirkbride C, Ouenniche J (2009) Index policies for the admission control and routing of impatient customers to heterogeneous service stations. Oper Res 57(4):975–989
Graczová D, Jacko P (2014) Generalized restless bandits and the knapsack problem for perishable inventories. Oper Res 62(3):696–711
Jacko P (2010) Dynamic priority allocation in restless bandit models. Lambert Academic Publishing, Saarbrücken
Jacko P (2016) Resource capacity allocation to stochastic dynamic competitors: knapsack problem for perishable items and index-knapsack heuristic. Ann Oper Res 241(1–2):83–107
James T, Glazebrook K, Lin K (2016) Developing effective service policies for multiclass queues with abandonment: asymptotic optimality and approximate policy improvement. INFORMS J Comput 28(2):251–264
Larrañaga M, Boxma OJ, Núñez-Queija R, Squillante MS (2015) Efficient content delivery in the presence of impatient jobs. In: 2015 27th International Teletraffic Congress (ITC 27). IEEE, pp 73–81
Larrañaga M, Ayesta U, Verloop IM (2016) Dynamic control of birth-and-death restless bandits: application to resource-allocation problems. IEEE ACM Trans Netw 24(6):3812–3825
Mo J, Walrand J (2000) Fair end-to-end window-based congestion control. IEEE ACM Trans Netw 5:556–567
Niño-Mora J (2002) Dynamic allocation indices for restless projects and queueing admission control: a polyhedral approach. Math Program 93(3):361–413
Niño-Mora J (2006) Restless bandit marginal productivity indices, diminishing returns, and optimal control of make-to-order/make-to-stock M/G/1 queues. Math Oper Res 31(1):50–84
Niño-Mora J (2007) Dynamic priority allocation via restless bandit marginal productivity indices. Top 15(2):161–198
Opp M, Glazebrook K, Kulkarni VG (2005) Outsourcing warranty repairs: dynamic allocation. Naval Res Logistics (NRL) 52(5):381–398
Papadimitriou CH, Tsitsiklis JN (1999) The complexity of optimal queuing network control. Math Oper Res 24(2):293–305
Pattathil S, Borkar VS, Kasbekar GS (2017) Distributed server allocation for content delivery networks. arXiv preprint arXiv:1710.11471
Piunovskiy A, Zhang Y (2020) On reducing a constrained gradual-impulsive control problem for a jump Markov model to a model with gradual control only. SIAM J Control Optim 58(1):192–214
Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Ruiz-Hernandez D (2008) Indexable restless bandits: index policies for some families of stochastic scheduling and dynamic allocation problems. VDM Verlag, Saarbrücken
Verloop IM (2016) Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann Appl Probab 26(4):1947–1995
Weber RR, Weiss G (1990) On an index policy for restless bandits. J Appl Probab 27(3):637–648
Whittle P (1988) Restless bandits: activity allocation in a changing world. J Appl Probab 25(A):287–298
Acknowledgements
We would like to thank Zhang Yi and Alexey Piunovskiy for helpful discussions on optimal impulse control. This research is partially supported by the French Agence Nationale de la Recherche (ANR) through the project ANR-15-CE25-0004 (ANR JCJC RACON) and by ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02. U. Ayesta has received funding from the Department of Education of the Basque Government through the Consolidated Research Group MATHMODE (IT1294-19).
Appendices
A Proofs of Propositions
In this section we provide the proofs of the different propositions. For ease of notation, we drop the subscript k throughout the proofs in this section.
A.1 Proof of Proposition 1
Proof
Since \({\mathcal {U}}_{REL}\) is non-empty, there exists a stationary policy \(\phi ^*\) that optimally solves the subproblem (6) for a bandit. Define \(n^*= \min \{ m\in \{ 0,1,\ldots \}: S^{\phi ^*}(m)=1 \}\). This implies \(S^{\phi ^*}(m)=0~\forall ~m<n^*\) and \(S^{\phi ^*}(n^*)=1\). From the structure of the transition rates and jump probabilities in (i) of Proposition 1, we have \(q_k^1(N, N+i) = 0,~\forall ~i\ge 1\), \( \ q_k^0(N, N+i) = 0,~\forall ~i\ge 2\), and \(p_k^a(N, N+i) =0,~\forall ~i\ge 1, a =0, 1\). This transition structure ensures that all states \(m> n^*\) are transient. Hence \(\pi ^{\phi ^*}(m)=0 ~\forall ~ m >n^*\). Thus, the following holds under the optimal policy \(\phi ^*\):
and the lump-sum cost under the optimal policy \(\phi ^*\) is given by:
From Markov chain theory, the average number of times state y is visited in the next decision epoch under action a given the current state x can be written as:
Given that the parameters satisfy (i) of Proposition 1, the lump-sum cost can equivalently be written as:
In the above expected lump-sum cost, the first term is the cost contribution due to transitions from states \(0,1,\ldots , n^*-1\) and the second term is that due to the transition from state \(n^*\). The expression additionally exploits the fact that \(\pi ^{\phi ^*}(m)=0 ~\forall ~ m >n^*\). It follows from the above expected-cost expressions that the long run average cost under the optimal policy \(\phi ^*\),
is the same as the long run average cost under a 0–1 type threshold policy with threshold \(n^*\),
Thus, a 0–1 type of threshold policy with threshold \(n^*\) is optimal when (i) is satisfied. The alternative rates in (ii) can be shown to result in 0–1 type threshold optimality along similar lines by considering \(n^* = \max \{ m\in \{ 0,1,\ldots \}: S^{\phi ^*}(m)=0 \}\). \(\square \)
A.2 Proof of Proposition 2
Proof
We will focus on 0–1 type of threshold policies throughout the proof. The case of threshold policies of type 1–0 can be proven similarly. Since an optimal solution of problem (6) is of threshold type for a given subsidy W, the optimal average cost will be \(g(W) := \min \nolimits _n{g^{(n)}(W)}\) where
We denote by n(W) the value of n that attains the minimum in the definition of g(W). Note that g(W) is a lower envelope of affine non-increasing functions of W, due to the non-negativity of \({\mathbb {E}}(f(\cdot ))\). It thus follows that g(W) is a concave non-increasing function.
It follows that the right derivative of g(W) with respect to W is given by \(-{\mathbb {E}}(f(N^{n(W)}, S^{n(W)}(N^{n(W)})))\). Since g(W) is concave in W, this right derivative is non-increasing in W. Together with the fact that \({\mathbb {E}}(f(N^n, S^n(N^n)))\) is strictly increasing in n, it follows that n(W) is non-decreasing in W. Since an optimal policy is of 0–1 threshold type, the set of states where it is optimal to be passive can be written as \(D(W) = \{m: m \le n(W)\}\). Since n(W) is non-decreasing, this by definition implies that bandit k is indexable. \(\square \)
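The lower-envelope argument above can be checked numerically. The following Python sketch is not part of the paper: the arrays ET and Ef are hypothetical stand-ins for \({\mathbb {E}}(T(N^n, S^n(N^n)))\) and \({\mathbb {E}}(f(N^n, S^n(N^n)))\), with Ef strictly increasing in n. It evaluates \(g(W)=\min _n\{{\mathbb {E}}(T(\cdot ))-W\,{\mathbb {E}}(f(\cdot ))\}\) on a grid of subsidies and verifies that the minimizing threshold n(W) is non-decreasing in W, which is exactly the indexability property established in the proof.

```python
import numpy as np

# Hypothetical per-threshold expectations (illustration only, not from the paper):
# ET[n] plays the role of E(T(N^n, S^n(N^n))), Ef[n] that of E(f(N^n, S^n(N^n))),
# with Ef strictly increasing in n as required by Proposition 2.
ET = np.array([5.0, 3.2, 2.5, 2.3, 2.6])
Ef = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

W_grid = np.linspace(-10.0, 10.0, 2001)
# g^(n)(W) = ET[n] - W * Ef[n]: affine, non-increasing in W; g(W) is their lower envelope.
g_n = ET[None, :] - W_grid[:, None] * Ef[None, :]
n_of_W = g_n.argmin(axis=1)  # minimizing threshold n(W)

# Indexability: the passive set D(W) = {m : m <= n(W)} only grows with the subsidy W.
assert np.all(np.diff(n_of_W) >= 0)
```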
A.3 Proof of Proposition 3
We will focus on 0–1 type of threshold policies throughout the proof. Let \({\tilde{W}}(n)\) be the value of the subsidy such that the average cost under threshold policy n equals that under threshold policy \(n-1\). By using (6), we have \({\mathbb {E}}(T(N^n, S^n(N^n))) - {\tilde{W}}(n) {\mathbb {E}}(f(N^n, S^n(N^n))) = {\mathbb {E}}(T(N^{n-1}, S^{n-1}(N^{n-1}))) - {\tilde{W}}(n) {\mathbb {E}}(f(N^{n-1}, S^{n-1}(N^{n-1})))\). Hence, \({\tilde{W}}(n)\) is given by
which is the same as (7). Since \({\tilde{W}}(n)\) is monotone, it can be verified, by exploiting threshold optimality, that \(g({\tilde{W}}(n)) = g^{(n)}({\tilde{W}}(n)) = g^{(n-1)}({\tilde{W}}(n))\). Similarly, \(g({\tilde{W}}(n-1)) = g^{(n-1)}({\tilde{W}}(n-1)) = g^{(n-2)}({\tilde{W}}(n-1))\). Further, monotonicity of \({\tilde{W}}(n)\) leaves the following two possibilities:
1. non-decreasing, i.e., \({\tilde{W}}(n-1)\le {\tilde{W}}(n)\);
2. non-increasing, i.e., \({\tilde{W}}(n-1)\ge {\tilde{W}}(n)\).
But \({\tilde{W}}(n-1)\ge {\tilde{W}}(n)\) leads to a contradiction with indexability and 0–1 type threshold optimality. Thus, \({\tilde{W}}(n)\) has to be non-decreasing, i.e., \({\tilde{W}}(n-1)\le {\tilde{W}}(n)\).
It follows from indexability and 0–1 type of threshold optimality that for all \(W \le {\tilde{W}}(n)\), the set of states where it is optimal to be passive, D(W), satisfies \(D(W) \subseteq \{m:m\le n-1\}\). Similarly, from indexability, \(D(W) \supseteq {\{m:m\le n-1\}}\) for all \( W \ge {\tilde{W}}(n-1)\). Thus, for \( {\tilde{W}}(n-1) \le W \le {\tilde{W}}(n)\) we have \(D(W) = \{m:m\le n-1\}\), which implies that threshold policy \(n-1\) is optimal for all \({\tilde{W}}(n-1) \le W \le {\tilde{W}}(n)\), and hence \(g(W) = g^{(n-1)}(W)\) for \({\tilde{W}}(n-1) \le W \le {\tilde{W}}(n)\). Hence, \({{\tilde{W}}}(n)\) is the value of the subsidy at which both actions are optimal in state n, that is, the smallest subsidy for which it is optimal not to activate the bandit in state n. Whittle’s index is therefore given by \(W(n) = {\tilde{W}}(n)\). \(\square \)
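For computational purposes, the characterization above reduces to a ratio of finite differences. The following is a minimal Python sketch (not from the paper), assuming the per-threshold expectations ET[n] and Ef[n] have already been computed, for instance from the stationary distributions derived in Appendices B–D.

```python
def whittle_index(ET, Ef, n):
    """Subsidy W~(n) that equalizes the average cost under threshold policies
    n and n-1, in the spirit of Eq. (7): the ratio of the difference in expected
    cost to the difference in the expected fraction of time spent passive."""
    return (ET[n] - ET[n - 1]) / (Ef[n] - Ef[n - 1])

# Under indexability and 0-1 type threshold optimality, the resulting sequence
# whittle_index(ET, Ef, 1), whittle_index(ET, Ef, 2), ... is non-decreasing.
```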
B Machine repairman problem
In this section, we provide the details to obtain the stationary distribution, prove indexability and derive Whittle’s index for two specific models of the machine repairman problem of Sect. 5.
B.1 Stationary distribution
In this section, we determine the stationary distribution under a 0–1 type of threshold policy n. Thus, action \(a=0\) is taken in states \(0,1,\ldots , n\) and action \(a=1\) in states \(n+1, n+2,\ldots \).
The transition diagram for the evolution of the Markov chain is shown in Fig. 2.
The balance equations for the stationary distribution under the threshold policy n are given by
Using \(\sum \nolimits _{m=1}^{n+1}\pi ^n(m) = 1\), one obtains
where \(P_i = \prod \nolimits _{j=1}^ip_k(j)\) and \(p_k(j) = \frac{\lambda _k(j)}{\lambda _k(j) + \psi _k(j)}\); \(P_0 = 1\).
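Since Fig. 2 and the explicit expression (16) are not reproduced here, the following Python sketch only illustrates how such a stationary distribution can be obtained numerically: it solves \(\pi Q = 0\) for a generator matrix Q that the reader must build to match the transition diagram (an assumption of this sketch), and it computes the products \(P_i\) defined above.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve pi @ Q = 0 with sum(pi) = 1 for a finite CTMC generator matrix Q.
    Q must be constructed to match the transition diagram of Fig. 2 under the
    chosen threshold policy (not reproduced here)."""
    d = Q.shape[0]
    A = np.vstack([Q.T, np.ones(d)])
    b = np.zeros(d + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def P(i, lam, psi):
    """P_i = prod_{j=1}^{i} lambda_k(j) / (lambda_k(j) + psi_k(j)); P_0 = 1."""
    return float(np.prod([lam(j) / (lam(j) + psi(j)) for j in range(1, i + 1)]))
```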
B.2 Indexability
Lemma B.1
Machine k is indexable if the repair rates are non-decreasing in their state, i.e., \(r_k(n) \le r_k(n+1)~\forall ~n\), and \(r_k(1)>0\). In particular, all machines are indexable for state-independent repair rates.
Proof
From Proposition 3, it follows that machine k is indexable if \({\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n})))\) is strictly increasing in n. Recall that \(f_k(n, a) = {\mathbf {1}}_{\{ a = 0\}}\). Under the 0–1 type of threshold policy with threshold n, we have
Thus, machine k is indexable if \(\sum \nolimits _{m=0}^{n}\pi _k^{n}(m)\) is strictly increasing in n. Since \(\pi _k^{n}(m)=0\) for \(m>n+1\), this is equivalent to proving that \(\pi _k^{n}(n+1)\) is strictly decreasing in n. From Eq. (16) and some algebra, we obtain that
Note that the denominator is strictly positive. Since \(r_k(n)\) is non-decreasing and \(r_k(1)>0\), the numerator is strictly negative, and the result follows. \(\square \)
B.3 Whittle’s index: Proof of Proposition 4
Since a 0–1 type of threshold policy is optimal, using Proposition 3, the Whittle index is given by Eq. (7), i.e.,
if (17) is non-decreasing in n. The expected cost under threshold policy n in the numerator is given by
Using the expression for the stationary distribution as derived in “Appendix B.1”, we obtain that the denominator of (17) simplifies to
where \(P_i = \prod \nolimits _{j=1}^ip_k(j)\) and \(p_k(j) = \frac{\lambda _k(j)}{\lambda _k(j) + \psi _k(j)}\); \(P_0 = 1\). After some algebra, we obtain that (17) simplifies to the expression stated in Proposition 4.
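As a numerical companion to this derivation (a sketch with hypothetical helper names, not the paper’s code): once the stationary distributions \(\pi _k^n\) of Appendix B.1 are available, the denominator of (17) can be evaluated directly, since \({\mathbb {E}}(f_k(N_k^{n}, S_k^{n}(N_k^{n}))) = \sum _{m=0}^{n}\pi _k^{n}(m)\) by the proof of Lemma B.1, and the index follows from the ratio in Eq. (7).

```python
def passive_fraction(pi_n, n):
    """E(f_k(N^n, S^n(N^n))) = sum_{m=0}^{n} pi^n(m): the long-run fraction of
    time the machine is passive under threshold policy n."""
    return float(sum(pi_n[: n + 1]))

def repairman_index(ET, pi_by_threshold, n):
    """Whittle's index via the ratio of differences of Eq. (7), with ET[n] the
    expected cost under threshold n and pi_by_threshold[n] its stationary
    distribution (both computed beforehand, e.g. as in Appendix B.1)."""
    dEf = passive_fraction(pi_by_threshold[n], n) - passive_fraction(pi_by_threshold[n - 1], n - 1)
    return (ET[n] - ET[n - 1]) / dEf
```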
B.4 Model 1: deterioration cost per unit
We now consider the particular case in which there are no breakdowns, that is, \(\psi _k(n_k) = 0\) and \(L_k^b(n_k) = 0\). In this case \(p_k(j) = 1\) and \(P_i =1\), and hence the expression in Proposition 4 simplifies to
If in addition \(r_k(n)= r_k\) for all n, we obtain (after some algebra) from Eq. (18) that
In addition,
which is negative when \(C_k^d(n)\) is non-decreasing.
B.5 Model 2: lump-sum cost for breakdown
Here, we assume that \(C_k^{d}(n_k) = 0\), \(r_k(n)= r_k(n+1)=r_k\), \(L_k^r(n)= R_k, L_k^b(n) = B_k~\forall ~n\), and \(\psi _k(n)\) is an increasing sequence. From Proposition 4, Whittle’s index simplifies to (9). Hence, \(W_k(n) - W_k(n+1)\) simplifies to:
which is negative under the increasing breakdown rates assumption.
C Congestion control in TCP
In Sect. 6.1 we described a TCP model in which multiple users (flows) transmit packets through a bottleneck router, as shown in Fig. 3.
Fig. 3 A bottleneck router in TCP with multiple flows (Avrachenkov et al. 2013)
C.1 Stationary distribution
Under a 1–0 type threshold policy n, action \(a=1\) is taken in states \(0,1,\ldots , n\) and action \(a=0\) in states \(n+1, n+2,\ldots \). When action \(a=0\) is taken in state \(n+1\), the state instantaneously changes to \(S:=\max \{\lfloor \gamma _k (n+1) \rfloor , 1\}\). Figure 4 shows the transition rates, and the stationary distribution is given by
We then obtain
where \(S = \max \{\lfloor \gamma _k (n+1) \rfloor , 1\}\). It can easily be argued that
Thus, \({\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n})))\) is strictly increasing in n and the result follows.
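The only non-standard ingredient of this chain is the instantaneous multiplicative jump. As a small illustration (a sketch only; treating \(\gamma _k\) as a given parameter in (0, 1), which is an assumption of this example), the jump target S can be computed as follows.

```python
import math

def jump_target(n, gamma):
    """State reached instantaneously when the passive action is taken in state
    n + 1: S = max(floor(gamma * (n + 1)), 1), i.e. a multiplicative decrease
    of the congestion window."""
    return max(math.floor(gamma * (n + 1)), 1)

# Example (assumption: gamma = 0.5, a TCP-like halving):
# jump_target(8, 0.5) == 4, i.e. the passive action in state 9 moves the flow to state 4.
```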
C.2 Expression of Whittle’s index: Proof of Lemma 6.1
Since 1–0 type of threshold policies are optimal, using Proposition 3, the Whittle index is given by
if Eq. (21) is non-increasing in n.
We have
which simplifies to
Together with (20) and (21), this results in the Whittle index as stated in Lemma 6.1.
D Content delivery network
We consider here the content delivery network described in Sect. 6.2; see also Fig. 5.
D.1 Stationary distribution
Under a 0–1 type of threshold policy n, action \(a=0\) is taken in states \(0,1,\ldots , n\) and action \(a=1\) in states \(n+1, n+2,\ldots \). The transition diagram is shown in Fig. 6.
The balance equations for this chain are
which together with the normalization condition \(\sum \nolimits _{i=0}^n\pi ^n(i) = 1\) results in the following stationary distribution:
where \(p(k+1, k+i) = \dfrac{\theta (k+1)\theta (k+2)\ldots \theta (k+i)}{\lambda (k+1)\lambda (k+2)\ldots \lambda (k+i)}~\forall ~i\ge 1\).
The summation term in the denominator of (25) is strictly increasing in n if \(\lambda (n)\) is non-decreasing. Thus, \(\pi ^{n}{(n)}\) is strictly decreasing in n under the non-decreasing assumption on \(\lambda (n)\).
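Since the explicit expression (25) is not reproduced here, the following is only a small Python sketch (with a hypothetical function name) of how the ratio products entering the stationary distribution can be evaluated; the monotonicity observation above can then be checked numerically for given rate functions \(\theta \) and \(\lambda \).

```python
import math

def p_ratio(k, i, theta, lam):
    """p(k+1, k+i) = prod_{j=1}^{i} theta(k+j) / lambda(k+j), for i >= 1,
    with theta and lam supplied as rate functions of the state."""
    return math.prod(theta(k + j) / lam(k + j) for j in range(1, i + 1))
```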
D.2 Whittle’s index: Proof of Lemma 6.2
The expected cost under threshold policy n is given by
Similarly, for the threshold policy \(n-1\)
From Proposition 3, we get the expression as stated in Lemma 6.2.