On the computation of Whittle’s index for Markovian restless bandits

Mathematical Methods of Operations Research

Abstract

The multi-armed restless bandit framework makes it possible to model a wide variety of decision-making problems in areas as diverse as industrial engineering, computer communication, operations research, financial engineering, and communication networks. In a seminal work, Whittle developed a methodology to derive well-performing (Whittle’s) index policies that are obtained by solving a relaxed version of the original problem. However, computing Whittle’s index is itself a difficult problem, and hence researchers have focused on calculating it numerically or with a problem-dependent approach. As our main contribution, we derive an analytical expression for Whittle’s index for any Markovian bandit with both finite and infinite transition rates. We derive sufficient conditions for the optimal solution of the relaxed problem to be of threshold type, and obtain conditions for the bandit to be indexable, a property assuring the existence of Whittle’s index. Our solution approach provides a unifying expression for Whittle’s index, which we highlight by retrieving known indices from the literature as particular cases. The applicability of finite rates is illustrated with the machine repairman problem, and that of infinite rates by an example from communication networks where transmission rates react instantaneously to packet losses.

Notes

  1. This can be shown by introducing so-called dummy bandits with zero cost and fixed state, see Verloop (2016).

References

  • Abbou A, Makis V (2019) Group maintenance: a restless bandits approach. INFORMS J Comput. https://doi.org/10.1287/ijoc.2018.0863

  • Altman E, Avrachenkov K, Garnaev A (2008) Generalized \(\alpha \)-fair resource allocation in wireless networks. In: 2008 47th IEEE conference on decision and control. IEEE, pp 2414–2419

  • Ansell PS, Glazebrook KD, Niño-Mora J, O’Keeffe M (2003) Whittle’s index policy for a multi-class queueing system with convex holding costs. Math Methods Oper Res 57(1):21–39

  • Argon NT, Ding L, Glazebrook KD, Ziya S (2009) Dynamic routing of customers with general delay costs in a multiserver queuing system. Probab Eng Inf Sci 23(2):175–203

  • Avrachenkov K, Ayesta U, Doncel J, Jacko P (2013) Congestion control of TCP flows in internet routers by means of index policy. Comput Netw 57(17):3463–3478

  • Avrachenkov KE, Piunovskiy A, Zhang Y (2018) Impulsive control for G-AIMD dynamics with relaxed and hard constraints. In: Proceedings of IEEE CDC, pp 880–887

  • Ayer T, Zhang C, Bonifonte A, Spaulding AC, Chhatwal J (2019) Prioritizing hepatitis C treatment in US prisons. Oper Res 67:853–873

  • Bolch G, Greiner S, de Meer H, Trivedi KS (2006) Queueing networks and Markov chains. Wiley, New York

  • Borkar VS, Pattathil S (2017) Whittle indexability in egalitarian processor sharing systems. Ann Oper Res 1–21. https://doi.org/10.1007/s10479-017-2622-0

  • Borkar VS, Kasbekar GS, Pattathil S, Shetty P (2017a) Opportunistic scheduling as restless bandits. IEEE Trans Control Netw Syst 5:1952–1961

  • Borkar VS, Ravikumar K, Saboo K (2017b) An index policy for dynamic pricing in cloud computing under price commitments. Appl Math 44:215–245

  • Dufour F, Piunovskiy AB (2015) Impulsive control for continuous-time Markov decision processes. Adv Appl Probab 47(1):106–127

  • Dufour F, Piunovskiy AB (2016) Impulsive control for continuous-time Markov decision processes: a linear programming approach. Appl Math Optim 74(1):129–161. https://doi.org/10.1007/s00245-015-9310-8

  • Ford S, Atkinson MP, Glazebrook K, Jacko P (2019) On the dynamic allocation of assets subject to failure. Eur J Oper Res. https://doi.org/10.1016/j.ejor.2019.12.018

  • Gittins J, Glazebrook K, Weber R (2011) Multi-armed bandit allocation indices. Wiley, New York

  • Glazebrook KD, Mitchell HM, Ansell PS (2005) Index policies for the maintenance of a collection of machines by a set of repairmen. Eur J Oper Res 165(1):267–284

  • Glazebrook KD, Kirkbride C, Ouenniche J (2009) Index policies for the admission control and routing of impatient customers to heterogeneous service stations. Oper Res 57(4):975–989

  • Graczová D, Jacko P (2014) Generalized restless bandits and the knapsack problem for perishable inventories. Oper Res 62(3):696–711

  • Jacko P (2010) Dynamic priority allocation in restless bandit models. Lambert Academic Publishing, Saarbrücken

  • Jacko P (2016) Resource capacity allocation to stochastic dynamic competitors: knapsack problem for perishable items and index-knapsack heuristic. Ann Oper Res 241(1–2):83–107

  • James T, Glazebrook K, Lin K (2016) Developing effective service policies for multiclass queues with abandonment: asymptotic optimality and approximate policy improvement. INFORMS J Comput 28(2):251–264

  • Larrañaga M, Boxma OJ, Núñez-Queija R, Squillante MS (2015) Efficient content delivery in the presence of impatient jobs. In: 2015 27th international conference on teletraffic congress (ITC 27). IEEE, pp 73–81

  • Larrañaga M, Ayesta U, Verloop IM (2016) Dynamic control of birth-and-death restless bandits: application to resource-allocation problems. IEEE ACM Trans Netw 24(6):3812–3825

  • Mo J, Walrand J (2000) Fair end-to-end window-based congestion control. IEEE ACM Trans Netw 5:556–567

  • Niño-Mora J (2002) Dynamic allocation indices for restless projects and queueing admission control: a polyhedral approach. Math Program 93(3):361–413

  • Niño-Mora J (2006) Restless bandit marginal productivity indices, diminishing returns, and optimal control of make-to-order/make-to-stock M/G/1 queues. Math Oper Res 31(1):50–84

  • Niño-Mora J (2007) Dynamic priority allocation via restless bandit marginal productivity indices. Top 15(2):161–198

  • Opp M, Glazebrook K, Kulkarni VG (2005) Outsourcing warranty repairs: dynamic allocation. Naval Res Logistics (NRL) 52(5):381–398

  • Papadimitriou CH, Tsitsiklis JN (1999) The complexity of optimal queuing network control. Math Oper Res 24(2):293–305

  • Pattathil S, Borkar VS, Kasbekar GS (2017) Distributed server allocation for content delivery networks. arXiv preprint arXiv:1710.11471

  • Piunovskiy A, Zhang Y (2020) On reducing a constrained gradual-impulsive control problem for a jump Markov model to a model with gradual control only. SIAM J Control Optim 58(1):192–214

  • Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York

  • Ruiz-Hernandez D (2008) Indexable restless bandits: index policies for some families of stochastic scheduling and dynamic allocation problems. VDM Verlag, Saarbrücken

  • Verloop IM (2016) Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann Appl Probab 26(4):1947–1995

  • Weber RR, Weiss G (1990) On an index policy for restless bandits. J Appl Probab 27(3):637–648

  • Whittle P (1988) Restless bandits: activity allocation in a changing world. J Appl Probab 25(A):287–298

Acknowledgements

We would like to thank Zhang Yi and Alexey Piunovskiy for helpful discussions on optimal impulse control. This research is partially supported by the French Agence Nationale de la Recherche (ANR) through the project ANR-15-CE25-0004 (ANR JCJC RACON) and by ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02. U. Ayesta has received funding from the Department of Education of the Basque Government through the Consolidated Research Group MATHMODE (IT1294-19).

Corresponding author

Correspondence to Manu K. Gupta.

Appendices

A Proof of Propositions

In this section we provide the proofs of the propositions. For ease of notation, we drop the subscript k throughout these proofs.

A.1 Proof of Proposition 1

Proof

Since \({\mathcal {U}}_{REL}\) is non-empty, there exists a stationary optimal policy \(\phi ^*\) that optimally solves the subproblem (6) for a bandit. Define \(n^*= \min \{ m\in \{ 0,1,\ldots \}: S^{\phi ^*}(m)=1 \}\). This implies \(S^{\phi ^*}(m)=0~\forall ~m<n^*\) and \(S^{\phi ^*}(n^*)=1\). From the structure on the transition rates and jump probabilities in (i) of Proposition 1, we have \(q_k^1(N, N+i) = 0,~\forall ~i\ge 1\), \( \ q_k^0(N, N+i) = 0,~\forall ~i\ge 2\), and \(p_k^a(N, N+i) =0,~\forall ~i\ge 1, a =0, 1\). The above transition structure ensures that all the states \(m> n^*\) are transient. Hence \(\pi ^{\phi ^*}(m)=0 ~\forall ~ m >n^*\). Thus, the following holds under the optimal policy \(\phi ^*\):

$$\begin{aligned} {\mathbb {E}}(C(N^{\phi ^*}, S^{\phi ^*}(\vec {N}^{\phi ^*})))= & {} \sum _{m=0}^{n^*-1}C(m,0)\pi ^{\phi ^*}(m) + C(n^*,1)\pi ^{\phi ^*}(n^*),\\ {\mathbb {E}}(f(N^{\phi ^*}, S^{\phi ^*}(\vec {N}^{\phi ^*})))= & {} \sum _{m=0}^{n^*-1}f(m,0)\pi ^{\phi ^*}(m) + f(n^*,1)\pi ^{\phi ^*}(n^*), \end{aligned}$$

and the lump-sum cost under the optimal policy \(\phi ^*\) is given by:

$$\begin{aligned} {\mathbb {E}}(C^{\infty , \phi ^*}({N}^{\phi ^*}, S^{\phi ^*}(\vec {N}^{\phi ^*})))= & {} \sum _{{\tilde{n}}}\sum _{m}{\mathbb {E}}\left( q^{S^{\phi ^*}(\vec {N}^{\phi ^*})}(N^{\phi ^*}, {\tilde{n}}) \times {\mathcal {I}}({\tilde{n}}, S^{\phi ^*}(\vec {M}^{\phi ^*}(\vec {N}^{\phi ^*}, {\tilde{n}})))\right. \\&\left. \times { p^{S^{\phi ^*}(\vec {M}^{\phi ^*}(\vec {N}^{\phi ^*}, {\tilde{n}}))}({\tilde{n}},m) \times L^\infty ({\tilde{n}}, m, S^{\phi ^*}(\vec {M}^{\phi ^*}(\vec {N}^{\phi ^*}, {\tilde{n}}))) }\right) . \end{aligned}$$

From Markov chain theory, the long-run average frequency of transitions from state x to state y under action a can be written as:

$$\begin{aligned} \lim _{N \rightarrow \infty }\frac{1}{N} \sum _{n=1}^N 1^a_{\{X_n=x,~ X_{n+1} = y\}}=\pi (x)q^a(x,y). \end{aligned}$$

Given that the parameters satisfy (i) of Proposition 1, the lump-sum cost can equivalently be written as:

$$\begin{aligned}&{\mathbb {E}}(C^{\infty , \phi ^*}({N}^{\phi ^*}, S^{\phi ^*}(\vec {N}^{\phi ^*})))\\&\quad = \sum _{n=0}^{n^*-1}\sum _{m=0}^n\left( p^0(n,m) L^\infty (n, m, 0)\left[ \sum _{k=0, k \ne n}^{n^* -1}q^0(k,n)\pi ^{\phi ^*}(k) + q^1(n^*,n)\pi ^{\phi ^*}(n^*)\right] \right) \\&\qquad + \sum _{m=0}^{n^*}\left( p^1(n^*,m) L^\infty (n^*, m, 1)\left[ \sum _{k=0}^{n^* -1}q^0(k,n^*)\pi ^{\phi ^*}(k)\right] \right) . \end{aligned}$$

In the above expected lump-sum cost, the first term is the contribution in cost due to transition from states \(0,1,2, \cdots , n^*-1\) and the second term is that for the transition from state \(n^*\). Additionally, it exploits the fact that \(\pi ^{\phi ^*}(m)=0 ~\forall ~ m >n^*\). It follows from the expressions of the above expected costs that the long run average cost under the optimal policy \(\phi ^*\),

$$\begin{aligned} {\mathbb {E}}(C(N^{\phi ^*}, S^{\phi ^*}(N^{\phi ^*})))~+{\mathbb {E}}(C^{\infty , \phi ^*}(N^{\phi ^*}, S^{\phi ^*}(N^{\phi ^*}))) - W {\mathbb {E}}(f(N^{\phi ^*}, S^{\phi ^*}(N^{\phi ^*}))), \end{aligned}$$

is the same as the long run average cost under a 0–1 type threshold policy with threshold \(n^*\),

$$\begin{aligned} {\mathbb {E}}(C(N^{n^*}, S^{n^*}(N^{n^*})))~+{\mathbb {E}}(C^{\infty , n^*}(N^{n^*}, S^{n^*}(N^{n^*}))) - W {\mathbb {E}}(f(N^{n^*}, S^{n^*}(N^{n^*}))). \end{aligned}$$

Thus, a 0–1 type of threshold policy with threshold \(n^*\) is optimal when (i) is satisfied. Under the alternative rates (ii), optimality of a 0–1 type threshold policy can be proven along similar lines by considering \(\max \{ m\in \{ 0,1,\ldots \}: S^{\phi ^*}(m)=0 \}\). \(\square \)

A.2 Proof of Proposition 2

Proof

We will focus on 0–1 type of threshold policies throughout the proof. The case of threshold policies of type 1–0 can be proven similarly. Since an optimal solution of problem (6) is of threshold type for a given subsidy W, the optimal average cost will be \(g(W) := \min \nolimits _n{g^{(n)}(W)}\) where

$$\begin{aligned} g^{(n)}(W) = {\mathbb {E}}(T(N^n, S^n(N^n))) - W {\mathbb {E}}(f(N^n, S^n(N^n))). \end{aligned}$$

We denote by n(W) the minimizing threshold, i.e., \(g(W) = g^{(n(W))}(W)\). Note that the function g(W) is a lower envelope of affine non-increasing functions of W, since \({\mathbb {E}}(f(\cdot ))\) is non-negative. It thus follows that g(W) is a concave non-increasing function.

It follows that the right derivative of g(W) in W is given by \(-{\mathbb {E}}(f(N^{n(W)}, S^{n(W)}(N^{n(W)})))\). Since g(W) is concave in W, the right derivative is non-increasing in W. Together with the fact \({\mathbb {E}}(f(N^n, S^n(N^n)))\) is strictly increasing in n, it hence follows that n(W) is non-decreasing in W. Since an optimal policy is of 0–1 threshold type, the set of states where it is optimal to be passive can be written as \(D(W) = \{m: m \le n(W)\}\). Since n(W) is non-decreasing, by definition this implies that bandit k is indexable. \(\square \)
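The envelope argument can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the lists `T` and `F` are hypothetical stand-ins for \({\mathbb {E}}(T(N^n, S^n(N^n)))\) and \({\mathbb {E}}(f(N^n, S^n(N^n)))\) under threshold n, and the function names are illustrative. Ties between minimizers are broken towards the largest threshold, which is an assumption of this sketch.

```python
from fractions import Fraction


def optimal_threshold(T, F, W):
    """Largest minimizer n(W) of g^(n)(W) = T[n] - W * F[n].

    T and F are hypothetical per-threshold expectations (illustrative data).
    """
    values = [t - W * f for t, f in zip(T, F)]
    best = min(values)
    # Assumption of this sketch: ties go to the largest threshold.
    return max(n for n, v in enumerate(values) if v == best)


def is_non_decreasing(seq):
    """True if every element is at most its successor."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


# Illustrative data with F strictly increasing in n, as required in the proof.
T = [Fraction(x) for x in (0, 1, 3, 6, 10)]
F = [Fraction(x, 10) for x in (1, 3, 5, 7, 9)]
subsidies = [Fraction(w, 2) for w in range(61)]  # W = 0, 0.5, ..., 30
thresholds = [optimal_threshold(T, F, W) for W in subsidies]
print(thresholds[0], thresholds[-1], is_non_decreasing(thresholds))
```

Exact rational arithmetic is used so that ties between thresholds are detected without floating-point tolerance.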

A.3 Proof of Proposition 3

We will focus on 0–1 type of threshold policies throughout the proof. Let \({\tilde{W}}(n)\) be the value for subsidy such that the average cost under threshold policy n is equal to that under threshold policy \(n-1\). By using (6), we have \({\mathbb {E}}(T(N^n, S^n(N^n))) - {\tilde{W}}(n) {\mathbb {E}}(f(N^n, S^n(N^n))) = {\mathbb {E}}(T(N^{n-1}, S^{n-1}(N^{n-1}))) - {\tilde{W}}(n) {\mathbb {E}}(f(N^{n-1}, S^{n-1}(N^{n-1})))\). Hence, \({\tilde{W}}(n)\) is given by,

$$\begin{aligned} \frac{{\mathbb {E}}(T^n(N^n, S^n({N}^n))) - {\mathbb {E}}(T^{n-1}(N^{{n-1}}, S^{{n-1}}({N}^{{n-1}})))}{{\mathbb {E}}(f(N^{n}, S^n({N}^{n}))) - {\mathbb {E}}(f(N^{{n-1}}, S^{{n-1}}({N}^{{n-1}})))}, \end{aligned}$$

which is the same as (7). Since \({\tilde{W}}(n)\) is monotone, it can be verified by exploiting threshold optimality that \(g({\tilde{W}}(n)) = g^{(n)}({\tilde{W}}(n)) = g^{(n-1)}({\tilde{W}}(n))\). Similarly, \(g({\tilde{W}}(n-1)) = g^{(n-1)}({\tilde{W}}(n-1)) = g^{(n-2)}({\tilde{W}}(n-1))\). Further, monotonicity of \({\tilde{W}}(n)\) implies the following two possibilities:

  1. Non-decreasing nature, i.e., \({\tilde{W}}(n-1)\le {\tilde{W}}(n)\).

  2. Non-increasing nature, i.e., \({\tilde{W}}(n-1)\ge {\tilde{W}}(n)\).

But \({\tilde{W}}(n-1)\ge {\tilde{W}}(n)\) results in a contradiction from indexability and 0–1 type of threshold optimality. Thus, \({\tilde{W}}(n)\) has to be non-decreasing, i.e., \({\tilde{W}}(n-1)\le {\tilde{W}}(n)\).

It follows from indexability and 0–1 type of threshold optimality that for all \(W \le {\tilde{W}}(n)\), the set of states where it is optimal to be passive, D(W), satisfies \(D(W) \subseteq \{m:m\le n-1\}\). Again from indexability in a similar way, \(D(W) \supseteq {\{m:m\le n-1\}}\) for all \( W \ge {\tilde{W}}(n-1)\). Thus, for \( {\tilde{W}}(n-1) \le W \le {\tilde{W}}(n)\), \(\{m:m\le n-1\} \subseteq D(W) \subseteq \{m:m\le n-1\}\) which implies that threshold policy \(n-1\) is optimal for all \({\tilde{W}}(n-1) \le W \le {\tilde{W}}(n)\) and hence \(g(W) = g^{(n-1)}(W)\) for \({\tilde{W}}(n-1) \le W \le {\tilde{W}}(n)\). Hence, \({{\tilde{W}}}(n)\) is the smallest value of the subsidy such that activating the bandit in state n becomes optimal, that is, Whittle’s index is given by \(W(n) = {\tilde{W}}(n)\).
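As a hedged illustration of the ratio defining \({\tilde{W}}(n)\), the sketch below computes it from hypothetical per-threshold expectations `T` and `F` (illustrative numbers, not from the paper) and checks that at \(W = {\tilde{W}}(n)\) thresholds \(n-1\) and n give the same average cost, which is also the minimum over all thresholds.

```python
from fractions import Fraction


def whittle_candidates(T, F):
    """Return [W~(1), W~(2), ...] with W~(n) = (T[n]-T[n-1]) / (F[n]-F[n-1])."""
    return [(T[n] - T[n - 1]) / (F[n] - F[n - 1]) for n in range(1, len(T))]


def average_cost(T, F, n, W):
    """g^(n)(W) = T[n] - W * F[n] for threshold n and subsidy W."""
    return T[n] - W * F[n]


# Hypothetical expectations under thresholds n = 0, ..., 4.
T = [Fraction(x) for x in (0, 1, 3, 6, 10)]
F = [Fraction(x, 10) for x in (1, 3, 5, 7, 9)]
W_tilde = whittle_candidates(T, F)  # W_tilde[n - 1] corresponds to W~(n)

for n, W in enumerate(W_tilde, start=1):
    g = min(average_cost(T, F, m, W) for m in range(len(T)))
    # Indifference between thresholds n-1 and n, both optimal at W~(n).
    assert average_cost(T, F, n, W) == average_cost(T, F, n - 1, W) == g
print(W_tilde)
```

In this example the candidates are non-decreasing in n, so each \({\tilde{W}}(n)\) can be read as the index of state n.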

B Machine repairman problem

In this section, we provide the details to obtain the stationary distribution, prove indexability and derive Whittle’s index for two specific models of the machine repairman problem of Sect. 5.

B.1 Stationary distribution

In this section, we determine the stationary distribution under a 0–1 type of threshold policy n. Thus, action \(a=0\) is taken in states \(0,1,2,\cdots , n\) and action \(a=1\) in states \(n+1, n+2,\ldots \)

The transition diagram of the Markov chain is shown in Fig. 2.

Fig. 2 Transition diagram under the threshold policy n for the machine repairman problem

The balance equations for the stationary distribution under the threshold policy n are given by

$$\begin{aligned} \lambda (0)\pi ^n(0)= & {} {\psi (1)\pi ^n(1) + \psi (2)\pi ^n(2) + \cdots + \psi (n)\pi ^n(n)+ r(n+1)\pi ^n(n+1)}, \nonumber \\ \lambda (m)\pi ^n(m)= & {} (\lambda (m+1) + \psi (m+1))\pi ^n(m+1)~\text { for }m=0,~1,~2,\ldots ,~n-1,\nonumber \\ \lambda (n)\pi ^n(n)= & {} r(n+1)\pi ^n({n+1}). \end{aligned}$$
(15)

Using the normalization condition \(\sum \nolimits _{m=0}^{n+1}\pi ^n(m) = 1\), one obtains

$$\begin{aligned} \pi _k^{n_k}(m_k)= & {} \frac{P_{m_k}}{\lambda _k(m_k)\left( \sum \nolimits _{i=0}^{n_k}\frac{P_i}{\lambda _k(i)} +\frac{P_{n_k}}{r_k(n_k+1)}\right) }~\forall ~m_k=0,1,2,\ldots n_k,\nonumber \\ \pi _k^{n_k}{(n_k+1)}= & {} \frac{P_{n_k}}{r_k(n_k+1)\left( \sum \nolimits _{i=0}^{n_k}\frac{P_i}{\lambda _k(i)} +\frac{P_{n_k}}{r_k(n_k+1)}\right) }, \nonumber \\ \pi _k^{n_k}(m_k)= & {} 0 ~\forall ~m_k = n_k+2, \cdots \end{aligned}$$
(16)

where \(P_i = \prod \nolimits _{j=1}^ip_k(j)\) and \(p_k(j) = \frac{\lambda _k(j)}{\lambda _k(j) + \psi _k(j)}\); \(P_0 = 1\).
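The closed form (16) can be checked against the balance equations (15) with exact arithmetic. The sketch below is illustrative only: the rate lists `lam`, `psi` and `r` are hypothetical integer values indexed by state (with `psi[0]` and `r[0]` unused placeholders), and the function names are not from the paper.

```python
from fractions import Fraction


def stationary_distribution(lam, psi, r, n):
    """Return [pi(0), ..., pi(n+1)] from Eq. (16) for threshold policy n."""
    P = [Fraction(1)]                        # P_0 = 1
    for j in range(1, n + 1):                # P_j = P_{j-1} * p(j)
        P.append(P[-1] * Fraction(lam[j], lam[j] + psi[j]))
    norm = sum(P[i] / lam[i] for i in range(n + 1)) + P[n] / r[n + 1]
    pi = [P[m] / (lam[m] * norm) for m in range(n + 1)]
    pi.append(P[n] / (r[n + 1] * norm))      # probability of state n+1
    return pi


def satisfies_balance(pi, lam, psi, r, n):
    """Check the three balance equations in (15) and the normalization."""
    inflow_0 = sum(psi[m] * pi[m] for m in range(1, n + 1)) + r[n + 1] * pi[n + 1]
    ok = lam[0] * pi[0] == inflow_0
    ok = ok and all(
        lam[m] * pi[m] == (lam[m + 1] + psi[m + 1]) * pi[m + 1] for m in range(n)
    )
    ok = ok and lam[n] * pi[n] == r[n + 1] * pi[n + 1]
    return ok and sum(pi) == 1


# Hypothetical rates: deterioration, breakdown and repair, indexed by state.
lam = [2, 3, 4, 5, 6]
psi = [0, 1, 1, 2, 2]
r = [0, 1, 2, 3, 4, 5]
pi = stationary_distribution(lam, psi, r, n=3)
print(pi, satisfies_balance(pi, lam, psi, r, n=3))
```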

B.2 Indexability

Lemma B.1

Machine k is indexable if the repair rates are non-decreasing in their state, i.e., \(r_k(n) \le r_k(n+1)~\forall ~n\), and \(r_k(1)>0\). In particular, all machines are indexable for state-independent repair rates.

Proof

From Proposition 2, it follows that machine k is indexable if \({\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n})))\) is strictly increasing in n. Recall that \(f_k(n, a) = {\mathbf {1}}_{\{ a = 0\}}\). Under a 0–1 type threshold policy with threshold n, we have

$$\begin{aligned} {\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n}))) =\sum \limits _{m=0}^{n}\pi _k^{n}(m). \end{aligned}$$

Thus, machine k is indexable if \(\sum \nolimits _{m=0}^{n}\pi _k^{n}(m)\) is strictly increasing in n. Since \(\pi _k^{n}(m)=0\) for \(m>n+1\), this is equivalent to proving that \(\pi _k^{n}(n+1)\) is strictly decreasing in n. From Eq. (16) and some algebra, we obtain that

$$\begin{aligned} \pi _k^{n}(n+1) - \pi _k^{n-1}(n) = \frac{P_{n-1}\left[ \left( \frac{\lambda _k(n)(r_k(n) -r_k(n+1))-\psi _k(n)r_k(n+1)}{\lambda _k(n) + \psi _k(n)}\right) \sum \nolimits _{i=0}^{n-1}\frac{P_i}{\lambda _k(i)} - r_k(n+1)\frac{ P_{n}}{\lambda _k(n)}\right] }{r_k(n)r_k(n+1)\left( \sum \nolimits _{i=0}^{n}\frac{P_i}{\lambda _k(i)} +\frac{P_{n}}{r_k(n+1)}\right) \left( \sum \nolimits _{i=0}^{n-1}\frac{P_i}{\lambda _k(i)} +\frac{P_{n-1}}{r_k(n)}\right) }. \end{aligned}$$

Note that the denominator is strictly positive. Since \(r_k(n)\) is non-decreasing and \(r_k(1)>0\), the numerator is strictly negative. Hence the result follows. \(\square \)
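The monotonicity used in this proof can be checked numerically. The following sketch evaluates \(\pi _k^{n}(n+1)\) from Eq. (16) for hypothetical integer rates with non-decreasing repair rates; the data and the function name are illustrative assumptions, not values from the paper.

```python
from fractions import Fraction


def prob_repair_state(lam, psi, r, n):
    """pi^n(n+1) from Eq. (16): stationary mass of the state under repair."""
    P = Fraction(1)          # running product P_i, with P_0 = 1
    total = Fraction(0)      # running sum of P_i / lam(i) for i = 0..n
    for i in range(n + 1):
        if i > 0:
            P *= Fraction(lam[i], lam[i] + psi[i])
        total += P / lam[i]
    return (P / r[n + 1]) / (total + P / r[n + 1])


# Hypothetical rates; r is non-decreasing with r[1] > 0 (psi[0], r[0] unused).
lam = [2, 3, 4, 5, 6]
psi = [0, 1, 1, 2, 2]
r = [0, 1, 2, 3, 4, 5]
masses = [prob_repair_state(lam, psi, r, n) for n in range(4)]
print(masses)
```

For these rates the sequence is strictly decreasing in n, in line with the lemma.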

B.3 Whittle’s index: Proof of Proposition 4

Since a 0–1 type of threshold policy is optimal, using Proposition 3, the Whittle index is given by Eq. (7), i.e.,

$$\begin{aligned} W_k(n) = \frac{{\mathbb {E}}(C_k(N_k^n, S_k^n(N_k^n))) - {\mathbb {E}}(C_k(N_k^{n-1}, S_k^{n-1}(N_k^{n-1})))}{\sum \nolimits _{m=0}^n\pi _k^n(m)-\sum \nolimits _{m=0}^{n-1}\pi _k^{n-1}(m)}, \end{aligned}$$
(17)

provided that (17) is non-decreasing in n. The expected cost under threshold policy n in the numerator is given by

$$\begin{aligned} {\mathbb {E}}(C_k(N_k^n, S_k^n(N_k^n))) = \sum \limits _{m=1}^{n}\left[ \psi _k(m)L_k^b(m) + C_k^{d}(m)\right] \pi _k^n(m) + r_k(n+1)L_k^r(n+1)\pi _k^n(n+1). \end{aligned}$$

Using the expression for the stationary distribution as derived in “Appendix B.1”, we obtain that the denominator of (17) simplifies to

$$\begin{aligned} \sum \limits _{i=0}^{n} \pi _k^{n}(i) - \sum \limits _{i=0}^{n-1} \pi _k^{{n}-1}(i) = \frac{\frac{P_{n-1}}{r_k(n)}\sum \nolimits _{i=0}^{n}\frac{P_i}{\lambda _k(i)} - \frac{P_{n}}{r_k(n+1)}\sum \nolimits _{i=0}^{n-1}\frac{ P_i}{\lambda _k(i)}}{\left( \sum \nolimits _{i=0}^{n}\frac{P_i}{\lambda _k(i)} +\frac{P_{n}}{r_k(n+1)}\right) \left( \sum \nolimits _{i=0}^{n-1}\frac{P_i}{\lambda _k(i)} +\frac{P_{n-1}}{r_k(n)}\right) }, \end{aligned}$$

where \(P_i = \prod \nolimits _{j=1}^ip_k(j)\) and \(p_k(j) = \frac{\lambda _k(j)}{\lambda _k(j) + \psi _k(j)}\); \(P_0 = 1\). After some algebra, we obtain that (17) simplifies to the one stated in Proposition 4.

B.4 Model 1: deterioration cost per unit

We now consider the particular case in which there are no breakdowns, i.e., \(\psi _k(n_k) = 0\) and \(L_k^b(n_k) = 0\). Then \(p_k(j) = 1\) and \(P_i =1\), and hence the expression in Proposition 4 simplifies to

$$\begin{aligned} \frac{ \left( \sum \nolimits _{i=1}^{n} \frac{C_k^{d}(i)}{\lambda _k(i)} + L_k^{r}(n+1)\right) \left( \sum \nolimits _{i=0}^{n-1}\frac{1}{\lambda _k(i)} + \frac{1}{r_k(n)}\right) - \left( \sum \nolimits _{i=1}^{n-1} \frac{C_k^{d}(i)}{\lambda _k(i)} + {L_k^{r}(n)}\right) \left( \sum \nolimits _{i=0}^{n}\frac{1}{\lambda _k(i)} + \frac{1}{r_k(n+1)}\right) }{\frac{1}{r_k(n)}\sum \nolimits _{i=0}^{n}\frac{1}{\lambda _k(i)} - \frac{1}{r_k(n+1)}\sum \nolimits _{i=0}^{n-1}\frac{1}{\lambda _k(i)}}. \end{aligned}$$
(18)

If in addition \(r_k(n)= r_k\) for all n, we obtain (after some algebra) from Eq. (18) that

$$\begin{aligned} W_k(n)= r_k\left[ \sum \limits _{i=0}^{n-1}\frac{C_k^{d}(n) - C_k^{d}(i)}{\lambda _k(i)} + \frac{C_k^{d}(n) - r_k L_k^{r}}{r_k}\right] . \end{aligned}$$
(19)

In addition,

$$\begin{aligned} W_k(n) - W_k(n+1) = r_k \left[ (C_k^{d}(n) -C_k^{d}(n+1))\left( \sum \limits _{i=0}^n \frac{1}{\lambda _k(i)} + \frac{1}{r_k} \right) \right] , \end{aligned}$$

which is non-positive when \(C_k^d(n)\) is non-decreasing in n, and strictly negative when it is strictly increasing.
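The reduction from Eq. (18) to Eq. (19) can be verified numerically. The sketch below is a hedged check under stated assumptions: integer rates, a state-independent repair rate `r` and repair cost `L`, and \(C_k^{d}(0) = 0\) (assumed here so that the \(i=0\) term in (19) matches the sums in (18)). All names and numbers are illustrative.

```python
from fractions import Fraction


def inv_sum(lam, m):
    """sum_{i=0}^{m} 1 / lam(i)."""
    return sum(Fraction(1, lam[i]) for i in range(m + 1))


def cost_sum(C, lam, m):
    """sum_{i=1}^{m} C(i) / lam(i)."""
    return sum(Fraction(C[i], lam[i]) for i in range(1, m + 1))


def index_general(n, lam, C, L, r):
    """Eq. (18), with state-dependent lists L[n] and r[n]."""
    num = (cost_sum(C, lam, n) + L[n + 1]) * (inv_sum(lam, n - 1) + Fraction(1, r[n]))
    num -= (cost_sum(C, lam, n - 1) + L[n]) * (inv_sum(lam, n) + Fraction(1, r[n + 1]))
    den = Fraction(1, r[n]) * inv_sum(lam, n) - Fraction(1, r[n + 1]) * inv_sum(lam, n - 1)
    return num / den


def index_constant_repair(n, lam, C, L, r):
    """Eq. (19), with scalar repair rate r and repair cost L."""
    s = sum(Fraction(C[n] - C[i], lam[i]) for i in range(n))
    return r * (s + Fraction(C[n] - r * L, r))


# Hypothetical data; C[0] = 0 is an assumption of this sketch.
lam = [2, 3, 4, 5, 6, 7]
C = [0, 1, 3, 6, 10, 15]
r_const, L_const = 3, 2
indices = [index_constant_repair(n, lam, C, L_const, r_const) for n in range(1, 5)]
print(indices)
```

With non-decreasing deterioration costs, as in this data, the resulting indices are non-decreasing in n.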

B.5 Model 2: lump-sum cost for breakdown

Here, we assume that \(C_k^{d}(n_k) = 0\), \(r_k(n)= r_k(n+1)=r_k\), \(L_k^r(n)= R_k, L_k^b(n) = B_k~\forall ~n\), and \(\psi _k(n)\) is an increasing sequence. From Proposition 4, Whittle’s index simplifies to (9). Hence, \(W_k(n) - W_k(n+1)\) simplifies to:

$$\begin{aligned} W_k(n) - W_k(n+1) = \frac{r_kB_k\left( \frac{P_n}{r_k} +\sum \nolimits _{i=0}^{n}\frac{P_i}{\lambda _k(i)} \right) \left( \frac{1}{\psi (n+1) }-\frac{1}{\psi (n)}\right) }{\left( (1-p_k(n))\sum \nolimits _{i=0}^{n}\frac{P_i}{\lambda _k(i)} + \frac{p_k(n)P_n}{\lambda _k(n)} \right) \left( (1-p_k(n+1))\sum \nolimits _{i=0}^{n+1}\frac{P_i}{\lambda _k(i)} + \frac{p_k(n+1)P_{n+1}}{\lambda _k(n+1)}\right) }, \end{aligned}$$

which is negative under the increasing breakdown rates assumption.

C Congestion control in TCP

In Sect. 6.1 we described a TCP model, where multiple users (flows) are trying to transmit packets through a bottleneck router as shown in Fig. 3.

Fig. 3 A bottleneck router in TCP with multiple flows (Avrachenkov et al. 2013)

C.1 Stationary distribution

Under a 1–0 type threshold policy n, action \(a=1\) is taken in states \(0,1,2, \cdots , n\) and action \(a=0\) in states \(n+1, n+2,\ldots \) When action \(a=0\) is taken in state \(n+1\), the state instantaneously changes to \(S:=\max \{\lfloor \gamma \cdot (n+1) \rfloor , 1\}\). Figure 4 shows the transition rates, and the stationary distribution is given by

$$\begin{aligned} \pi ^n(m)= & {} 0;~m = 0,1,2,\ldots ,S-1,\\ \pi ^n(m)= & {} \frac{1}{n-S+1};~m = S, S+1, \ldots ,n. \end{aligned}$$

Fig. 4 Transition diagram under the threshold policy n for the TCP congestion control problem

We then obtain

$$\begin{aligned} {\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n}))) =\sum \limits _{m=0}^{n}m\pi _k^{n}(m)= \frac{n^2 + n - S(S-1)}{2(n-S+1)}, \end{aligned}$$

where \(S = \max \{\lfloor \gamma _k \cdot (n+1) \rfloor , 1\}\). It can be easily argued that

$$\begin{aligned} {\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n}))) - {\mathbb {E}}(f_k(N_k^{n-1}, S_k^{n-1}({N}_k^{n-1}))) = 1/2>0. \end{aligned}$$
(20)

Thus, \({\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n})))\) is strictly increasing in n and the result follows.
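The closed form for \({\mathbb {E}}(f_k(N_k^{n}, S_k^{n}({N}_k^{n})))\) and its monotonicity in n can be checked directly from the uniform stationary distribution on \(\{S, \ldots , n\}\). The sketch below is illustrative: the value of `gamma` is a hypothetical choice, taken as an exact fraction to avoid rounding inside the floor, and the function names are not from the paper.

```python
import math
from fractions import Fraction


def reset_state(n, gamma):
    """S = max(floor(gamma * (n + 1)), 1), the state reached after a loss."""
    return max(math.floor(gamma * (n + 1)), 1)


def mean_state(n, gamma):
    """Mean of the uniform stationary distribution on {S, ..., n}."""
    S = reset_state(n, gamma)
    return Fraction(sum(range(S, n + 1)), n - S + 1)


def mean_state_closed_form(n, gamma):
    """(n^2 + n - S(S-1)) / (2(n - S + 1)), as stated in the text."""
    S = reset_state(n, gamma)
    return Fraction(n * n + n - S * (S - 1), 2 * (n - S + 1))


gamma = Fraction(1, 2)  # hypothetical multiplicative decrease factor
means = [mean_state(n, gamma) for n in range(1, 51)]
print(means[:5])
```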

C.2 Expression of Whittle’s index: Proof of Lemma 6.1

Since 1–0 type of threshold policies are optimal, using Proposition 3, the Whittle index is given by

$$\begin{aligned} W_k(n) = \frac{{\mathbb {E}}(T_k^n(N_k^n, S_k^n({N}_k^n))) - {\mathbb {E}}(T_k^{n-1}(N_k^{{n-1}}, S_k^{{n-1}}({N}_k^{{n-1}})))}{{\mathbb {E}}(f_k(N_k^{n}, S_k^n({N}_k^{n}))) - {\mathbb {E}}(f_k(N_k^{{n-1}}, S_k^{{n-1}}({N}_k^{{n-1}})))}, \end{aligned}$$
(21)

if Eq. (21) is non-increasing in n.

We have

$$\begin{aligned} {\mathbb {E}}(T_k^n(N_k^n, S_k^n({N}_k^n))) = \sum _{m=0}^nC_k(m,1)\lambda _k \pi _k^n(m) + C_k(n+1,0)\lambda _k\pi _k^n(n+1), \end{aligned}$$

which simplifies to

$$\begin{aligned} \mathbb {E}(T_k^n(N_k^n, S_k^n({N}_k^n))) = {\left\{ \begin{array}{ll} \frac{\lambda _k\sum \nolimits _{m=S}^{n}(1-(1+m)^{1-\alpha })}{(n-S+1)(1-\alpha )} &{} \quad \text {if } \alpha \ne 1,\\ \frac{-\lambda _k\sum \nolimits _{m=S}^{n}\log (1+m)}{(n-S+1)} &{} \quad \text {if } \alpha = 1; \end{array}\right. } \end{aligned}$$

Combining this with (20) and (21) yields the Whittle index as stated in Lemma 6.1.

D Content delivery network

We consider here the content delivery network as described in Sect. 6.2; see also Fig. 5.

Fig. 5 Optimal clearing framework as single-armed restless bandit

D.1 Stationary distribution

Under a 0–1 type of threshold policy n, action \(a=0\) is taken in states \(0,1,2\ldots , n\) and action \(a=1\) in states \(n+1, n+2,\ldots \) The transition diagram is shown in Fig. 6.

Fig. 6 Transition diagram under the threshold policy n in the content delivery network

The balance equations under this chain are

$$\begin{aligned} \pi ^n(0)\lambda (0)= & {} \pi ^n(1)\theta (1) + \lambda (n)\pi ^n(n), \end{aligned}$$
(22)
$$\begin{aligned} (\lambda (k) + \theta (k))\pi ^n(k)= & {} \lambda (k-1)\pi ^n({k-1})+ \theta (k+1)\pi ^n({k+1})\quad \text { for }\quad k=1,~2,\ldots , n-1, \end{aligned}$$
(23)
$$\begin{aligned} \lambda (n-1)\pi ^n({n-1})= & {} \theta (n)\pi ^n(n) + \lambda (n)\pi ^n(n), \end{aligned}$$
(24)

which together with the normalization condition \(\sum \nolimits _{i=0}^n\pi ^n(i) = 1\) results in the following stationary distribution:

$$\begin{aligned} \pi ^{n}(m)= & {} \frac{\pi ^{n}{(n)}\lambda (n)}{\lambda (m)}\left[ 1+\sum \limits _{i=1}^{n-m}p(m+1,m+i) \right] ~\forall ~m=0,1,2,\ldots n-1,\nonumber \\ \pi ^{n}{(n)}= & {} \left( 1+\sum \limits _{k=0}^{n-1}\frac{\lambda (n)}{\lambda (k)}\left[ 1+\sum \limits _{i=1}^{n-k}p(k+1,k+i)\right] \right) ^{-1}, \end{aligned}$$
(25)
$$\begin{aligned} \pi ^{n}(m)= & {} 0 ~\forall ~m = n+1, \cdots , \end{aligned}$$
(26)

where \(p(k+1, k+i) = \dfrac{\theta (k+1)\theta (k+2)\ldots \theta (k+i)}{\lambda (k+1)\lambda (k+2)\ldots \lambda (k+i)}~\forall ~i\ge 1\).

The summation term in the denominator of (25) is strictly increasing in n if \(\lambda (n)\) is non-decreasing. Thus, \(\pi ^{n}{(n)}\) is strictly decreasing in n when \(\lambda (n)\) is non-decreasing.
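The stationary distribution (25) can be checked against the balance equations (22)–(24). The following sketch uses hypothetical integer rates `lam` and `theta` indexed by state (`theta[0]` is an unused placeholder); the function names are illustrative and not from the paper.

```python
from fractions import Fraction


def cdn_stationary(lam, theta, n):
    """Return [pi(0), ..., pi(n)] from Eq. (25) for threshold policy n."""

    def p(a, b):
        # p(a, b) = prod_{j=a}^{b} theta(j) / lam(j)
        out = Fraction(1)
        for j in range(a, b + 1):
            out *= Fraction(theta[j], lam[j])
        return out

    def bracket(m):
        # 1 + sum_{i=1}^{n-m} p(m+1, m+i)
        return 1 + sum(p(m + 1, m + i) for i in range(1, n - m + 1))

    weights = [Fraction(lam[n], lam[m]) * bracket(m) for m in range(n)]
    pi_n = 1 / (1 + sum(weights))
    return [w * pi_n for w in weights] + [pi_n]


def satisfies_balance(pi, lam, theta, n):
    """Check Eqs. (22), (23), (24) and the normalization exactly."""
    ok = pi[0] * lam[0] == pi[1] * theta[1] + lam[n] * pi[n]
    ok = ok and all(
        (lam[k] + theta[k]) * pi[k] == lam[k - 1] * pi[k - 1] + theta[k + 1] * pi[k + 1]
        for k in range(1, n)
    )
    ok = ok and lam[n - 1] * pi[n - 1] == (theta[n] + lam[n]) * pi[n]
    return ok and sum(pi) == 1


# Hypothetical rates with lam non-decreasing in the state.
lam = [2, 3, 4, 5]
theta = [0, 1, 2, 3]
pi = cdn_stationary(lam, theta, n=3)
top_masses = [cdn_stationary(lam, theta, n)[n] for n in (1, 2, 3)]
print(pi, top_masses)
```

For these non-decreasing arrival rates, `top_masses` (the values of \(\pi ^{n}(n)\)) decreases in n, consistent with the remark above.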

D.2 Whittle’s index: Proof of Lemma 6.2

The expected cost under threshold policy n is given by

$$\begin{aligned} {\mathbb {E}}(T^n(N^n, S^n({N}^n))) = \sum \limits _{i=1}^{n}(iC^h(i) + \theta (i)L^a(i))\pi ^n(i) + \lambda (n)L_s^\infty (n+1) \pi ^{n}(n). \end{aligned}$$

Similarly, for the threshold policy \(n-1\)

$$\begin{aligned} {\mathbb {E}}(T^{n-1}(N^{n-1}, S^{n-1}({N}^{n-1})))= & {} \sum \limits _{i=1}^{n-1}(iC^h(i) + \theta (i)L^a(i))\pi ^{n-1}(i)\\&+ \lambda (n-1)L_s^\infty (n) \pi ^{{n-1}}(n-1). \end{aligned}$$

From Proposition 3, we get the expression as stated in Lemma 6.2.

Cite this article

Ayesta, U., Gupta, M.K. & Verloop, I.M. On the computation of Whittle’s index for Markovian restless bandits. Math Meth Oper Res 93, 179–208 (2021). https://doi.org/10.1007/s00186-020-00731-9
