A conservative index heuristic for routing problems with multiple heterogeneous service facilities

We consider a queueing system with N heterogeneous service facilities, in which admission and routing decisions are made when customers arrive and the objective is to maximize long-run average net rewards. For this type of problem, it is well-known that structural properties of optimal policies are difficult to prove in general and dynamic programming methods are computationally infeasible unless N is small. In the absence of an optimal policy to refer to, the Whittle index heuristic (originating from the literature on multi-armed bandit problems) is one approach which might be used for decision-making. After establishing the required indexability property, we show that the Whittle heuristic possesses certain structural properties which do not extend to optimal policies, except in some special cases. We also present results from numerical experiments which demonstrate that, in addition to being consistently strong over all parameter sets, the Whittle heuristic tends to be more robust than other heuristics with respect to the number of service facilities and the amount of heterogeneity between the facilities.


Introduction
Many routing problems involving multiple queueing facilities can be mathematically formulated as Markov decision processes (MDPs) and solved exactly using well-known dynamic programming (DP) techniques such as value iteration and policy improvement (Bellman 1957; Howard 1960; Puterman 1994). Unfortunately, these techniques are usually of no practical use for solving problems which are modelled on real-world applications. The size and complexity of the state space that one must consider has been cited as one of the curses of dimensionality which usually impede exact solution attempts (Sutton and Barto 1998; Powell 2007).
To address this problem, there are various strategies that one might employ. Depending on the particular features of the problem under consideration, it may be possible to simplify the search by proving that optimal policies must have certain characteristics. However, even if this is possible, one may need to resort to the use of heuristics which can be relied upon to find near-optimal policies in a time-efficient manner. In this paper we discuss the use of 'index-based' heuristics and consider their application to a problem involving a Markovian queueing system with heterogeneous service facilities arranged in parallel, each with its own queue and multiple servers available.
We consider systems which can be modelled as shown in Fig. 1. Customers arriving at the 'routing point' must either be sent to one of the N service facilities, in which case they wait in the queue for that facility until a server becomes available, or rejected without receiving service. A fixed, facility-dependent reward is earned each time a customer is served, but holding costs are also incurred and depend linearly on the waiting times of customers. A detailed formulation is provided in Sect. 2.
The fact that we consider heterogeneous facilities, each with the ability to serve multiple customers at once, makes public service systems a natural application area for our work. In the context of healthcare systems, for example, a patient seeking an elective knee operation might be faced with a choice of different treatment providers which differ from each other with respect to the quality of treatment provided, the number of beds available, the expected length of stay and other factors. It is well-known that for systems such as these, the performance of the system (measured, for example, by the long-run average net reward per unit time after deducting holding costs) is not optimized by allowing customers to make 'selfish' decisions based only on their own interests (Bell and Stidham 1983; Hassin and Haviv 2003; Haviv and Roughgarden 2007; Knight and Harper 2013; Shone et al. 2016; Knight et al. 2017). Instead, customers must be directed to follow an optimal (sometimes referred to as a 'socially optimal') policy which takes into account the effects of decisions on customers who have yet to arrive. It is the need to find, or approximate, a socially optimal policy that we focus on in this paper.
For routing problems involving multiple service facilities in parallel, optimal routing policies do not necessarily have simple characterizations. "Join the Shortest Queue" (JSQ) rules often apply to systems in which all facilities are identical (see Winston 1977; Weber 1978; Hordijk and Koole 1990; Menich and Serfozo 1991; Koole et al. 1999), but even in these cases an optimal policy may have counter-intuitive properties (Whitt 1986). In the case of heterogeneous facilities, some compelling results have been obtained for single-server queues (Hordijk and Koole 1992; Ha 1997), but in general the analysis is much more difficult. Consequently, previous research has tended to focus on developing heuristic (sub-optimal) routing policies. For example, policies based on applying one step of policy iteration to a 'Bernoulli splitting' policy have been shown to perform strongly in various contexts (Krishnan 1990; Sassen et al. 1997; Ansell et al. 2003a; Bhulai and Koole 2003; Argon et al. 2009; Hyytia et al. 2012).
The heuristic policy to be developed in this paper has its origins in the work of Gittins (1979) and Whittle (1988) on deriving 'dynamic allocation indices' for multi-armed bandit processes. Informally speaking, an index-based policy is a policy which associates a certain, easily-computable score or index to the various possible decision options in any given system state, and then chooses the option with the highest index. In order for an index-based policy to be applicable to a particular problem, the property of indexability must first be established. Although this property is not trivial to prove in general, it has been proven to hold in various problems involving queueing or inventory control (see Ansell et al. 2003b; Archibald et al. 2009; Glazebrook et al. 2009, 2011, 2014; Hodge and Glazebrook 2011; Nino-Mora 2012; Larranaga et al. 2016).
The modus operandi of the particular type of index policy that we focus on in this paper, which we shall refer to as the Whittle index heuristic, was first described in general terms by Whittle (1988) and has been shown to yield strong-performing policies in various types of problems involving dynamic resource allocation. Nino-Mora (2002) studied a similar problem to ours, with the evolution of 'service facilities' described by birth-death processes, and derived general expressions for the indices upon which decisions are based. Argon et al. (2009) also considered a routing problem in which customers are routed to single-server facilities. Their model does not include admission control, but does include more general cost structures and different customer types. Some recent applications for the Whittle heuristic in the literature include egalitarian processor sharing systems (Borkar and Pattathil 2017), scheduling for time-varying channels (Aalto et al. 2016), partially observed binary Markov chains (Borkar 2017), scheduling of information transmissions (Hsu 2018), allocation of scarce resources to a large number of jobs (Li et al. 2020) and allocation of assets prone to failure (Ford et al. 2020).
In the context of the routing problem that we consider in this paper, an advantage of using a Whittle index heuristic is that we are able to show that the resulting index policies possess certain intuitively 'nice' structural properties which usually cannot be proven to hold for optimal policies. Since the heuristic policies appear to perform very strongly in numerical experiments, this provides justification for searching for optimal policies within a reduced class of solutions which possess these 'nice' properties (this is a topic of ongoing work). The main contributions of our paper are as follows:
- We confirm (Sect. 3) that the service facilities in our problem are indexable and derive expressions for the Whittle indices in terms of the system parameters.
- We show (Sect. 4) that the positive recurrent state space under an arbitrary optimal stationary policy is bounded between two finite sets, one of which is derived using the Whittle indices.
- We show (Sect. 4) that the Whittle heuristic policy becomes asymptotically optimal in light-traffic and heavy-traffic limits.
- We prove (Sect. 4) that the Whittle heuristic policy possesses various other 'intuitive' structural properties and provide counter-examples to show that these may not hold in general for optimal policies.
- We introduce (Sect. 5) an alternative index-based heuristic based on one-step policy improvement, and show that it possesses asymptotic light-traffic (but not heavy-traffic) optimality.
- We present (Sect. 6) results from numerical experiments to compare the performance of the Whittle heuristic policy with those of optimal policies (where feasible) and alternative heuristic policies.
In the next section we provide a detailed formulation of the multiple-facility routing problem considered throughout this paper.

Problem formulation
We consider a queueing system with $N$ service facilities, each of which has its own queue and a first-come-first-served (FCFS) queue discipline. Customers arrive from outside the system according to a Poisson process with demand rate $\lambda > 0$. Newly-arrived customers may proceed to any one of the $N$ service facilities or, alternatively, exit the system immediately without incurring any cost or reward (referred to as balking). Thus, there are $N + 1$ possible destinations for any individual customer. An individual facility $i \in \{1, 2, \ldots, N\}$ possesses $c_i$ identical service channels, and service times at any channel of facility $i$ are exponentially distributed with mean $\mu_i^{-1} > 0$. The system earns a fixed reward $\alpha_i > 0$ for each customer who completes service at facility $i \in \{1, 2, \ldots, N\}$, but also incurs a linear holding cost $\beta_i > 0$ per unit time for each customer waiting at the facility (whether in the queue or in service). We also assume that $\alpha_i > \beta_i/\mu_i$ for each $i \in \{1, 2, \ldots, N\}$ in order to ensure that rewards adequately compensate customers for their own expected service costs (otherwise we would have redundancy in the system).
The system is fully observable, in the sense that the number of customers present at each facility is always known and can be used to inform decision-making. We do not make any assumptions as to whether decisions are made by customers themselves or whether they are directed by a central controller, since this depends on the physical context of the problem and does not affect our results in this paper. If customers make their own decisions, then an optimal policy can be regarded as that which arises from customers co-operating with each other to maximize their collective welfare.
Let $S = \{(x_1, x_2, \ldots, x_N) : x_i \in \mathbb{N}_0 \text{ for each } i \in \{1, 2, \ldots, N\}\}$ denote the state space of the system, where $x_i$ is the number of customers present at facility $i$ (including those being served). A system state is denoted by a vector $\mathbf{x} \in S$. We also use $\mathbf{x}^{i+}$ (resp. $\mathbf{x}^{i-}$) to denote the state which is identical to $\mathbf{x}$ except that one extra customer (resp. one fewer customer) is present at facility $i$. That is:
$$\mathbf{x}^{i+} = \mathbf{x} + \mathbf{e}_i, \qquad \mathbf{x}^{i-} = \mathbf{x} - \mathbf{e}_i,$$
where $\mathbf{e}_i$ is the $i$th vector in the standard orthonormal basis of $\mathbb{R}^N$. The sum of the infinitesimal transition rates under any state $\mathbf{x} \in S$ is bounded above by $\lambda + \sum_{i=1}^{N} c_i \mu_i$. We can therefore apply the process of uniformization (Lippman 1975; Serfozo 1979) to formulate the system as a discrete-time Markov Decision Process (MDP) in which the action space, transition probabilities and single-step reward function are defined as follows:
- Under any state $\mathbf{x} \in S$, the action space is $A = \{0, 1, \ldots, N\}$. An action $a \in A$ represents the decision of the next customer to arrive in the system. If $a = 0$ then the customer balks, and if $a = i$ for some $i \in \{1, 2, \ldots, N\}$ then the customer goes to facility $i$.
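- The transition probability of going from state $\mathbf{x} \in S$ to state $\mathbf{y} \in S \setminus \{\mathbf{x}\}$ in a single discrete time step, given that action $a \in A$ has been chosen, is denoted by $p(\mathbf{x}, a, \mathbf{y})$, where
$$p(\mathbf{x}, a, \mathbf{y}) = \begin{cases} \lambda \Delta, & \text{if } a = i \text{ and } \mathbf{y} = \mathbf{x}^{i+} \text{ for some } i \in \{1, 2, \ldots, N\}, \\ \min(x_i, c_i)\,\mu_i \Delta, & \text{if } \mathbf{y} = \mathbf{x}^{i-} \text{ for some } i \in \{1, 2, \ldots, N\}, \\ 0, & \text{otherwise,} \end{cases}$$
and $\Delta \in \left(0, \left(\lambda + \sum_{i=1}^{N} c_i \mu_i\right)^{-1}\right]$ is the discrete-time step size. Note that, following uniformization, we assume that at most one random event (either an arrival or a service completion) may occur in a single discrete time step. A 'self-transition' from state $\mathbf{x} \in S$ to itself may occur if no random event takes place (or if an arriving customer balks), and the relevant transition probability is
$$p(\mathbf{x}, a, \mathbf{x}) = 1 - I(a \neq 0)\,\lambda \Delta - \sum_{i=1}^{N} \min(x_i, c_i)\,\mu_i \Delta,$$
where $I$ denotes the indicator function.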
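- Under state $\mathbf{x} \in S$, rewards are being earned at a total rate of $\sum_{i=1}^{N} \alpha_i \min(x_i, c_i)\mu_i$ and holding costs are being incurred at a rate $\sum_{i=1}^{N} \beta_i x_i$. We therefore formulate the single-step reward function as
$$r(\mathbf{x}) = \sum_{i=1}^{N} \left(\alpha_i \min(x_i, c_i)\,\mu_i - \beta_i x_i\right).$$

For simplicity we will assume throughout this paper that $\Delta = \left(\lambda + \sum_{i=1}^{N} c_i \mu_i\right)^{-1} = 1$, since the units of time are arbitrary. We define an optimal policy as a decision-making rule $\theta$ which maximizes the long-run average net reward per unit time, defined as
$$g_\theta(\mathbf{x}_0) = \liminf_{n \to \infty} \frac{1}{n}\, \mathbb{E}_\theta\!\left[\sum_{k=0}^{n-1} r(\mathbf{x}_k) \,\Big|\, \mathbf{x}_0\right], \tag{3}$$
where $\mathbf{x}_n$ denotes the state of the system at the $n$th discrete time step (with $\mathbf{x}_0$ as the initial state). Let $w_a(\mathbf{x})$ denote an individual customer's expected net reward for choosing action $a \in \{0, 1, \ldots, N\}$ under state $\mathbf{x} \in S$. If the action chosen is to join some facility $i \in \{1, \ldots, N\}$ at which $x_i$ customers are already present, then the expected waiting time (including service) is $1/\mu_i$ if $x_i < c_i$ and $(x_i + 1)/(c_i \mu_i)$ otherwise. Hence:
$$w_a(\mathbf{x}) = \begin{cases} \alpha_a - \beta_a/\mu_a, & \text{if } a \in \{1, 2, \ldots, N\} \text{ and } x_a < c_a, \\ \alpha_a - \beta_a (x_a + 1)/(c_a \mu_a), & \text{if } a \in \{1, 2, \ldots, N\} \text{ and } x_a \ge c_a, \\ 0, & \text{if } a = 0. \end{cases} \tag{4}$$
Also, let $\tilde{\theta}$ denote the 'selfish' (myopic) policy which operates in such a way that any customer arriving in the system chooses the action $a \in \{0, 1, \ldots, N\}$ which maximizes $w_a(\mathbf{x})$, with ties broken arbitrarily except that we assume balking ($a = 0$) is chosen only if $w_i(\mathbf{x}) < 0$ for all $i \in \{1, 2, \ldots, N\}$. By considering the inequalities $w_i(\mathbf{x}) \ge 0$ we can show that the maximum number of customers at facility $i$ under policy $\tilde{\theta}$ is $\lfloor \alpha_i c_i \mu_i / \beta_i \rfloor$, where $\lfloor \cdot \rfloor$ denotes the floor function. Hence, the set of positive recurrent states under $\tilde{\theta}$ is $\tilde{S}$, where
$$\tilde{S} = \left\{(x_1, x_2, \ldots, x_N) \in S : x_i \le \lfloor \alpha_i c_i \mu_i / \beta_i \rfloor \text{ for all } i \in \{1, 2, \ldots, N\}\right\}. \tag{5}$$
It is proved in Shone et al. (2016, Theorem 2) that there always exists a stationary policy $\theta^* : S \to A$ which maximizes (3). Furthermore, if $S_{\theta^*}$ denotes the set of positive recurrent states under a particular optimal stationary policy $\theta^*$, then
$$S_{\theta^*} \subseteq \tilde{S}. \tag{6}$$
This may be described as the 'containment property' of socially optimal policies.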
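The selfish threshold can be checked numerically. The sketch below evaluates the individual net reward in (4) and counts how far a facility fills up when customers join whenever that reward is non-negative. The function names and the two parameter sets are hypothetical, chosen only for illustration.

```python
from math import floor

def individual_net_reward(x_a, alpha, beta, mu, c):
    """Expected net reward, as in (4), for joining a facility with x_a customers present."""
    if x_a < c:
        return alpha - beta / mu                  # a server is free: only a service time is spent
    return alpha - beta * (x_a + 1) / (c * mu)    # must wait for earlier customers as well

def selfish_capacity(alpha, beta, mu, c):
    """Largest head-count reached when customers join whenever their net reward is >= 0."""
    x = 0
    while individual_net_reward(x, alpha, beta, mu, c) >= 0:
        x += 1
    return x

# Hypothetical facilities, each given as (alpha, beta, mu, c) with alpha > beta / mu.
facilities = [(10.0, 4.0, 1.0, 2), (6.0, 5.0, 2.0, 1)]
caps = [selfish_capacity(*f) for f in facilities]
print(caps)
```

For these parameters the direct search agrees with the floor expression $\lfloor \alpha_i c_i \mu_i / \beta_i \rfloor$, so the selfish state space contains $6 \times 3 = 18$ states.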

The Whittle index heuristic
The main theoretical contributions of our paper (to follow in Sect. 4) rely upon the service facilities having a property referred to as indexability, and the resulting development of a heuristic routing policy based on optimal admission policies for individual facilities. Indexability is not necessarily a trivial property to prove in general settings, and sufficient conditions for this property to hold have been well-studied in the literature; see, for example, Bertsimas and Nino-Mora (1996) and Nino-Mora (2001, 2002) for the development of polyhedral methods. As noted by Glazebrook et al. (2009), however, it is often possible to provide simple, direct proofs of indexability in specific problems where the index for resource $i$ (or, in our case, facility $i$) has a natural interpretation as a fair charge for utilization. Fortunately, in our model it is straightforward to establish the indexability property using a simple geometrical argument. Full details follow later in this section, but first we introduce the Lagrangian relaxation upon which the index heuristic is based. Let $\Theta$ denote the class of stationary policies under which our $N$-facility queueing system is stable and has a stationary distribution; that is, if $\theta \in \Theta$ then the distribution $\{\pi_\theta(\mathbf{x})\}_{\mathbf{x} \in S}$ exists, where $\pi_\theta(\mathbf{x})$ is the steady-state probability of the system being in state $\mathbf{x} \in S$ under $\theta$ and $\sum_{\mathbf{x} \in S} \pi_\theta(\mathbf{x}) = 1$. We note that although $\lambda$ can be arbitrarily large, $\Theta$ is always non-empty since it includes the trivial policy which chooses to balk at all states in $S$.
For each policy $\theta \in \Theta$ and facility $i \in \{1, 2, \ldots, N\}$, let $\eta_i(\theta)$ denote the effective queue-joining rate per unit time at facility $i$ under $\theta$ (i.e. the long-run average number of customers joining facility $i$ per unit time), and let $L_i(\theta)$ denote the long-run average number of customers present at facility $i$ under $\theta$. Then the long-run average reward $g_\theta$ under policy $\theta$ is independent of the initial state $\mathbf{x}_0$ and may be expressed in the form
$$g_\theta = \sum_{i=1}^{N} \left(\alpha_i \eta_i(\theta) - \beta_i L_i(\theta)\right). \tag{7}$$
Closed-form expressions for $\eta_i(\theta)$ and $L_i(\theta)$ are unattainable in general when $N \ge 2$, but the expression on the right-hand side of (7) will nevertheless prove useful. We note that under any policy $\theta \in \Theta$, the sum of the effective queue-joining rates $\eta_i(\theta)$ at the various facilities must be bounded above by the system demand rate. That is:
$$\sum_{i=1}^{N} \eta_i(\theta) \le \lambda. \tag{8}$$
Following Whittle (1988), we consider a Lagrangian relaxation of our original problem involving an expanded class of stationary policies $\Theta'$ which are at liberty to 'break' the natural physical restrictions of the system by sending a newly-arrived customer to any subset of the set of facilities $\{1, 2, \ldots, N\}$. That is, for each state $\mathbf{x} \in S$, the action $\theta'(\mathbf{x})$ chosen by a policy $\theta' \in \Theta'$ satisfies
$$\theta'(\mathbf{x}) \in \mathcal{P}(\{1, 2, \ldots, N\}),$$
where $\mathcal{P}(\{1, 2, \ldots, N\})$ is the power set (i.e. the set of all subsets, including the empty set) of $\{1, 2, \ldots, N\}$. Conceptually, one now considers a new optimization problem in which the option is available to produce 'copies' of each customer who arrives, and send these copies to any number of facilities (at most one copy per facility). For each state $\mathbf{x} \in S$, $\theta'(\mathbf{x})$ is the set of facilities which, under the policy $\theta' \in \Theta'$, receive (a copy of) a new customer if an arrival occurs under state $\mathbf{x}$.
We incorporate the constraint (8) in a Lagrangian fashion by letting $\hat{g}(W)$ denote the optimal expected long-run average reward for the new (relaxed) problem, defined as
$$\hat{g}(W) = \sup_{\theta' \in \Theta'} \left\{ \sum_{i=1}^{N} \left(\alpha_i \eta_i(\theta') - \beta_i L_i(\theta')\right) + W\left(\lambda - \sum_{i=1}^{N} \eta_i(\theta')\right) \right\}, \tag{9}$$
where $W \in \mathbb{R}$ is a Lagrange multiplier. Clearly, any policy $\theta$ belonging to the class of policies $\Theta$ for the original problem may be represented by a policy $\theta'$ in the new class $\Theta'$ for which the cardinality of $\theta'(\mathbf{x})$ is either 1 or 0 at all states $\mathbf{x} \in S$. Since the multiplier term in (9) is non-negative for any such policy by (8), it follows that, for $W \ge 0$,
$$\hat{g}(W) \ge g^*,$$
where $g^* = \sup_{\theta \in \Theta} g_\theta$ is the optimal expected long-run average reward in the original problem. One can re-write (9) in an equivalent form:
$$\hat{g}(W) = \sup_{\theta' \in \Theta'} \left\{ \sum_{i=1}^{N} \left((\alpha_i - W)\,\eta_i(\theta') - \beta_i L_i(\theta')\right) \right\} + \lambda W.$$
Then, as in Glazebrook et al. (2009) (for example), one obtains a facility-wise decomposition:
$$\hat{g}(W) = \sum_{i=1}^{N} \hat{g}_i(W) + \lambda W,$$
where, for each facility $i \in \{1, 2, \ldots, N\}$,
$$\hat{g}_i(W) = \sup_{\theta_i \in \Theta_i} \left\{ (\alpha_i - W)\,\eta_i(\theta_i) - \beta_i L_i(\theta_i) \right\}.$$
Here, $\Theta_i$ (for $i = 1, 2, \ldots, N$) is a class of stationary policies which choose either to accept a customer (denoted by 1) or reject (denoted by 0) at any given state. Since the relaxation of the problem allows newly-arrived customers to be sent to any subset of the $N$ facilities, the decision of whether or not to admit a customer at some facility $i \in \{1, 2, \ldots, N\}$ can be made independently of the decisions made in regard to the other facilities $j \neq i$. It follows that an optimal solution to the relaxed $N$-facility problem can be found by solving $N$ independent single-facility admission control problems. For each facility $i \in \{1, 2, \ldots, N\}$, the corresponding single-facility problem involves customers arriving according to a Poisson process with a demand rate $\lambda$ (the same demand rate as for the $N$-facility problem), $c_i$ service channels, and exponentially-distributed service times with mean $\mu_i^{-1}$. The holding cost is $\beta_i$ per customer per unit time, but importantly the reward for service is now $\alpha_i - W$ as opposed to $\alpha_i$. Hence, it is natural to interpret $W$ as an extra charge for admitting a customer; see Fig. 2.
The single-facility problem described above exactly fits the formulation of Sect. 2 (with $N = 1$), except that the reward $\alpha_i - W$ may not be positive. Interpreting Theorem 2 from Shone et al. (2016) in the context of a single-facility problem, we can be assured that there exists an average reward optimal threshold policy. This means that a customer arriving at the facility balks if and only if the system state $x$ equals or exceeds the integer threshold $n$, i.e. $x \ge n$. We will refer to such a policy as an $n$-threshold policy. Let $\theta_i^*$ denote an optimal threshold policy at facility $i$, given an entry charge $W$. This means that $\theta_i^*$ chooses an action $a \in \{0, 1\}$ in response to an input $(x, W) \in \mathbb{N}_0 \times \mathbb{R}$, where $x$ is the observed state and $W$ is the entry charge. Re-interpreting the summary measures $\eta_i(\cdot)$ and $L_i(\cdot)$ so that they are now functions of policies $\theta_i$ belonging to the set $\Theta_i$ associated with the single-facility problem, it follows that
$$\hat{g}_i(W) = (\alpha_i - W)\,\eta_i(\theta_i^*) - \beta_i L_i(\theta_i^*),$$
the optimal average reward for facility $i$.
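To make the single-facility problem concrete, the following sketch finds an optimal threshold by brute force. It assumes that an $n$-threshold policy turns the facility into a finite birth-death chain on $\{0, \ldots, n\}$ and that the average reward of that policy is $(\alpha_i - W)\eta_i - \beta_i L_i$. The function names, the search cap `T_max` and the parameter values are hypothetical.

```python
def stationary_dist(lam, mu, c, T):
    """Steady-state distribution of an M/M/c facility that rejects arrivals at states >= T."""
    weights = [1.0]
    for y in range(1, T + 1):
        weights.append(weights[-1] * lam / (min(y, c) * mu))  # birth rate over death rate
    total = sum(weights)
    return [w / total for w in weights]

def average_reward(T, W, lam, mu, c, alpha, beta):
    """Long-run average reward of the T-threshold policy when the entry charge is W."""
    pi = stationary_dist(lam, mu, c, T)
    eta = lam * (1.0 - pi[T])                    # effective joining rate
    L = sum(y * p for y, p in enumerate(pi))     # mean number of customers present
    return (alpha - W) * eta - beta * L

def smallest_optimal_threshold(W, lam, mu, c, alpha, beta, T_max=50):
    """Smallest threshold in {0, ..., T_max} attaining the best average reward."""
    rewards = [average_reward(T, W, lam, mu, c, alpha, beta) for T in range(T_max + 1)]
    best = max(rewards)
    return next(T for T, g in enumerate(rewards) if g >= best - 1e-12)

# Hypothetical facility: lambda = 3, mu = 1, c = 2, alpha = 10, beta = 4.
params = dict(lam=3.0, mu=1.0, c=2, alpha=10.0, beta=4.0)
thresholds = [smallest_optimal_threshold(W, **params) for W in (0.0, 2.0, 4.0, 6.0, 8.0)]
print(thresholds)
```

In this example the optimal threshold never increases as the charge $W$ grows, and it drops to zero once $W$ reaches $\alpha - \beta/\mu = 6$.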
Next, we recall that the facility $i \in \{1, 2, \ldots, N\}$ was arbitrary in this discussion and let $\theta_1^*, \theta_2^*, \ldots, \theta_N^*$ be optimal threshold policies at the various facilities. Also, let $\theta^*$ be a stationary policy belonging to the expanded class $\Theta'$ which operates in such a way that, for each state $\mathbf{x} \in S$,
$$\theta^*(\mathbf{x}) = \left\{ i \in \{1, 2, \ldots, N\} : \theta_i^*(x_i, W) = 1 \right\}.$$
That is, each time a new customer arrives, they are sent to all of the facilities $i \in \{1, 2, \ldots, N\}$ at which the optimal threshold policy $\theta_i^*$ would choose to accept a customer. By the previous arguments, $\theta^*$ attains average reward optimality in the relaxed version of the problem.
In order to derive the Whittle heuristic for the original $N$-facility problem, it remains to establish the connection between this heuristic and the optimal solutions for the relaxed version of the problem discussed thus far. The Whittle heuristic relies upon the notion of indexability of a service facility, which we define [in line with Whittle (1988)] as follows: For each facility $i \in \{1, 2, \ldots, N\}$, let $T_i^*(W)$ denote the smallest integer $n$ such that an $n$-threshold policy achieves average reward optimality in a single-facility problem with demand rate $\lambda$, $c_i$ service channels, service rate $\mu_i$, holding cost $\beta_i$ and reward for service $\alpha_i - W$. Then we can show that the indexability property holds for facility $i$ if and only if $T_i^*(W)$ satisfies the following properties: In the event that the indexability property holds, we refer to the critical value $W_i(x)$ as the Whittle index for facility $i$ and state $x$. Although indexability is not trivial to prove in general, the property has been shown to hold in various problems involving queueing or inventory control (see Nino-Mora 2002; Ansell et al. 2003b; Archibald et al. 2009; Glazebrook et al. 2009; Argon et al. 2009; Hodge and Glazebrook 2011 and references therein). The next result confirms that the facilities in our problem are indexable, and also provides an expression for the Whittle index $W_i(x)$ in terms of the system parameters. Proof of the lemma can be found in Appendix A.
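Lemma 1 For each facility $i \in \{1, 2, \ldots, N\}$, the indexability property holds and the Whittle index for state $x \in \mathbb{N}_0$ is given by
$$W_i(x) = \alpha_i - \beta_i\, \frac{\sum_{y=0}^{x+1} y\,\pi_i(y, x+1) - \sum_{y=0}^{x} y\,\pi_i(y, x)}{\lambda \left(\pi_i(x, x) - \pi_i(x+1, x+1)\right)}, \tag{13}$$
where $\pi_i(y, T)$ denotes the steady-state probability of facility $i$ being in state $y \in \mathbb{N}_0$, given that a threshold of $T$ is applied.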
We note that convenient formulae for $\pi_i(y, T)$ are available from finite-buffer M/M/c queueing theory (see, for example, Gross and Harris 1998):
$$\pi_i(y, T) = \begin{cases} \pi_i(0, T)\, \dfrac{(\lambda/\mu_i)^y}{y!}, & 0 \le y \le \min(c_i, T), \\[2mm] \pi_i(0, T)\, \dfrac{(\lambda/\mu_i)^y}{c_i!\; c_i^{\,y - c_i}}, & c_i < y \le T, \\[2mm] 0, & y > T, \end{cases}$$
where $\pi_i(0, T)$ is determined by the normalization condition $\sum_{y=0}^{T} \pi_i(y, T) = 1$.

Several remarks should be made at this point. Firstly, equation (13) can be found in a more general form in Corollary 7.1 of Nino-Mora (2012). Secondly, the expression $\sum_{y=0}^{x} y\,\pi_i(y, x)$ which appears on the right-hand side of (13) is simply the expected number of customers present at facility $i$ given a threshold of $x$. In the special case where $x < c_i$, no customer ever has to queue, so Little's formula implies that this expected number equals the effective queue-joining rate divided by $\mu_i$, and hence we obtain
$$W_i(x) = \alpha_i - \frac{\beta_i}{\mu_i}, \qquad x < c_i. \tag{14}$$
Thus, the Whittle index at states $x < c_i$ is equal to a customer's expected net reward for joining facility $i$. As a further remark, suppose we have a single-server facility ($c_i = 1$). Then it is straightforward to apply results for finite-buffer M/M/1 queues in order to obtain a closed-form expression for $W_i(x)$ in terms of $\rho_i = \lambda/\mu_i$. This is analogous to the equation (7.3) given in Nino-Mora (2002), except that their result is given in the context of minimizing holding costs (without a reward for service). A similar result can also be found by considering equation (18) in Argon et al. (2009) and setting (in their notation) $\alpha = 0$, $\beta = \lambda/\mu = \rho$ and $c(i) = (i + 1)h/\mu$.

In the light of Definition 1, we can obtain an optimal policy $\theta^*$ for the relaxed $N$-facility problem by specifying its decision at state $\mathbf{x} \in S$ as follows:
$$\theta^*(\mathbf{x}) = \left\{ i \in \{1, 2, \ldots, N\} : W_i(x_i) > W \right\}.$$
As observed in Argon et al. (2009) and Glazebrook et al. (2009), the fact that an optimal solution to the relaxed problem may be described using the Whittle indices makes it logical to propose a heuristic policy for the original $N$-facility problem, which involves sending any new customer who arrives under state $\mathbf{x} \in S$ to a facility $i$ which maximizes $W_i(x_i)$, or choosing to balk if none of the $W_i(x_i)$ values are positive.
The optimality of such a policy cannot be guaranteed, but its intuitive justification lies in the fact that $W_i(x_i)$, when positive, is a measure of the amount by which the 'charge for admission' $W$ would need to be increased before the optimal policy $\theta^*$ for the relaxed problem would choose not to admit a customer to facility $i$. Thus, $W_i(x_i)$ may be regarded somewhat crudely as a measure of the margin by which one would be 'in favor' of having an extra customer present at facility $i$. A similar interpretation is that $W_i(x_i)$ is a 'fair charge' for admitting a customer to facility $i$ when there are $x_i$ customers already present.
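The 'fair charge' interpretation suggests a direct way to compute an index numerically. The sketch below assumes that $W_i(x)$ is the charge at which the $x$-threshold and $(x+1)$-threshold policies earn the same average reward, so that $W_i(x) = \alpha_i - \beta_i\,\Delta L / \Delta\eta$, where $\Delta L$ and $\Delta\eta$ are the changes in mean occupancy and effective joining rate. Function names and parameter values are hypothetical.

```python
def stationary_dist(lam, mu, c, T):
    """Steady-state distribution of an M/M/c facility that rejects arrivals at states >= T."""
    weights = [1.0]
    for y in range(1, T + 1):
        weights.append(weights[-1] * lam / (min(y, c) * mu))
    total = sum(weights)
    return [w / total for w in weights]

def mean_occupancy(pi):
    return sum(y * p for y, p in enumerate(pi))

def whittle_index(x, lam, mu, c, alpha, beta):
    """Charge at which thresholds x and x + 1 tie (assumed characterization of the index)."""
    low = stationary_dist(lam, mu, c, x)
    high = stationary_dist(lam, mu, c, x + 1)
    d_eta = lam * (low[x] - high[x + 1])              # extra joining rate from admitting at x
    d_L = mean_occupancy(high) - mean_occupancy(low)  # extra congestion from admitting at x
    return alpha - beta * d_L / d_eta

# Hypothetical facility: lambda = 3, mu = 1, c = 2, alpha = 10, beta = 4.
indices = [whittle_index(x, 3.0, 1.0, 2, 10.0, 4.0) for x in range(6)]
print([round(w, 3) for w in indices])
```

For states with an idle server ($x < 2$ here) the computed index equals $\alpha - \beta/\mu = 6$, in agreement with (14), and the indices decrease as the facility fills up.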
The Whittle index heuristic policy $\theta^{[W]}$ (hereafter referred to as the Whittle policy) for our original $N$-facility routing problem is defined below.
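Definition 2 (Whittle index policy) At any given state $\mathbf{x} \in S$, the Whittle index policy $\theta^{[W]}$ chooses an action as follows:
$$\theta^{[W]}(\mathbf{x}) \in \begin{cases} \arg\max_{i \in \{1, 2, \ldots, N\}} W_i(x_i), & \text{if } W_i(x_i) > 0 \text{ for some } i \in \{1, 2, \ldots, N\}, \\ \{0\}, & \text{otherwise,} \end{cases} \tag{15}$$
where $W_i(x)$ is defined in (13). In cases where two or more facilities attain the maximum in (15), it will be assumed that a decision is made according to some pre-determined ranking order of the $N$ facilities.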
We note that, for any state $\mathbf{x} = (x_1, \ldots, x_N) \in S$, there is an equivalence between the following three statements: 3. Any optimal stationary policy for a single-facility problem with parameters corresponding to those of facility $i \in \{1, 2, \ldots, N\}$ is a threshold policy with threshold greater than $x_i$.
We will make use of this equivalence in several of our later proofs.
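The decision rule of the Whittle policy is simple enough to state in a few lines. The sketch below assumes the indices have already been tabulated per facility, and it takes the pre-determined ranking order to be the facility numbering. The function name and the index values in `tables` are hypothetical and serve only to illustrate the rule.

```python
def whittle_action(state, index_tables):
    """Return 0 (balk) or the 1-based label of the facility with the largest positive index."""
    best_facility, best_index = 0, 0.0
    for i, (x_i, table) in enumerate(zip(state, index_tables), start=1):
        w = table[x_i]
        if w > best_index:      # strict '>' keeps the lower-numbered facility on ties
            best_facility, best_index = i, w
    return best_facility

# Hypothetical index values W_i(x) for x = 0, 1, 2 at two facilities.
tables = [[6.0, 6.0, -0.8], [3.5, 1.0, -1.5]]
for state in [(0, 0), (2, 0), (2, 1), (2, 2)]:
    print(state, whittle_action(state, tables))
```

Because each facility contributes an index that depends only on its own head-count, the cost of a decision grows linearly in $N$ once the tables are available.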
In the next section we investigate the similarities and differences between the Whittle index policy and an optimal policy which maximizes (3).

Structural and asymptotic properties of the index heuristic
Suppose we have a stationary policy, $\theta^*$, which is optimal under the average reward criterion. In this section we will present several counter-examples to show that $\theta^*$ may possess surprising and counter-intuitive structural properties. Indeed, there is little that can be proved about $\theta^*$ in general. However, it is possible to show that the positive recurrent state space under $\theta^*$ may be bounded by two finite sets. Let the sets $S^\circ$ and $S_{\theta^*}$ be defined as follows:
$$S^\circ = \{\mathbf{x} \in S : x_i \le c_i \text{ for all } i \in \{1, 2, \ldots, N\}\}, \qquad S_{\theta^*} = \{\mathbf{x} \in S : \mathbf{x} \text{ is positive recurrent under } \theta^*\}. \tag{16}$$
Also, let $\tilde{S}$ be the 'selfish' state space defined in (5). The following relationship may be proved to hold for any optimal stationary policy $\theta^*$:
$$S^\circ \subseteq S_{\theta^*} \subseteq \tilde{S}. \tag{17}$$
Indeed, the fact that $S_{\theta^*} \subseteq \tilde{S}$ has been proved in Shone et al. (2016) (Lemma 6). By using this result and also showing that optimal policies never choose to balk if there is an idle server available at one of the $N$ facilities, the lower bound $S^\circ \subseteq S_{\theta^*}$ can be established. A full proof can be found in Appendix B. We refer to the property $S^\circ \subseteq S_{\theta^*}$ as the non-idling property of optimal policies.
We have not stated (17) as a theorem because it can be regarded as a corollary of a stronger result, which follows next. It is possible to use the structural properties of the Whittle index policy to obtain an improved lower bound for $S_{\theta^*}$. Throughout the rest of this section we will use $\theta^{[W]}(\mathbf{x}) \in \{0, 1, \ldots, N\}$ to denote the action chosen by the Whittle policy $\theta^{[W]}$ in response to an observed state $\mathbf{x} \in S$, and we will also use $S_W$ to denote the set of states in $S$ which are positive recurrent under $\theta^{[W]}$. The following lemma is needed:

Lemma 2 Let $\theta^*$ be an optimal stationary policy. Then, for any $\mathbf{x} \in S_{\theta^*}$,
$$\theta^*(\mathbf{x}) = 0 \;\Longrightarrow\; \theta^{[W]}(\mathbf{x}) = 0.$$
That is, the Whittle policy $\theta^{[W]}$ chooses to balk at any state $\mathbf{x} \in S_{\theta^*}$ where $\theta^*$ chooses to balk.
The lemma can be proved using dynamic programming recursions, or alternatively via a sample path argument. The details can be found in Appendix C. Essentially, one can show that if balking is chosen at some state $\mathbf{x} \in S_{\theta^*}$ by the optimal policy $\theta^*$, then balking would also be chosen by an optimal threshold policy in a single-facility problem involving any of the facilities $i \in \{1, 2, \ldots, N\}$ at the state with $x_i$ customers present (where $x_i$ is the $i$th component of the state $\mathbf{x}$ in the $N$-facility problem). Since the Whittle policy $\theta^{[W]}$ makes decisions by considering each of the $N$ facilities operating in isolation, this is sufficient to establish the result.
Our next theorem states that the Whittle index policy $\theta^{[W]}$ is conservative in comparison to an optimal stationary policy $\theta^*$.
Theorem 1 (Conservativity of the Whittle policy) For any optimal stationary policy $\theta^*$, we have
$$S^\circ \subseteq S_W \subseteq S_{\theta^*}.$$
Proof The containment property $S_{\theta^*} \subseteq \tilde{S}$ is already known. It follows that there must exist some state $\mathbf{x} \in S_{\theta^*}$ at which $\theta^*$ chooses to balk; otherwise, an unbroken sequence of customer arrivals (without any service completions) would cause the process to pass outside $\tilde{S}$ under $\theta^*$. Let $\mathbf{z} \in S_{\theta^*}$ be a state at which $\theta^*$ chooses to balk, and let
$$S_{\mathbf{z}} = \{\mathbf{x} \in \tilde{S} : x_i \le z_i \text{ for all } i \in \{1, 2, \ldots, N\}\}.$$
That is, $S_{\mathbf{z}}$ is the set of states in $\tilde{S}$ which satisfy the componentwise inequality $\mathbf{x} \le \mathbf{z}$. Since $\mathbf{z} \in S_{\theta^*}$, it follows that all states in $S_{\mathbf{z}}$ are also included in $S_{\theta^*}$, since they are accessible from $\mathbf{z}$ via service completions. Hence, $S_{\mathbf{z}} \subseteq S_{\theta^*}$. On the other hand, since balking is chosen by $\theta^*$ at $\mathbf{z}$, it follows from Lemma 2 that balking is also chosen at $\mathbf{z}$ by the Whittle policy $\theta^{[W]}$. By definition of the Whittle policy, this implies that $W_i(z_i) \le 0$ for all $i \in \{1, 2, \ldots, N\}$. Therefore it is impossible for any state $\mathbf{x} \notin S_{\mathbf{z}}$ to be accessible from state $\mathbf{0}$ (the empty system state) under the Whittle policy, since this would require joining some facility $i \in \{1, 2, \ldots, N\}$ to be chosen at a state $\mathbf{y} \in S_{\mathbf{z}}$ with $y_i = z_i$ and hence $W_i(y_i) \le 0$. It follows that $S_W \subseteq S_{\mathbf{z}} \subseteq S_{\theta^*}$.
To complete the proof, it remains only to show that $S^\circ \subseteq S_W$. In Sect. 3 it was shown that, for any facility $i \in \{1, 2, \ldots, N\}$ and any state $x < c_i$, we have $W_i(x) = \alpha_i - \beta_i/\mu_i$ (see (14)). Recall that our model assumes $\alpha_i - \beta_i/\mu_i > 0$; otherwise, facility $i$ would be redundant. Hence, $\theta^{[W]}$ cannot choose to balk at any state with $x_i < c_i$ for some $i \in \{1, 2, \ldots, N\}$. Since $S_W$ is contained in $S_{\theta^*}$ (and hence finite), it then follows that there exists a state $\mathbf{x}$ with $x_i \ge c_i$ for all $i \in \{1, 2, \ldots, N\}$ which is positive recurrent under $\theta^{[W]}$ (indeed, such a state must be accessible from $\mathbf{0}$ via an unbroken sequence of customer arrivals). Hence, all states in $S^\circ$ are also positive recurrent under $\theta^{[W]}$.
Next, we turn our attention to the asymptotic properties of the Whittle index policy as $\lambda$ becomes either very small or very large. Since $\theta^{[W]}$ is a heuristic policy, its optimality cannot be proved in general, but the next theorem establishes that the Whittle policy achieves (asymptotic) optimality in a light-traffic limit, and also in a heavy-traffic limit.

Theorem 2 (Optimality of the Whittle policy in light-traffic and heavy-traffic limits)
Let $g^{[W]}(\lambda)$ be the long-run average reward attained by the Whittle policy $\theta^{[W]}$ given a demand rate $\lambda > 0$, and let $g^*(\lambda)$ be the corresponding value under an optimal policy. Then:
1. $\theta^{[W]}$ is asymptotically optimal in a light-traffic limit, i.e. as $\lambda \to 0$.
2. $\theta^{[W]}$ is optimal in a heavy-traffic limit, i.e. as $\lambda \to \infty$.

Proof of Theorem 2 can be found in Appendix D. In the light-traffic case, it suffices to show that the Whittle policy makes optimal decisions at the state with no customers present, since the decisions chosen at other states essentially become unimportant in the limiting scenario. In the heavy-traffic case, the proof is accomplished by showing that the Whittle heuristic directs customers to balk if and only if all servers are busy at all facilities, and that this results in the system residing continuously in a state which maximizes the single-step reward function $r(\mathbf{x})$.
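The heavy-traffic argument relies on the all-servers-busy state maximizing the single-step reward. As a brief worked check, assuming the reward takes the form $r(\mathbf{x}) = \sum_i (\alpha_i \mu_i \min(x_i, c_i) - \beta_i x_i)$ implied by the reward and holding-cost rates in Sect. 2 (with $\Delta = 1$), the effect of adding one customer at facility $i$ is:

```latex
r(\mathbf{x}^{i+}) - r(\mathbf{x}) =
\begin{cases}
\alpha_i \mu_i - \beta_i > 0, & x_i < c_i \quad (\text{since } \alpha_i > \beta_i/\mu_i), \\
-\beta_i < 0, & x_i \ge c_i,
\end{cases}
\qquad\Longrightarrow\qquad
\max_{\mathbf{x} \in S} r(\mathbf{x}) = r(c_1, \ldots, c_N) = \sum_{i=1}^{N} c_i \left(\alpha_i \mu_i - \beta_i\right).
```

Since $r$ is separable across facilities, the maximizer is the state with exactly $c_i$ customers at every facility $i$.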
Theorem 2 relies upon the fact that optimal policies become quite simplistic in the limiting cases as λ → 0 and λ → ∞. For general values of λ, however, optimal policies can be quite intricate. The remaining results in this section identify structural properties of the Whittle policy θ [W ] which do not necessarily hold under an optimal policy.
First, we consider monotonicity. We will generalize our previous notation and use S θ to denote the set of states which are positive recurrent under an arbitrary stationary policy θ .
Theorem 3 (Monotonicity) Suppose N ≥ 2, and let Θ [M] denote the class of all stationary policies θ which satisfy the three monotonicity properties (a)-(c). Then:

1. The Whittle policy θ [W ] belongs to Θ [M] ,
2. If N = 2 and c 1 = c 2 = 1 then there exists an optimal policy in Θ [M] ,
3. In general, Θ [M] is not guaranteed to include an optimal policy.
Proof It is trivial to show that the Whittle policy θ [W ] possesses the monotonicity properties (a)-(c), since this is a direct consequence of the index-based nature of the policy. We therefore begin with Statement 2, which relates to a special case of our model with only two facilities and a single server at each. In general, any optimal policy must be associated with a constant g * and a function h satisfying the well-known average reward optimality equations (19). Importantly, the function h satisfying (19) is unique up to an additive constant (see Puterman 1994). The proof of Statement 2 depends on showing that, in the special case under consideration, h satisfies three properties: concavity, submodularity and diagonal submissiveness. These properties can be established using inductive arguments based on value iteration, and the existence of an optimal policy in Θ [M] then follows. For full details, please refer to Appendix E.
We can use value iteration to confirm the existence of a unique optimal stationary policy θ * for this system. The positive recurrent state space under this policy is S θ * = {(x 1 , x 2 ) ∈ ℕ₀² : x 1 ≤ 2 and x 2 ≤ 2}, i.e. it includes 9 states. However, the decision of θ * at state (0, 0) is to join Facility 2, whereas the decision at (1, 0) is to join Facility 1. This contravenes monotonicity property (b) stated in the theorem, so the proof is complete.
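A value iteration check of this kind can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes single-server facilities, a reward rate of the form r(x) = Σ i (α i μ i min(x i , 1) − β i x i ), a uniformized chain with step length 1/(λ + Σ i μ i ), and a truncation level `cap` on each queue. The parameter values in the example call are arbitrary and are not those of the counter-example.

```python
from itertools import product

def relative_value_iteration(lam, mu, alpha, beta, cap=6, tol=1e-9, max_iter=20_000):
    """Relative value iteration for single-server facilities on a truncated state space."""
    n = len(mu)
    delta = 1.0 / (lam + sum(mu))  # uniformization step length
    states = list(product(range(cap + 1), repeat=n))
    zero = (0,) * n
    h = {x: 0.0 for x in states}

    def reward(x):
        # Assumed reward rate: alpha_i * mu_i per busy server minus beta_i per customer.
        return sum(a * m * min(xi, 1) - b * xi
                   for a, m, b, xi in zip(alpha, mu, beta, x))

    def shift(x, i, step):
        return x[:i] + (x[i] + step,) + x[i + 1:]

    def best_action(x, values):
        # Balking (action 0) is kept unless some join is strictly better.
        action, value = 0, values[x]
        for i in range(n):
            if x[i] < cap and values[shift(x, i, 1)] > value:
                action, value = i + 1, values[shift(x, i, 1)]
        return action, value

    gain = 0.0
    for _ in range(max_iter):
        new = {}
        for x in states:
            total = reward(x) * delta + lam * delta * best_action(x, h)[1]
            for i in range(n):
                # An idle server contributes a dummy self-transition.
                total += mu[i] * delta * h[shift(x, i, -1) if x[i] > 0 else x]
            new[x] = total
        offset = new[zero]
        gain = offset / delta  # valid because h(zero) is pinned at 0
        new = {x: v - offset for x, v in new.items()}
        change = max(abs(new[x] - h[x]) for x in states)
        h = new
        if change < tol:
            break
    policy = {x: best_action(x, h)[0] for x in states}
    return gain, h, policy

# Arbitrary illustrative parameters (not taken from the paper).
gain, h, policy = relative_value_iteration(1.0, (1.0, 1.0), (3.0, 3.0), (1.0, 1.0))
print(round(gain, 4), policy[(0, 0)], policy[(1, 0)])
```

Inspecting `policy` at neighbouring states such as (0, 0) and (1, 0) is how a violation of a monotonicity property would be detected in practice.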
Fellow researchers may be interested to know that we have been unable to find either a proof or a counter-example to show whether or not monotonicity property (a) is guaranteed to hold under an optimal policy θ * (indeed, this property may be meaningless if it can be shown that θ * (x) = 0 ⇒ x i+ ∉ S θ * ). In addition, we have been unable to find either a proof or a counter-example to show whether or not an optimal policy satisfying all three properties (a)-(c) is guaranteed to exist if N ≥ 3 and c i = 1 for all i ∈ {1, 2, . . . , N }.
Next, we consider how the size of the positive recurrent state space changes as the demand rate λ varies. Intuitively, one might suppose that strong-performing policies should become more conservative as λ increases. The next theorem shows that this is indeed the case for the Whittle policy θ [W ] , but not for optimal policies in general.
Then, given any two demand rates λ 1 , λ 2 with λ 1 > λ 2 > 0:

Proof Since the Whittle index policy is derived from the properties of optimal admission policies for single-facility problems, Statement 1 is actually implied by Statement 2. However, Statement 2 is somewhat non-trivial to prove. We have used a dynamic programming argument to establish this result, and the details can be found in Appendix F.
Next, we examine a property related to the distribution of balking states under a stationary policy. It is natural to suppose that, under a strong-performing policy θ , the positive recurrent state space S θ should take the form of a cuboid in N dimensions. Indeed, if S θ is finite, then the cuboid property is implied by the existence of a unique state in S θ at which balking is chosen. The next result states that the Whittle policy θ [W ] must have a unique recurrent balking state, but this is not necessarily true for an optimal stationary policy.
Theorem 5 (Unique recurrent balking state) Suppose N ≥ 2, and let Θ [B] denote the set of all stationary policies θ for which the set of positive recurrent states S θ includes a unique state at which balking is chosen. Then:

1. The Whittle policy θ [W ] belongs to Θ [B] ,
2. If N = 2 and c 1 = c 2 = 1 then there exists an optimal policy in Θ [B] ,
3. In general, Θ [B] is not guaranteed to include an optimal policy.

Proof The proof of Statement 1 is trivial, since the only state x ∈ S W at which θ [W ] chooses to balk is the state with x i = min{x ∈ N 0 : W i (x) ≤ 0} for all i ∈ {1, 2, . . . , N }. The proof of Statement 2 relies upon the properties of concavity, submodularity and diagonal submissiveness for the function h satisfying Eq. (19) in a system with two single-server facilities. These properties were established (for the N = 2, c 1 = c 2 = 1 case) in the proof of Theorem 3. Details of how these properties imply a unique balking state can be found in Ha (1997) (Theorem 3). Ha's results are given in the context of a make-to-stock production system with two products and a single server. He defines a 'base stock policy' as a policy for which production is stopped if and only if all products have inventory at or above their specified base stock levels; this is analogous to a policy with a unique balking state in our model. We also note that Ha considers a minimization problem as opposed to a maximization problem, and the value function in his model has the converse properties of convexity, supermodularity and diagonal dominance. However, the arguments in his proof can be translated to our setting in an obvious way.
For this system, the unique optimal policy θ * found using value iteration chooses to balk at the states (13, 10, 14) and (12, 11, 14), both of which are positive recurrent under θ * . Thus, the process operating under θ * is able to access two different 'balking states', implying that θ * ∉ Θ [B] .
Our final result in this section concerns a special case in which all of the N facilities share the same parameters (c i , μ i , α i and β i ). We refer to this as the 'homogeneous facilities' case. Like the previous three results, it highlights an intuitively 'sensible' structural property which is possessed by the Whittle policy θ [W ] , but not by optimal policies in general.

In general, Θ [C] is not guaranteed to include an optimal policy.

Table 1 An optimal decision-making structure for a system with homogeneous facilities

          x 2 = 0   x 2 = 1   x 2 = 2   x 2 = 3
x 1 = 0   1 or 2    1         1         1
x 1 = 1   2         1 or 2    1         1
x 1 = 2   2         2         1 or 2    0
Proof As in Theorems 3 and 5, the proof of Statement 1 is trivial, since it is a direct consequence of the index-based nature of the Whittle index policy. We provide a counter-example to establish Statement 2. Consider a system with demand rate λ = 15 and two single-server facilities which share an identical set of parameters as follows: With these parameters, it transpires that the set of positive recurrent states S θ * associated with any optimal stationary policy must be either of dimension 3 × 4 or 4 × 3. The optimal decision-making structure is shown in Table 1.
If we restrict attention to stationary policies, then it can be seen from Table 1 that there must be a unique balking state at either (2, 3) or (3, 2). Therefore an optimal stationary policy will allow one of the two facilities to have up to three customers present, but not both. The state (3, 3) is not accessible from 0 (and hence is not positive recurrent) under any optimal stationary policy. Since Table 1 accounts for all 8 optimal stationary policies in this system, we conclude that none of these are included in Θ [C] .
The results in this section have shown that the Whittle policy θ [W ] belongs to a class of policies which possess certain intuitively 'sensible' structural properties. However, the counter-examples have shown that an optimal policy need not necessarily be included in the same class, and therefore θ [W ] must be sub-optimal in some cases. In Sect. 6 we present the results of numerical experiments to evaluate the performance of the Whittle policy. These numerical results include comparisons with an alternative heuristic policy, which is developed in the next section.

An alternative heuristic policy
In this section we describe an alternative heuristic policy which is derived from the application of a single step of policy iteration to a 'static routing' or 'Bernoulli splitting' policy. Similar approaches have been used for other routing problems in the literature; see Krishnan (1990), Ansell et al. (2003b) and Argon et al. (2009) and references therein. The heuristic shares some similarities with the Whittle heuristic, in the sense that it requires the calculation of indices for the N individual facilities; however, the indices themselves are calculated in a completely different way from those derived in Sect. 3.
To begin, consider a randomized policy under which routing decisions are made according to a fixed probability distribution {σ a }, a ∈ {0, 1, . . . , N }, where a belongs to the same action set A described in Sect. 2; hence, $\sum_{a=0}^{N} \sigma_a = 1$. We refer to this type of policy as a static policy, since it does not have the ability to make decisions dynamically according to the system state. We will commit a slight abuse of notation and represent an arbitrary static policy by a vector Λ = (λ 1 , . . . , λ N ), where λ i = λσ i is the arrival rate for facility i (i ∈ {1, . . . , N }) and λ 0 = λσ 0 is the rate at which customers balk. We can then write the expected long-run average reward under this policy as

$$ g_\Lambda = \sum_{i=1}^{N} \bigl( \lambda_i \alpha_i - \beta_i L_i(\lambda_i) \bigr), $$

where L i (λ i ) is the expected number of customers present at facility i, given that arrivals occur according to a Poisson process with rate λ i (here we are making use of the well-known 'Poisson splitting' property). It should be noted that L i (λ i ) is finite if and only if λ i < c i μ i , so we will define g Λ = −∞ for any policy Λ with λ i ≥ c i μ i for at least one facility i. Known results for M/M/c queues imply that L i (·) is a strictly convex function (see Grassmann 1983; Lee and Cohen 1983); hence, g Λ is strictly concave and there must be a unique policy Λ which maximizes g Λ over all static policies.

We will use Λ * := (λ * 1 , . . . , λ * N ) to denote the unique optimal static policy. Here, 'optimal' means 'optimal among all static policies', not 'optimal over all policies'. In fact, it is easy to show that all static routing policies are sub-optimal if non-static policies which make state-dependent routing decisions are included as candidates. Nevertheless, it transpires that a strong-performing (heuristic) dynamic routing policy can be obtained by applying a single step of DP-style policy iteration to the optimal static policy Λ * .
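The reward of a static policy can be evaluated directly from standard M/M/c formulas. The sketch below is illustrative only: it assumes that the static reward takes the form g Λ = Σ i (λ i α i − β i L i (λ i )), computes L i via the Erlang C formula, and represents each facility by a hypothetical tuple `(c, mu, alpha, beta)`. The parameter values are arbitrary.

```python
from math import factorial

def mmc_mean_in_system(lam, c, mu):
    """Mean number of customers in an M/M/c system; infinite if unstable."""
    if lam >= c * mu:
        return float("inf")
    if lam == 0:
        return 0.0
    a = lam / mu    # offered load
    rho = a / c     # utilization per server
    tail = a**c / (factorial(c) * (1 - rho))
    p0 = 1.0 / (sum(a**k / factorial(k) for k in range(c)) + tail)
    mean_queue = p0 * tail * rho / (1 - rho)
    return mean_queue + a

def static_reward(rates, facilities):
    """Average reward of a static policy; -inf if any facility is overloaded."""
    total = 0.0
    for lam_i, (c, mu, alpha, beta) in zip(rates, facilities):
        if lam_i >= c * mu:
            return float("-inf")
        total += lam_i * alpha - beta * mmc_mean_in_system(lam_i, c, mu)
    return total

# Arbitrary illustrative facilities: (servers, service rate, reward, holding cost).
facilities = [(1, 2.0, 5.0, 1.0), (2, 1.0, 4.0, 1.0)]
print(static_reward((1.0, 1.0), facilities))
```

Because g Λ is strictly concave on the stable region, any standard convex optimization routine applied to `static_reward` would locate the optimal static policy Λ * .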
To assist our development, let us define V (n) (x, Λ * ) as the expected total reward over n discrete time steps given that policy Λ * is followed and the initial state is x ∈ S. Note that, in our uniformized MDP described in Sect. 2, λ * i can be interpreted as the probability that a customer arrives and is sent to facility i at an arbitrary discrete time step under policy Λ * . Under the usual paradigm of policy iteration, we aim to choose an action a under state x which maximizes the improvement in the long-run expected total net reward that would result from choosing action a under state x at an arbitrary time step and then following the optimal static policy Λ * at all time steps thereafter, as opposed to simply following the policy Λ * at all times. Given that the implementation of policy Λ * results in individual facilities operating independently with their own Poisson arrival rates, it will be useful to write

$$ V^{(n)}(x, \Lambda^*) = \sum_{i=1}^{N} V_i^{(n)}(x_i, \lambda_i^*), $$

where V (n) i (x, λ * i ) is an expected finite-stage reward for facility i only, given x customers initially present and a Poisson demand rate λ * i . We will also define, for each facility i and state x ∈ N 0 , the relative value

$$ h_i(x, \lambda_i^*) := \lim_{n \to \infty} \bigl( V_i^{(n)}(x, \lambda_i^*) - V_i^{(n)}(0, \lambda_i^*) \bigr) $$

and an associated index D i (x, λ * i ). Then, after some manipulations using (21)-(24), it can be shown that the best action under state x depends on x only through the values D i (x i , λ * i ). Hence, in order to obtain a dynamic routing policy via the application of a policy iteration step to an optimal static policy, we should make decisions in such a way that customers who arrive under a given state x = (x 1 , . . . , x N ) are directed to join the facility i which maximizes D i (x i , λ * i ) if this value is positive; otherwise, they should balk.
It can be seen from (23) that h i (x i , λ * i ) is equivalent to the well-known 'relative value function' which appears in the optimality equations and policy evaluation equations for average reward MDPs (see Puterman 1994). In our context, h i (x i , λ * i ) applies to facility i only and we can interpret the demand rate λ * i as the 'policy' for this facility. We have also defined state zero as the 'reference state' which the other states' values are compared against. The policy evaluation equations for facility i can be written

$$ g_i(\lambda_i^*) + h_i(x, \lambda_i^*) = r_i(x) + \sum_{y \in \mathbb{N}_0} p_i(x, y, \lambda_i^*)\, h_i(y, \lambda_i^*), \qquad x \in \mathbb{N}_0, \tag{26} $$

where r i (x) and p i (x, y, λ * i ) are the obvious single-facility analogues of the rewards and transition probabilities defined in Sect. 2 and g i (λ * i ) is the long-run average reward for facility i. We can then calculate the D i (x, λ * i ) values by using Eq. (26). Before proceeding, we note that if λ * i = 0 for a particular facility i then h i (x, λ * i ) and D i (x, λ * i ) are trivially equal to zero for all x ∈ N 0 , so we will only consider facilities i for which λ * i > 0.

Setting x = 0 in Eq. (26), and noting that r i (0) = 0 and h i (0, λ * i ) = 0, gives the first index value. For integers x ∈ {1, . . . , c i − 1}, simple manipulations of Eq. (26) yield the recurrence relationship (27), and recursive substitutions in (27) then give the expression (28) for x ∈ {0, 1, . . . , c i − 1}. Similarly, for integers x ≥ c i , Eq. (26) yields the recurrence relationship (29). By using (28) and (29) and applying a simple inductive argument, one can then obtain the expression (30) for x ≥ c i .

Let θ [B] denote the 'Bernoulli improvement' heuristic which is obtained by applying a step of policy iteration to the optimal static policy Λ * . The conclusion of this section is that θ [B] chooses actions as follows:

$$ \theta^{[B]}(x) \in \begin{cases} \arg\max_{i \in \{1, \ldots, N\}} D_i(x_i, \lambda_i^*), & \text{if } \max_{i \in \{1, \ldots, N\}} D_i(x_i, \lambda_i^*) > 0, \\ \{0\}, & \text{otherwise,} \end{cases} $$

where D i (x i , λ * i ) is defined in (28) (for x i ∈ {0, 1, . . . , c i − 1}) and (30) (for x i ≥ c i ).
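The index calculation and the resulting decision rule can be sketched numerically. The code below is a hedged illustration, not a transcription of (27)-(30): it assumes that D i (x, λ * i ) = h i (x + 1, λ * i ) − h i (x, λ * i ), that the single-facility reward rate is r i (x) = α i μ i min(x, c i ) − β i x, and that g i (λ * i ) = λ * i α i − β i L i (λ * i ). Under these assumptions, the policy evaluation equations for a birth-death chain give the forward recursion D(x) = (g − r(x) + min(x, c) μ D(x − 1))/λ. The arrival rates used in the example are arbitrary rather than optimal static rates.

```python
from math import factorial

def mmc_mean_in_system(lam, c, mu):
    """Mean number of customers in a stable M/M/c system (Erlang C)."""
    a = lam / mu
    rho = a / c
    tail = a**c / (factorial(c) * (1 - rho))
    p0 = 1.0 / (sum(a**k / factorial(k) for k in range(c)) + tail)
    return p0 * tail * rho / (1 - rho) + a

def improvement_indices(lam, c, mu, alpha, beta, x_max):
    """Assumed indices D(0), ..., D(x_max) for one facility with 0 < lam < c * mu."""
    gain = lam * alpha - beta * mmc_mean_in_system(lam, c, mu)
    indices, previous = [], 0.0
    for x in range(x_max + 1):
        busy = min(x, c)
        reward = alpha * mu * busy - beta * x
        # Forward recursion; rounding errors grow with x, so keep x_max moderate.
        previous = (gain - reward + busy * mu * previous) / lam
        indices.append(previous)
    return indices

def bernoulli_improvement_action(state, index_tables):
    """Join the facility with the largest positive index; otherwise balk (0)."""
    best_action, best_value = 0, 0.0
    for i, x in enumerate(state):
        if index_tables[i][x] > best_value:
            best_action, best_value = i + 1, index_tables[i][x]
    return best_action

# Arbitrary illustrative facilities and static arrival rates (assumed values).
index_tables = [
    improvement_indices(1.0, 1, 2.0, 5.0, 1.0, 6),
    improvement_indices(0.5, 2, 1.0, 4.0, 1.0, 6),
]
print(index_tables[0], bernoulli_improvement_action((4, 0), index_tables))
```

Under the stated assumptions, the single-server case reproduces the closed form α − β(x + 1)/(μ − λ), which is smaller than the individual customer's net reward α − β(x + 1)/μ and is therefore consistent with the heuristic being more conservative than a selfish rule.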
We note that, from a practical point of view, implementation of θ [B] requires the initial solution of a convex optimization problem in order to obtain the optimal static policy Λ * . The values g i (λ * i ) (required for the computation of D i (x, λ * i )) are then readily obtained as functions of the λ * i .
For the special case c i = 1, we note that (28) and (30) reduce to This implies that θ [B] is more conservative than the selfish policyθ in a system with single servers at all facilities, sinceθ prefers joining facility i to balking at state The next theorem states that the Bernoulli improvement policy possesses the property of asymptotic light-traffic optimality which (according to Theorem 2) is also a feature of the Whittle policy θ [W ] .
Theorem 7 (Optimality of the Bernoulli improvement policy in a light-traffic limit) Let g Λ * (λ) and g [B] (λ) be the long-run average rewards attained by the optimal static policy Λ * and the Bernoulli improvement policy θ [B] respectively given a demand rate λ > 0, and let g * (λ) be the corresponding value under an optimal policy. Then Λ * and θ [B] are both asymptotically optimal in a light-traffic limit. That is:

$$ \lim_{\lambda \to 0} \frac{g^*(\lambda) - g_{\Lambda^*}(\lambda)}{g^*(\lambda)} = \lim_{\lambda \to 0} \frac{g^*(\lambda) - g^{[B]}(\lambda)}{g^*(\lambda)} = 0. $$

On the other hand, it is easy to find counter-examples to show that θ [B] does not possess the property of heavy-traffic optimality described (in the context of the Whittle policy θ [W ] ) in Theorem 2. In Appendix G we have provided a proof of Theorem 7 and also a counter-example to show the lack of heavy-traffic optimality for θ [B] .

Numerical results
In this section we report the results of a series of experiments involving more than 37,000 randomly-generated sets of system parameters. In order to evaluate the exact sub-optimality of a heuristic policy, it is necessary to evaluate the expected long-run average reward earned by the relevant policy and compare this with the optimal value g * associated with an average reward optimal policy. Usually, one would wish to carry out these tasks using dynamic programming algorithms, but this is only practical if the finite state space S̃ is of relatively modest size. Of course, the Whittle policy described in Sect. 3 can easily be applied to systems in which S̃ is extremely large, but it is generally not feasible to evaluate the optimal value g * in such systems. Therefore the only comparisons of interest that can be made in 'large' systems are between the Whittle policy (whose performance must be approximated using simulation) and alternative heuristics such as the selfish policy θ̃ and the Bernoulli improvement policy θ [B] described in Sects. 2 and 5 respectively. As such, this section is divided into two subsections:

- In Sect. 6.1, systems of relatively modest size are considered. These are systems in which the size of S̃ permits the efficient computation of the optimal average reward g * using DP algorithms, and also enables similar evaluations of the average rewards earned by the Whittle policy θ [W ] , the Bernoulli improvement policy θ [B] and the selfish policy θ̃.
- In Sect. 6.2, 'large' systems are considered. These are systems in which the exact computation of g * is assumed to be infeasible, and the average rewards earned by θ [W ] are compared with those associated with alternative heuristic policies via simulation experiments.

For purposes of distinction, a 'modest-sized' system is defined in this section as a system in which 2 ≤ N ≤ 4 and the cardinality of S̃ is between 100 and 100,000. Although it is certainly possible to apply DP algorithms to systems of greater size than this, it is desirable to impose a relatively strict restriction on |S̃| in order to allow a large number of experiments to be carried out in a reasonable amount of time.

'Modest-sized' systems with 2 ≤ N ≤ 4
We conducted a series of numerical experiments involving 32,934 randomly-generated sets of system parameters. Details of the methods used to generate the parameters can be found in Appendix H. Table 2 shows 95% confidence intervals for the percentage suboptimality values recorded for each of the three heuristic policies θ [W ] , θ [B] and θ̃ (columns 3-5). The first row shows summary results for all 32,934 systems, and the next three rows show results for particular values of N . Both θ [W ] and θ [B] are consistently strong, with θ [W ] tending to be slightly stronger overall (within 1% of optimality on average). Indeed, θ [W ] was the best-performing of the three heuristics in about 65% of experiments. Noticeably, all of the heuristics tend to do better in the N = 2 case than in the N = 3 and N = 4 cases. In Sect. 6.2, we will present results for larger values of N .

Let

$$ \rho := \lambda \Bigl( \sum_{i=1}^{N} c_i \mu_i \Bigr)^{-1} $$

be a measure of the relative traffic intensity for a particular system. In Table 3 we have presented results for θ [W ] , θ [B] and θ̃ in a similar format to that of Table 2, except with results categorized according to ρ rather than N .

Our results indicate that θ [W ] tends to be strongest for very small (i.e. close to zero) or very large (i.e. significantly larger than 1) values of ρ, which is unsurprising, since it is known to be asymptotically optimal in light-traffic and heavy-traffic limits due to Theorem 2. Of greater interest, perhaps, are the comparisons with the alternative heuristic policies θ [B] and θ̃, and how these are affected by the value of ρ. Table 3 shows that the suboptimality of θ [W ] is greatest when ρ is close to 1, although it remains within 1.6% of optimality (on average) in such cases. The Bernoulli improvement policy θ [B] may be slightly stronger than θ [W ] when ρ is close to 1, but (unlike θ [W ] ) it performs worse as ρ increases beyond 1. This is consistent with the lack of heavy-traffic optimality for θ [B] noted after Theorem 7. The selfish policy θ̃ performs well when ρ is very small, but is very poor in other cases.
We also investigated the effect of heterogeneity between service facilities on our results. Recall that a particular facility i in our model has four parameters: c i , μ i , α i and β i . For each of our 32,934 randomly-generated parameter sets we calculated the coefficient of variation (i.e. the ratio of the standard deviation to the mean) of the values c 1 , . . . , c N in order to obtain a measure, denoted by φ c , of the variation between c i values. We then repeated this process for the other parameter types in order to obtain the analogous statistics φ μ , φ α and φ β , and calculated the average coefficient of variation as φ̄ := (φ c + φ μ + φ α + φ β )/4. Table 4 shows comparisons between the performances and suboptimality values of the three heuristic policies θ [W ] , θ [B] and θ̃, with results categorized according to the value of φ̄. Table 4 indicates that θ [W ] remains very strong for all values of φ̄. More interestingly, however, there is a clear trend for the other two heuristics (θ [B] and θ̃) to perform worse as φ̄ increases.

As in Sect. 6.1, our interest lies mainly in assessing the strength of the Whittle policy θ [W ] . Given that we cannot evaluate the exact suboptimality of θ [W ] in larger systems, we decided to compensate by expanding our set of alternative heuristic policies to be used for comparison purposes. The results in Sect. 6.1 have already shown that the selfish policy θ̃ performs poorly in many cases, especially if the demand rate is high. However, we can derive other (possibly stronger) heuristics by considering a simple generalization of the selfish decision-making rule. For a given state x ∈ S, action a ∈ {0, 1, . . . , N } and parameter p ∈ [0, 1], let w a (x, p) denote the expected net reward for an individual customer defined in (4), except that the rewards α i for the various facilities are scaled by a multiplier p.
Let θ̃ p denote the policy which operates in such a way that the action chosen under state x ∈ S is the action which maximizes w a (x, p), with ties broken arbitrarily (except that a = 0 is chosen only if w i (x, p) < 0 for all i ∈ {1, 2, . . . , N }). Also, let g̃ p denote the average reward under policy θ̃ p . If p = 1 then θ̃ p is equivalent to the usual selfish policy, θ̃. However, the value of p which maximizes g̃ p is likely to be smaller than 1, especially if the demand rate is high.
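The generalized selfish rule is simple to implement once a form for w a (x, p) is fixed. The sketch below assumes (since Eq. (4) is not reproduced here) that the individual net reward for joining facility i is pα i − β i /μ i when a server is free and pα i − β i (x i + 1)/(c i μ i ) otherwise, which is the standard expected sojourn cost for an M/M/c queue, and that w 0 (x, p) = 0. Facilities are represented by hypothetical tuples `(c, mu, alpha, beta)` with arbitrary values.

```python
def individual_net_reward(x, facility, p):
    """Assumed net reward for one customer joining a facility with x customers present."""
    c, mu, alpha, beta = facility
    expected_sojourn = 1.0 / mu if x < c else (x + 1) / (c * mu)
    return p * alpha - beta * expected_sojourn

def generalized_selfish_action(state, facilities, p):
    """Join the facility with the largest net reward; balk only if all are negative."""
    values = [individual_net_reward(x, f, p) for x, f in zip(state, facilities)]
    best = max(range(len(values)), key=values.__getitem__)
    return 0 if values[best] < 0 else best + 1

# Candidate multipliers p = 0.01, 0.02, ..., 1.00.
D = [k / 100 for k in range(1, 101)]

# Arbitrary illustrative facilities: (servers, service rate, reward, holding cost).
facilities = [(1, 2.0, 5.0, 1.0), (2, 1.0, 4.0, 1.0)]
for p in (1.0, 0.5):
    print(p, generalized_selfish_action((5, 4), facilities, p))
```

Reducing p lowers every joining value by the same proportion of the reward, so the policy balks at smaller queue lengths without changing how waiting costs are compared across facilities.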
Let D be a discretization of the interval [0, 1] and let us define g̃ D by

$$ \tilde{g}_D := \max_{p \in D} \tilde{g}_p, $$

i.e. g̃ D is the maximum average reward attained over all possible policies θ̃ p , subject to the constraint p ∈ D. We will also use θ̃ D to denote a policy in {θ̃ p } p∈D which attains the average reward g̃ D . It should be noted that θ̃ D is not an admissible policy itself; instead, it represents the strongest-performing of a set of policies for a particular system. We intend to use g̃ D as a benchmark in order to evaluate the strength of θ [W ] in larger systems.
In each of our 4660 randomly-generated scenarios we simulated the performances of all 100 policies in the set {θ̃ p } p∈D , with the discretized set D given by D = {0.01, 0.02, . . . , 0.99, 1}, and estimated g̃ D by taking the maximum of these. We also simulated the performances of θ [W ] and θ [B] using the same random number seed used to simulate the 100 policies in {θ̃ p } p∈D . We note here that simulating the performances of 102 different stationary policies is a computationally intensive task, and this is why we have considered fewer random scenarios in Sect. 6.2 than in Sect. 6.1. The implementation and simulation of the Whittle policy itself is extremely fast even in very large systems, and does not pose any computational difficulty.

Table 5 summarizes the performance of θ [W ] against θ [B] and θ̃ D in these 4460 experiments, with results categorized according to the value of N (the number of facilities), including the percentage of experiments in which the average reward under θ [W ] , g [W ] , was at least as great as g [B] (resp. g̃ D ). There is a general trend for θ [W ] to increase its advantages against θ [B] and θ̃ D as N increases. Also, by comparing Tables 2 and 5, we may observe that the relative strength of θ [W ] versus the other heuristics appears to have increased significantly in these 'large system' experiments.

Table 6 shows additional comparisons between the three heuristic policies with results categorized according to the value of $\rho = \lambda \bigl( \sum_{i=1}^{N} c_i \mu_i \bigr)^{-1}$. As in Sect. 6.1 (see Table 3), θ [W ] becomes stronger relative to the alternative heuristics as ρ increases beyond 1. The Bernoulli improvement policy, θ [B] , also tends to be stronger than θ̃ D (this can be seen by comparing columns 3 and 4, for example). Finally, Table 7 shows the results of our experiments categorized according to the value of φ̄, where φ̄ is defined the same way as in Sect. 6.1; i.e. it is the average of the coefficients of variation for the four parameter types (c i , μ i , α i , β i ).

As in Sect. 6.1, we observe that θ [W ] tends to increase its advantage over other heuristics when the heterogeneity between facilities is increased.
It seems clear from our results in Sect. 6.2 that, if we want to find an alternative heuristic policy which rivals the performance of the Whittle policy θ [W ] , it is not sufficient to simply modify the selfish decision rule so that rewards are given relatively less importance compared to expected waiting costs (and hence the system becomes less busy). Indeed, the Whittle policy is based on socially optimal (i.e. average reward optimal) decisions at individual facilities, and this enables it to make smarter decisions than the policies in the set {θ̃ p } p∈D . To illustrate this point, suppose we have two facilities i and j such that w i (x, p) = w j (x, p) under a particular state x ∈ S. Suppose also that the reward for service α i is substantially larger than α j , but the expected waiting costs at facility i are also larger (this may be due to a longer expected waiting time, for example). In this situation, the generalized selfish policy θ̃ p is unable to distinguish between facilities i and j, but one might imagine that joining facility j should be a better choice in the context of average reward maximization, since it has a smaller impact on future congestion levels in the system. The index (13) employed by the Whittle policy is better-suited to taking such considerations into account.

Table 7 95% confidence intervals for the percentage improvement given by θ [W ] against alternative heuristics (columns 3-4) and the percentage of experiments in which θ [W ] equalled or exceeded the performances of alternative heuristics (columns 5-6) for different values of φ̄ = (φ c + φ μ + φ α + φ β )/4

The randomly-generated parameter sets and results of the numerical experiments reported in this paper have been archived and are available at http://doi.org/10.5281/zenodo.3775332.

Conclusions
Theorem 2 in Shone et al. (2016) has established that it is theoretically possible to find an average reward optimal policy for the MDP formulated in Sect. 2 by truncating the state space S, and applying a dynamic programming algorithm to an MDP with the finite state space S̃. Unfortunately, the finite set S̃ might itself be very large in many problem instances, and for this reason it is necessary to look for heuristic approaches which can be relied upon to yield near-optimal policies in a short amount of time.
As discussed in the introduction, the Whittle index heuristic is now well-established in the field of stochastic dynamic programming and we have shown (Lemma 1) that the indexability property, which can be difficult to prove in other settings, holds in our problem. A key finding of our paper is that the positive recurrent state space under an optimal stationary policy is not only bounded above by the selfish state space S̃, but also bounded below by the 'Whittle state space' S W (Theorem 1). We have also proved certain structural properties of the Whittle policy, including its asymptotic optimality in light-traffic and heavy-traffic limits (Theorems 2-6). These results are useful since, in general, structural properties of optimal policies are difficult to prove for routing problems involving heterogeneous service facilities.
The empirical results in Sect. 6 have shown that the Whittle policy θ [W ] is very close to optimality in systems which are 'small enough' to allow the computation of an optimal policy. In larger systems, we have verified that it performs strongly against alternative heuristics, including the policy θ [B] obtained by applying one step of policy improvement to a 'Bernoulli splitting' policy. Notably, its superiority over other heuristics appears to increase as (i) the traffic intensity increases beyond 1; (ii) the number of facilities increases; (iii) the heterogeneity between service facilities increases.
The first of the above characteristics is implied by Theorem 2, but the reasons for the second and third characteristics are less obvious. Indeed, all of the heuristics that we have considered share some broad methodological similarities, in that they require the computation of indices for individual facilities-so it is not immediately clear why the Whittle policy's indices should be more robust than others with respect to dimensionality or heterogeneity. We intend to investigate this further in future work.
For any given set of system parameters, the indices which characterize the Whittle policy θ [W ] are calculated in a completely deterministic way. Thus, this heuristic does not rely on any iterative algorithm, nor does it involve any type of simulation or random sampling. One might regard the deterministic nature of this heuristic as both a strength and a weakness. On one hand, the simplicity makes it extremely easy to implement; on the other hand, if the heuristic is found to perform poorly in a particular system, then it is not necessarily easy to see how the decision-making indices might be adjusted in order to achieve closer proximity to an optimal policy. In future work, we intend to test the performance of the Whittle heuristic against policies obtained by approximate dynamic programming (ADP) methods, including those which have achieved popularity in the fields of neuro-dynamic programming and reinforcement learning (Bertsekas and Tsitsiklis 1996;Powell 2007;Sutton and Barto 1998). We also intend to use the Whittle policy in conjunction with ADP methods, by allowing it to act as a reference point within broader search algorithms.