Containment of socially optimal policies in multiple-facility Markovian queueing systems
Abstract
We consider a Markovian queueing system with N heterogeneous service facilities, each of which has multiple servers available, linear holding costs, a fixed value of service and a first-come-first-served queue discipline. Customers arriving in the system can be either rejected or sent to one of the N facilities. Two different types of control policies are considered, which we refer to as ‘selfishly optimal’ and ‘socially optimal’. We prove the equivalence of two different Markov Decision Process formulations, and then show that classical M/M/1 queue results from the early literature on behavioural queueing theory can be generalized to multiple dimensions in an elegant way. In particular, the state space of the continuous-time Markov process induced by a socially optimal policy is contained within that of the selfishly optimal policy. We also show that this result holds when customers are divided into an arbitrary number of heterogeneous classes, provided that the service rates remain non-discriminatory.
Keywords
queues with balking; Markov Decision Processes; equilibrium strategies; optimal strategies; dynamic programming
1. Introduction
One of the most persistent themes in the literature on behavioural queueing theory is the suboptimality of greedy or ‘selfish’ customer behaviour in the context of overall social welfare. In order to induce the most favourable scenario for society as a whole, customers are typically required to deviate in some way from the actions that they would choose if they were motivated only by their own interests. This principle has been observed in many of the classical queueing system models, including M/M/1, GI/M/1, GI/M/s and others (see, eg, Naor, 1969; Yechiali, 1971; Knudsen, 1972; Yechiali, 1972; Littlechild, 1974; Edelson and Hildebrand, 1975; Lippman and Stidham, 1977; Stidham, 1978). More recently, this theme has been explored in applications including queues with setup and closedown times (Sun et al, 2010), queues with server breakdowns and delayed repairs (Wang and Zhang, 2011), vacation queues with partial information (Guo and Li, 2013), queues with compartmented waiting space (Economou and Kanta, 2008) and routing in public services (Knight et al, 2012; Knight and Harper, 2013). More generally, the implications of selfish and social decision making have been studied in various applications of economics and computer science; Roughgarden’s (2005) monograph provides an overview of this work and poses some open problems.
The inspiration for our work is derived primarily from public service settings in which customers may receive service at any one of a number of different locations. For example, in a healthcare setting, patients requiring a particular procedure might choose between various different healthcare providers (or a choice might be made on their behalf by a central authority). In this context, the ith provider is able to treat up to c_{ i } patients at once, and any further arrivals are required to join a waiting list, or seek treatment elsewhere. A further application of this work involves the queueing process at immigration control at ports and/or airports. These queues are often centrally controlled by an officer aiming to ensure that congestion is reduced. Finally, computer data traffic provides yet another application of this work. When transferring packets of data over a network, situations arise in which the choice among available servers can have a major impact on the efficacy of the entire network.

In Section 2 we provide an MDP formulation of our queueing system and define all of the input parameters. We also offer an alternative formulation and show that it is equivalent.

In Section 3 we define ‘selfishly optimal’ and ‘socially optimal’ policies in more detail. We then show that our model satisfies certain conditions which imply the existence of a stationary socially optimal policy, and prove an important relationship between the structures of the selfishly and socially optimal policies.

In Section 4 we draw comparisons between the results of Section 3 and known results for systems of unobservable queues.

In Section 5 we show that the results of Section 3 hold when customers are divided into an arbitrary number of heterogeneous classes. These classes are heterogeneous with respect to demand rates, holding costs and service values, but not service rates.

Finally, in Section 6, we discuss the results of this paper and possible avenues for future research.
2. Model formulation
We consider a queueing system with N service facilities. Customers arrive from a single demand node according to a stationary Poisson process with demand rate λ>0. Let facility i (for i=1, 2, …, N) have c_{ i } identical service channels, a linear holding cost β_{ i }>0 per customer per unit time, and a fixed value of service (or fixed reward) α_{ i }>0. Service times at any server of facility i are assumed to be exponentially distributed with mean μ_{ i }^{−1}. We assume α_{ i }⩾β_{ i }/μ_{ i } for each facility i in order to avoid degenerate cases where the reward for service fails to compensate for the expected costs accrued during a service time. When a customer arrives, they can proceed to one of the N facilities or, alternatively, exit from the system without receiving service (referred to as balking). Thus, there are N+1 possible decisions that can be made upon a customer’s arrival. The decision chosen is assumed to be irrevocable; we do not allow reneging or jockeying between queues. The queue discipline at each facility is first-come-first-served (FCFS). A diagrammatic representation of the system is given in Figure 1.
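For reference, the input data of the model can be collected as follows. This is a minimal sketch; the class name `Facility` and its field names are ours, and the numerical values are those of Example 1 in Section 3.

```python
from dataclasses import dataclass

@dataclass
class Facility:
    """Parameters of one service facility (illustrative names)."""
    c: int        # number of identical service channels
    mu: float     # service rate of each channel (mean service time 1/mu)
    beta: float   # linear holding cost per customer per unit time
    alpha: float  # fixed value of service (reward)

    def is_nondegenerate(self) -> bool:
        # alpha_i >= beta_i / mu_i: the reward covers the expected
        # holding cost accrued during a single service time.
        return self.alpha >= self.beta / self.mu

# Parameters of the two facilities used in Example 1 of Section 3.
facilities = [Facility(c=2, mu=5.0, beta=3.0, alpha=1.0),
              Facility(c=2, mu=1.0, beta=3.0, alpha=3.0)]
lam = 12.0  # Poisson demand rate

# Uniformization constant: (lam + sum_i c_i * mu_i) * delta = 1.
delta = 1.0 / (lam + sum(f.c * f.mu for f in facilities))
```

For these parameters the uniformization constant is Δ=1/24, in agreement with Example 1.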
We define $S=\{x=(x_1,\ldots,x_N):x_i\in\mathbb{Z}_{\geq 0}\}$ to be the state space of our system, where x_{ i } (the ith component of the vector x) is the number of customers present (including those in service and those waiting in the queue) at facility i. It is assumed that the system state is always known and can be used to inform decision making.
No binding assumption is made in this paper as to whether decisions are made by individual customers themselves, or whether actions are chosen on their behalf by a central controller. It is natural to suppose that selfish decision making occurs in the former case, whereas socially optimal behaviour requires some form of central control, and the discussion in this paper will tend to be consistent with this viewpoint; however, the results in this paper remain valid under alternative perspectives (eg, socially optimal behaviour might arise from selfless cooperation between customers).
where $x^{i+}$ denotes the state $x+e_i$, $x^{i-}$ denotes $x-e_i$, and $e_i$ is the ith vector in the standard orthonormal basis of $\mathbb{R}^N$.
Lemma 1

For any stationary policy θ we have:
where $r$ and $\tilde{r}$ are defined as in (1) and (3) respectively. That is, the long-run average net reward under θ is the same under either reward formulation.
Proof

We assume the existence of a stationary distribution {π_{ θ }(x)}_{x∈S}, where π_{ θ }(x) is the steady-state probability of being in state x∈S under the stationary policy θ and ∑_{x∈S}π_{ θ }(x)=1. If no such distribution exists, then the system is unstable under θ and both quantities in (4) are infinite. Under steady-state conditions, we can write:
noting, as before, that $\tilde{r}$ (unlike $r$) has a dependence on the action θ(x) associated with x. For each x∈S, the steady-state probability π_{ θ }(x) is the same under either reward formulation since we are considering a fixed stationary policy. Our objective is to show:
We begin by partitioning the state space S into disjoint subsets. For each facility i∈{1, 2, …, N}, let S_{ i } denote the (possibly empty) set of states at which the action chosen under the policy θ is to join i. Then S_{ i }=S_{i−}∪S_{i+}, where:
We also let S_{0} denote the set of states at which the action chosen under θ is to balk. Now let $g_{\theta}(x,r)$ and $g_{\theta}(x,\tilde{r})$ be divided into ‘positive’ and ‘negative’ constituents in the following way:
By referring to (1) and (3), it can be checked that $g_{\theta}(x,r)=g_{\theta}^{+}(x,r)+g_{\theta}^{-}(x,r)$ and, similarly, $g_{\theta}(x,\tilde{r})=g_{\theta}^{+}(x,\tilde{r})+g_{\theta}^{-}(x,\tilde{r})$. It will be sufficient to show that the steady-state expectations of the positive constituents agree under the two reward formulations, and likewise for the negative constituents. Let S_{ i,k }⊆S_{ i } (for k=0, 1, 2, …) be the set of states at which the action chosen under θ is to join facility i, given that there are k customers present there. That is:
Using the balance equations for ergodic Markov chains under steady-state conditions (see, eg, Cinlar, 1975), we may assert that for every facility i and k⩾0, the total flow from all states x∈S with x_{ i }=k up to states with x_{ i }=k+1 must equal the total flow from states with x_{ i }=k+1 down to states with x_{ i }=k. Hence:
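Written out in the notation above, this ‘flow up equals flow down’ condition for facility $i$ at level $k$ takes the following form (in our notation, consistent with the definitions of $S_{i,k}$ and $\pi_\theta$; note that $\min(k+1,c_i)\mu_i$ is the service-completion rate at facility $i$ in any state with $x_i=k+1$):

```latex
\lambda \sum_{x \in S_{i,k}} \pi_{\theta}(x)
  \;=\; \min(k+1,\, c_i)\,\mu_i
  \sum_{\{x \in S \,:\, x_i = k+1\}} \pi_{\theta}(x)
```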
Summing over all $k\geq 0$, we obtain:
which holds for i∈{1, 2, …, N}. The physical interpretation of (6) is that, under steadystate conditions, the rate at which customers join facility i is equal to the rate at which service completions occur at i. Multiplying both sides of (6) by α_{ i } and summing over i∈{1, 2, …, N}, we have:
which states that the steady-state expectations of the positive constituents are equal, as required. It remains for us to show the corresponding equality for the negative constituents. We proceed as follows: in (5) (which holds for all $k\geq 0$ and i∈{1, 2, …, N}), put k=c_{ i } to obtain:
Suppose we multiply both sides of (7) by c_{ i }+1. Since the sum on the left-hand side is over states with x_{ i }=c_{ i } and the sum on the right-hand side is over states with x_{ i }=c_{ i }+1, this is equivalent to multiplying each summand on the left-hand side by x_{ i }+1 and each summand on the right-hand side by x_{ i }. In addition, multiplying both sides by β_{ i }/(c_{ i }μ_{ i }) yields:
We can write similar expressions with k=c_{ i }+1, c_{ i }+2 and so on. Recalling the definition of the sets S_{ i,k }, and summing over all k⩾c_{ i } in (8), we obtain:
Note also that multiplying both sides of (5) by β_{ i }/μ_{ i } and summing over all k<c_{ i } (and recalling that min(k+1, c_{ i })=k+1 when k<c_{ i }) gives:
Hence, from (9) and (10) we have:
Summing over i∈{1, 2, …, N} gives the required equality for the negative constituents. We have already shown the corresponding equality for the positive constituents, so this completes the proof. □
It follows from Lemma 1 that any policy which is optimal among stationary policies under one reward formulation (either $r$ or $\tilde{r}$) is likewise optimal under the other formulation, with the same long-run average reward. The interchangeability of these two reward formulations will assist us in proving later results.
3. Containment of socially optimal policies
Let us define what we will refer to as ‘selfishly optimal’ and ‘socially optimal’ policies. The terminology used in this paper differs slightly from that which is typically found in the literature on MDPs, and the main reason for this is that we wish to draw analogies with the work of Naor (1969). The policies which we describe as ‘socially optimal’ are those which satisfy the well-known Bellman optimality equations of dynamic programming (introduced by Bellman, 1957), and would be referred to by many authors simply as ‘optimal’ policies; on the other hand, the ‘selfishly optimal’ policies that we will describe could alternatively be referred to as ‘greedy’ or ‘myopic’ policies.
We begin with selfishly optimal policies. Suppose that each customer arriving in the system is allowed to make his or her own decision (as opposed to being directed by a central decision-maker). It is assumed throughout this work that the queueing system is fully observable and therefore the customer is able to observe the exact state of the system, including the length of each queue and the occupancy of each facility (the case of unobservable queues is a separate problem; see, eg, Bell and Stidham, 1983; Haviv and Roughgarden, 2007; Shone et al, 2013). Under this scenario, a customer may calculate their expected net reward (taking into account the expected cost of waiting and the value of service) at each facility based on the number of customers present there using a formula similar to (3); if they act selfishly, they will simply choose the option which maximizes this expected net reward. If the congestion level of the system is such that all of these expected net rewards are negative, we assume that the (selfish) customer’s decision is to balk. This definition of selfish behaviour generalizes Naor’s simple decision rule for deciding whether to join or balk in an M/M/1 system. We note that since the FCFS queue discipline is assumed at each facility, a selfish customer’s behaviour depends only on the existing state, and is not influenced by the knowledge that other customers act selfishly.
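This selfish decision rule can be sketched in code. For a FCFS M/M/c facility, a customer who joins facility i with x_{ i } customers already present has expected sojourn time 1/μ_{ i } if a server is free, and (x_{ i }−c_{ i }+1)/(c_{ i }μ_{ i })+1/μ_{ i } otherwise. A minimal sketch (the function names are ours; ties go to the smallest index, and a zero net reward is preferred to balking, following Naor’s convention):

```python
def expected_sojourn(x_i: int, c: int, mu: float) -> float:
    """Expected time in system for a customer joining a FCFS M/M/c
    facility with x_i customers already present."""
    if x_i < c:
        return 1.0 / mu                       # a server is free
    # Wait for (x_i - c + 1) departures at pooled rate c*mu, then be served.
    return (x_i - c + 1) / (c * mu) + 1.0 / mu

def selfish_action(x, facilities) -> int:
    """Return 0 to balk, or i (1-based) to join facility i, following
    the selfish (myopic) rule described in the text."""
    best_i, best_reward = 0, None
    for i, (c, mu, beta, alpha) in enumerate(facilities, start=1):
        reward = alpha - beta * expected_sojourn(x[i - 1], c, mu)
        # Join only if the expected net reward is non-negative; strict
        # inequality breaks ties in favour of the smallest index.
        if reward >= 0.0 and (best_reward is None or reward > best_reward):
            best_i, best_reward = i, reward
    return best_i
```

For instance, with the two facilities of Example 1 below, a selfish customer arriving to an empty system joins facility 1, while a customer who sees three customers at facility 1 and two at facility 2 balks.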
In the case of ties, we assume that the customer joins the facility with the smallest index i; moreover, balking is never chosen over joining facility i when the expected net reward at i is non-negative. This is in keeping with Naor’s convention.
where h(x) is a relative value function and g* is the optimal long-run average net reward. (We adopt the notational convention that x^{0+}=x to deal with the case where balking is optimal in (11).) Under the anticipatory reward formulation in (3) these optimality equations are similar except that r(x) is replaced by $\tilde{r}(x,a)$, which must obviously be included within the maximization operator. Indeed, by adopting $\tilde{r}$ as our reward formulation we may observe the fundamental difference between the selfishly and socially optimal policies: the selfish policy simply maximizes the immediate reward $\tilde{r}(x,a)$, without taking into account the extra term h(x^{a+}); this is why it may be called a myopic policy. The physical interpretation is that under the selfish policy, customers consider only the outcome to themselves, without taking into account the implications for future customers, who may suffer undesirable consequences as a result of their behaviour.
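For concreteness, under the uniformization (λ+∑_{ i }c_{ i }μ_{ i })=1 adopted later, average-reward optimality equations of this type take the following generic form (a sketch in our notation, not a verbatim restatement of (11)):

```latex
g^{*} + h(x) \;=\; r(x)
  \;+\; \lambda \max_{a \in \{0,1,\ldots,N\}} h\!\left(x^{a+}\right)
  \;+\; \sum_{j=1}^{N} \min(x_j, c_j)\,\mu_j\, h\!\left(x^{j-}\right)
  \;+\; \Bigl(1 - \lambda - \sum_{j=1}^{N} \min(x_j, c_j)\,\mu_j\Bigr) h(x)
```

Under the anticipatory formulation, $r(x)$ is omitted and $\tilde{r}(x,a)$ appears alongside $h(x^{a+})$ inside the maximization.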
Remark

It has already been shown (Lemma 1) that, in an infinite-horizon problem, a stationary policy earns the same long-run average reward under either of the reward formulations $r$ and $\tilde{r}$. However, this equivalence is lost when we consider finite-horizon problems. Indeed, given a finite horizon n, a policy which is optimal under reward function $r$ may perform extremely poorly under $\tilde{r}$. This is especially likely to be the case if n is small.
Two properties of the selfishly optimal policy $\tilde{\theta}$ are worth noting:

1. The decisions made under $\tilde{\theta}$ are entirely independent of the demand rate λ.
2. The threshold b_{ i } (representing the maximum steady-state occupancy at i) is independent of the parameters for the other facilities j≠i.
We will refer to $\tilde{S}=\{x\in S: x_i\leq b_i \text{ for each } i\}$ as the selfishly optimal state space. Note that, due to the convention that the facility with the smallest index i is chosen in the case of a tie between the expected net rewards at two or more facilities, the selfishly optimal policy $\tilde{\theta}$ is unique in any given problem. Changing the ordering of the facilities (and thereby the tie-breaking rules) affects the policy $\tilde{\theta}$ but does not alter the boundaries of $\tilde{S}$.
Example 1
Consider a system with demand rate λ=12 and only two facilities. The first facility has two channels available (c_{ 1 }=2), a service rate μ_{ 1 }=5, holding cost β_{ 1 }=3 and fixed reward α_{ 1 }=1. The parameters for the second facility are c_{ 2 }=2, μ_{ 2 }=1, β_{ 2 }=3 and α_{ 2 }=3, so it offers a higher reward but a slower service rate. We can uniformize the system by taking Δ=1/24, so that (λ+∑_{ i }c_{ i }μ_{ i })Δ=1. The selfishly optimal state space $\tilde{S}$ for this system consists of 12 states. Figure 3 shows the decisions taken at these states under the selfishly optimal policy $\tilde{\theta}$, and also the corresponding decisions taken under a socially optimal policy θ*.
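These thresholds are easy to verify numerically: b_{ i } is the smallest occupancy at which joining facility i ceases to be (weakly) worthwhile for a selfish customer, and the selfishly optimal state space then contains ∏_{ i }(b_{ i }+1) states. A quick check (a sketch, using the M/M/c expected-sojourn formula; function names are ours):

```python
def expected_sojourn(x_i, c, mu):
    # Expected time in system when joining with x_i customers present.
    return 1.0 / mu if x_i < c else (x_i - c + 1) / (c * mu) + 1.0 / mu

def selfish_threshold(c, mu, beta, alpha):
    """Maximum steady-state occupancy b: joining is (weakly) worthwhile
    at occupancies 0, 1, ..., b-1 and unprofitable at b."""
    b = 0
    while alpha - beta * expected_sojourn(b, c, mu) >= 0.0:
        b += 1
    return b

# Example 1 parameters: (c, mu, beta, alpha) for each facility.
b1 = selfish_threshold(2, 5.0, 3.0, 1.0)   # facility 1
b2 = selfish_threshold(2, 1.0, 3.0, 3.0)   # facility 2
n_states = (b1 + 1) * (b2 + 1)             # size of the selfish state space
```

This gives b_{ 1 }=3 and b_{ 2 }=2, so the selfishly optimal state space is {0, …, 3}×{0, …, 2}, with 12 states as claimed.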
By comparing the tables in Figure 3 we may observe the differences between the policies $\tilde{\theta}$ and θ*. At the states (2, 0), (2, 1), (2, 2) and (3, 1), the socially optimal policy θ* deviates from the selfish policy $\tilde{\theta}$ (incidentally, the suboptimality of the selfish policy is about 22%). More striking, however, is the fact that under the socially optimal policy, some of the states in $\tilde{S}$ are actually unattainable under steady-state conditions. Indeed, the recurrent state space S* consists of only six states (enclosed by the bold rectangle in the figure). Thus, for this system, $S^*\subset\tilde{S}$, and in this section we aim to prove that this result holds in general.
Lemma 2

For every state x∈S and discount rate 0<γ<1, we have $v^*_{\gamma}(x)\geq 0$, where $v^*_{\gamma}$ denotes the optimal expected total discounted reward function.
Proof

Let θ_{0} be the trivial policy of balking under every state. Each reward earned under θ_{0} is zero and hence $v_{\theta_0,\gamma}(x)=0$ for all x∈S. Since $v^*_{\gamma}(x)\geq v_{\theta_0,\gamma}(x)$ by definition, the result follows. □
Lemma 3

For every state x∈S, discount rate 0<γ<1 and facility i∈{1, 2, …, N}, we have $v^*_{\gamma}(x^{i+})\leq v^*_{\gamma}(x)$.
Proof

We rely on the finite-horizon optimality equations (for discounted problems) and prove the result using induction on the number of stages. The finite-horizon optimality equations are:
It is sufficient to show that for each state x∈S, discount rate 0<γ<1, facility i∈{1, 2, …, N} and integer n⩾0, the n-stage value functions satisfy $v_{n,\gamma}(x^{i+})\leq v_{n,\gamma}(x)$ (16).
We define $v_{0,\gamma}(x)=0$ for all x∈S. In order to show that (16) holds when n=1, we need to show, for i=1, 2, …, N, that $v_{1,\gamma}(x^{i+})\leq v_{1,\gamma}(x)$.
Indeed, it follows from the definition of $\tilde{r}$ in (3) that $\tilde{r}(x^{i+},a)\leq \tilde{r}(x,a)$ for any fixed action a and facility i. Hence:
Now let us assume that (16) also holds for n=k, where k⩾1 is arbitrary, and aim to show that it holds for n=k+1. We have:
Note that the indicator term in (17) arises because, under state x^{i+}, there may (or may not) be one extra service in progress at facility i, depending on whether or not x_{ i }<c_{ i }. Recall that we assume λ+Σ_{i=1}^{ N }c_{ i }μ_{ i }=1; hence (1−λ−Σ_{j=1}^{ N }min(x_{ j },c_{ j })μ_{ j }−I(x_{ i }<c_{ i })μ_{ i }) must always be non-negative. We also have $v_{k,\gamma}(x^{i+})\leq v_{k,\gamma}(x)$ and $v_{k,\gamma}((x^{j-})^{i+})\leq v_{k,\gamma}(x^{j-})$ (for j=1, 2, …, N) using our inductive assumption of monotonicity at stage k. Hence, in order to verify that (17) is non-positive, it suffices to show:
Here, let a* be a maximizing action on the left-hand side, that is:
By the monotonicity of $\tilde{r}$ and our inductive assumption, we have:
Hence the left-hand side of (18) is bounded above by the corresponding expression with a* applied at state x, which in turn is bounded above by $v_{k+1,\gamma}(x)$. This shows that $v_{k+1,\gamma}(x^{i+})\leq v_{k+1,\gamma}(x)$, which completes the inductive proof that (16) holds for all n⩾0. Using the method of ‘successive approximations’, Ross (1983) proves that $v_{n,\gamma}(x)\to v^*_{\gamma}(x)$ as n→∞ for all x∈S, and so we conclude that $v^*_{\gamma}(x^{i+})\leq v^*_{\gamma}(x)$ as required. □
Lemma 4

For every x∈S, there exists a value M(x)>0 such that, for every discount rate 0<γ<1, $v^*_{\gamma}(x)\geq v^*_{\gamma}(\mathbf{0})-M(x)$,
where 0 denotes the ‘empty system’ state, (0, 0, …, 0).
Proof

Let α_{max}=max_{i∈{1, 2, …, N}}α_{ i } denote the maximum value of service across all facilities. For each discount rate 0<γ<1 and policy θ, let us define a new function w_{ θ,γ } by:
By comparison with the definition of v_{ θ,γ } in (14), we have:
and since the subtraction of a constant from each single-step reward does not affect our optimality criterion, we also have:
where $w^*_{\gamma}=\sup_{\theta}w_{\theta,\gamma}$. By the definition of $\tilde{r}$ in (3) it can be checked that $\tilde{r}(x,a)\leq\alpha_{\max}$ for all state-action pairs (x, a). Therefore $w_{\theta,\gamma}(x)$ is a sum of non-positive terms and must be non-positive itself. Furthermore, w*_{ γ } is the TEDR (total expected discounted reward) function for a new MDP which is identical to our original MDP except that the constant λα_{max} is subtracted from each single-step reward (for n=0, 1, 2, …). Thus, w*_{ γ } satisfies:
Consider x=0^{i+}, for an arbitrary i∈{1, 2, …, N}. Using (20) we have, for all actions a:
In particular, if the action a=0 is to balk then $\tilde{r}(0^{i+},0)=0$ and the only possible transitions are to states 0 or 0^{i+}. Hence:
Then, since γ⩽1 and by the non-positivity of $w^*_{\gamma}(\mathbf{0})$ and $w^*_{\gamma}(0^{i+})$:
From (19) and (21) we derive:
so we have a lower bound for $w^*_{\gamma}(0^{i+})$ which is independent of γ, as required. We need to show that for each x∈S, a lower bound can be found for $w^*_{\gamma}(x)$. Let us form a hypothesis as follows: for each state x∈S, there exists a value ψ(x) such that, for all γ, $w^*_{\gamma}(x)\geq -\lambda\alpha_{\max}\psi(x)$ (23).
We have ψ(0)=0 and, from (22), ψ(0^{i+})=μ_{ i }^{−1} for i=1, 2, …, N. Let us aim to show that (23) holds for an arbitrary x≠0, under the assumption that for all j∈{1, 2, …, N} with x_{ j }⩾1, (23) holds for the state x^{j−}. Using similar steps to those used for 0^{i+} earlier, we have:
and hence:
Then, using our inductive assumption that, for each j∈{1,2,…,N}, $w^*_{\gamma}(x^{j-})$ is bounded below by −λα_{max}ψ(x^{j−}):
Using (19), we conclude that the right-hand side of (24) is also a lower bound for $w^*_{\gamma}(x)$. Therefore we can define:
with the result that $v^*_{\gamma}(x)-v^*_{\gamma}(\mathbf{0})$ is bounded below by an expression which depends only on the system input parameters λ, α_{max} and the service rates μ_{1}, μ_{2}, …, μ_{ N }, as required. Using an inductive procedure, we can derive a lower bound of this form for every x∈S. □
Lemma 5

For all states x∈S and actions a∈{0, 1, 2, …, N}, $\sum_{y\in S}p(y\mid x,a)M(y)<\infty$, where −M(y) is the lower bound for $v^*_{\gamma}(y)-v^*_{\gamma}(\mathbf{0})$ derived in Lemma 4.
Proof

This is immediate from Lemma 4 since, for any x∈S, the number of ‘neighbouring’ states y that can be reached via a single transition from x is finite (regardless of the action chosen), and each M(y) is finite. □
Theorem 1

Consider a sequence of discount rates (γ_{ n }) converging to 1, with $(\theta^*_{\gamma_n})$ the associated sequence of discount-optimal stationary policies. There exists a subsequence (η_{ n }) of (γ_{ n }) such that the limit $\theta^*=\lim_{n\to\infty}\theta^*_{\eta_n}$
exists, and the stationary policy θ* is average reward optimal. Furthermore, the policy θ* yields an average reward g* which, together with a function h(x), satisfies the optimality equations:
Proof

We refer to Sennott (1989), who presents four assumptions which (together) are sufficient for the existence of an average reward optimal stationary policy in an MDP with an infinite state space. From Lemma 2 we have $v^*_{\gamma}(x)\geq 0$ for every x∈S and γ∈(0,1), so a stronger version of Assumption 1 in Sennott (1989) holds. From Lemma 3 we have $v^*_{\gamma}(x^{i+})\leq v^*_{\gamma}(x)$ for all x, i∈{1, 2, …, N} and γ, which implies $v^*_{\gamma}(x)\leq v^*_{\gamma}(\mathbf{0})$ for all x∈S using an inductive argument. Therefore Assumption 2 in Sennott (1989) also holds. Assumptions 3 and 3* in Sennott (1989) follow directly from Lemmas 4 and 5. □
Theorem 2

There exists a stationary policy θ* which satisfies the average reward optimality equations and which induces an ergodic Markov chain on some finite set S* of states contained in $\tilde{S}$.
Proof

From the definition of $\tilde{S}$ in (13) we note that it is sufficient to show that for some stationary optimal policy θ*, the action θ*(x) prescribed under state x∈S is never to join facility i when x_{ i }=b_{ i } (i=1, 2, …, N). The policy θ* described in Theorem 1 is obtained as a limit of the discount-optimal stationary policies $\theta^*_{\eta_n}$. It follows that for every state x∈S there exists an integer U(x) such that $\theta^*_{\eta_n}(x)=\theta^*(x)$ for all n⩾U(x), and therefore it suffices to show that for any discount rate 0<γ<1, the discount-optimal policy θ*_{ γ } forbids joining facility i under states x with x_{ i }=b_{ i }. For a contradiction, suppose x_{ i }=b_{ i } and θ*_{ γ }(x)=i for some state x, facility i and discount rate γ. Then the discount optimality equations in (15) imply:
that is, joining i is preferable to balking at state x. Given that x_{ i }=b_{ i }, we have $\tilde{r}(x,i)<0$ and therefore (26) implies $v^*_{\gamma}(x^{i+})>v^*_{\gamma}(x)$, but this contradicts the result of Lemma 3. □
Lemma 6

Any stationary policy θ* which maximizes the long-run average reward defined in (2) induces an ergodic Markov chain on some set of states contained in $\tilde{S}$.
Proof

Suppose, for a contradiction, that we have a stationary policy θ which maximizes (2) and which, at some positive recurrent state $\hat{x}$ with $\hat{x}_i\geq b_i$, chooses the action of joining facility i. We proceed using a sample path argument. We start two processes at an arbitrary state x_{0}∈S and apply policy θ to the first process, which follows path x(t). Let (x(t), t) denote the state-time of the system. Since θ is stationary, we may abbreviate θ(x(t), t) to θ(x(t)). We also apply a non-stationary policy φ to the second process, which follows path y(t). The policy φ operates as follows: it chooses the same actions as θ at all times, unless the first process is in state $\hat{x}$, in which case φ chooses to balk instead of joining facility i. In notation:
Initially, x(0)=y(0)=x_{0}. Let t_{1} denote the first time, during the system’s evolution, that the first process is in state $\hat{x}$. At this point the process earns a negative reward $\tilde{r}(\hat{x},i)$ by choosing action i; meanwhile, the second process earns a reward of zero by choosing to balk. An arrival may or may not occur at t_{1}; if it does, the first process acquires an extra customer, and if not, both processes remain in state $\hat{x}$ (but nevertheless, due to the reward formulation in (3), the second process earns a greater reward at time t_{1}). Let u_{1} denote the time of the next visit (after time t_{1}) of the first process to the regenerative state 0. In the interval (t_{1}, u_{1}], the first process may acquire a certain number of extra customers at facility i (possibly more than one) in comparison to the second process, due to further arrivals occurring under state $\hat{x}$. Throughout the interval (t_{1}, u_{1}], x(t) dominates y(t) in the sense that every facility has at least as many customers present under x(t) as under y(t). Consequently, at time u_{1} or earlier, the processes are coupled again. At each of the time epochs t_{1}+1, t_{1}+2, …, u_{1} we note that the reward earned by the first process cannot possibly exceed the reward earned by the second process; this is because the presence of extra customers at facility i results in either a smaller reward (if facility i is chosen) or an equal reward (if a different facility, or balking, is chosen). Therefore the total reward earned by the first process up until time u_{1} is smaller than that earned by the second process.
Using similar arguments, we can say that if t_{2} denotes the time of the next visit (after u_{1}) of the first process to state $\hat{x}$, the second process must earn a greater total reward than the first process in the interval (t_{2}, u_{2}], where u_{2} is the time of the next visit (after t_{2}) of the first process to state 0. Given that $\hat{x}$ is positive recurrent under θ, it is visited infinitely often. Hence, by repetition of this argument, it is easy to see that θ is strictly inferior to the non-stationary policy φ in terms of expected long-run average reward. We know (by Theorem 1) that an optimal stationary policy exists, so there must be another stationary policy which is superior to θ. □
Theorem 2 may be regarded as a generalisation of a famous result due to Naor (1969), who shows (in the context of a single M/M/1 queue) that the selfishly optimal and socially optimal strategies are both threshold strategies, with thresholds n_{ s } and n_{ o } respectively, and that n_{ o }⩽n_{ s }. This is the M/M/1 version of the containment property which we have proved for multiple, heterogeneous facilities (each with multiple service channels allowed). We also note that Theorem 2 assures us of being able to find a socially optimal policy by searching within the class of stationary policies which remain ‘contained’ in the finite set $\tilde{S}$. This means that we can apply the established techniques of dynamic programming (eg, value iteration, policy improvement) by restricting the state space so that it only includes states in $\tilde{S}$; any policy that would take us outside $\tilde{S}$ can be ignored, since we know that such a policy would be suboptimal. For example, when implementing value iteration, we loop over all states in $\tilde{S}$ on each iteration and simply restrict the set of actions so that joining facility i is not allowed at any state x with x_{ i }=b_{ i }. This ‘capping’ technique enables us to avoid the use of alternative techniques which have been proposed in the literature for searching for optimal policies on infinite state spaces (see, eg, the method of ‘approximating sequences’ proposed by Sennott (1991), or Ha’s (1997) method of approximating the limiting behaviour of the value function).
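As an illustration of the capping technique, relative value iteration under the anticipatory reward formulation can be run over the selfish state space only, with joining facility i disallowed at any state with x_{ i }=b_{ i }. The following sketch is our own code, not the authors’ implementation; it reuses the parameters and selfish bounds of Example 1 and computes the optimal gain g*:

```python
import itertools

# Parameters of Example 1 (two facilities) and its selfish bounds b_i.
lam = 12.0
C, MU = [2, 2], [5.0, 1.0]
BETA, ALPHA = [3.0, 3.0], [1.0, 3.0]
B = [3, 2]
N = len(C)
delta = 1.0 / (lam + sum(c * m for c, m in zip(C, MU)))   # = 1/24

def sojourn(x_i, i):
    """Expected sojourn time at facility i with x_i customers present."""
    return 1/MU[i] if x_i < C[i] else (x_i - C[i] + 1)/(C[i]*MU[i]) + 1/MU[i]

def r_tilde(x, a):
    """Anticipatory reward of action a (0 = balk, i = join facility i)."""
    return 0.0 if a == 0 else ALPHA[a-1] - BETA[a-1] * sojourn(x[a-1], a-1)

def succ(x, a):
    """State reached if an arriving customer takes action a."""
    return x if a == 0 else tuple(x[j] + (j == a-1) for j in range(N))

states = list(itertools.product(*(range(b + 1) for b in B)))
origin = (0,) * N
v = {x: 0.0 for x in states}

for _ in range(50000):
    new_v, policy = {}, {}
    for x in states:
        # Capping: joining i is forbidden on the boundary x_i = b_i.
        acts = [0] + [i + 1 for i in range(N) if x[i] < B[i]]
        a = max(acts, key=lambda a: r_tilde(x, a) + v[succ(x, a)])
        total = lam * delta * (r_tilde(x, a) + v[succ(x, a)])
        stay = 1.0 - lam * delta
        for j in range(N):
            if x[j] == 0:
                continue
            rate = min(x[j], C[j]) * MU[j] * delta
            stay -= rate
            total += rate * v[tuple(x[k] - (k == j) for k in range(N))]
        new_v[x] = total + stay * v[x]
        policy[x] = a
    diffs = [new_v[x] - v[x] for x in states]
    gain, span = new_v[origin], max(diffs) - min(diffs)
    v = {x: new_v[x] - new_v[origin] for x in states}
    if span < 1e-10:
        break

g_star = gain / delta   # optimal long-run average net reward per unit time
```

Because the action set is restricted on the boundary, the resulting policy automatically satisfies the containment property of Theorem 2.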
4. Comparison with unobservable systems
The results proved in Section 3 bear certain analogies to results which may be proved for systems of unobservable queues, in which routing decisions are made independently of the state of the system. In this section we briefly discuss the case of unobservable queues, in order to draw comparisons with the observable case. Comparisons between selfishly and socially optimal policies in unobservable queueing systems have already received considerable attention in the literature (see, eg, Littlechild, 1974; Edelson and Hildebrand, 1975; Bell and Stidham, 1983; Haviv and Roughgarden, 2007; Knight and Harper, 2013).
Consider a multiple-facility queueing system with a formulation identical to that given in Section 2, but with the added stipulation that the action a_{ n } chosen at time step n must be selected independently of the system state x_{ n }. In effect, we assume that the system state is hidden from the decision-maker. Furthermore, the decision-maker lacks the ability to ‘guess’ the state of the system based on the waiting times of customers who have already passed through the system, and must simply assign customers to facilities according to a vector of routing probabilities (p_{1}, p_{2}, …, p_{ N }) which remains constant over time. We assume that Σ_{i=1}^{ N }p_{ i }⩽1, where p_{ i } is the probability of routing a customer to facility i. Hence, p_{0}≔1−Σ_{i=1}^{ N }p_{ i } is the probability that a customer will be rejected.
where L_{ i }(λ_{ i }) is the expected number of customers present at i under demand rate λ_{ i }. In this context, a socially optimal policy is a vector (λ*_{1,} λ*_{2, …,} λ*_{ N }) which maximizes the sum Σ_{i=1}^{ N }g_{ i }(λ*_{ i }). On the other hand, a selfishly optimal policy is a vector $(\tilde{\lambda}_1, \tilde{\lambda}_2, \ldots, \tilde{\lambda}_N)$ which causes the system to remain in equilibrium, in the sense that no self-interested customer has an incentive to deviate from the randomized policy in question (see Bell and Stidham, 1983, p 834). More specifically, individual customers make decisions according to a probability distribution $(\tilde{p}_0, \tilde{p}_1, \ldots, \tilde{p}_N)$ (where $\tilde{\lambda}_i=\lambda\tilde{p}_i$ for each i∈{1, 2, …, N}) and, in order for equilibrium to be maintained, it is necessary for all of the actions chosen with non-zero probability to yield the same expected net reward.
First of all, it is worth making the point that no theoretical upper bound exists for the number of customers who may be present at any individual facility i under a Poisson demand rate λ_{ i } which is independent of the system state (unless, of course, λ_{ i }=0). Indeed, standard results for M/M/c queues (see Gross and Harris, 1998, p 69) imply that the steady-state probability of n customers being present at a facility with a positive demand rate is positive for each n⩾0. As such, the positive recurrent state spaces under the selfishly and socially optimal policies are both unbounded in the unobservable case, and there is no prospect of being able to prove a ‘containment’ result similar to that of Theorem 2. However, it is straightforward to prove an alternative result involving the total effective admission rates under the two policies which is consistent with the general theme of socially optimal policies generating ‘less busy’ systems than their selfish counterparts.
It is worth noting that the theory of nonatomic routing games (see Roughgarden, 2005) assures us that the equilibrium and socially optimal policies both exist and are unique. This allows a simple argument to be formed in order to show that the sum of the joining rates at the individual facilities under a socially optimal policy (let us denote this by η*) cannot possibly exceed the corresponding sum under an equilibrium policy (denoted η̃). Indeed, it is clear that if the system demand rate λ is sufficiently large, then the selfishly and socially optimal joining rates at any individual facility i will attain their ‘ideal’ values λ̃_{ i } and λ*_{ i } (as depicted in Figure 4), respectively, and so in this case it follows trivially that η*⩽η̃. On the other hand, suppose that λ is not large enough to permit w_{ i }(λ_{ i })=0 for all facilities i. In this case, w_{ i }(λ_{ i }) must be strictly positive for some facility i, and therefore the probability of a customer balking under an equilibrium strategy is zero (since balking is unfavourable in comparison to joining facility i). Hence, one has η̃=λ in this case, and since η* is also bounded above by λ, the result η*⩽η̃ follows.
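The single-facility case already illustrates this ordering in closed form. For an unobservable M/M/1 facility with reward α per served customer and waiting cost β per unit time (so expected sojourn time is 1/(μ−x) at joining rate x), the classical results of Edelson and Hildebrand (1975) give an equilibrium joining rate μ−β/α and a socially optimal rate μ−√(βμ/α); the numbers below are our own illustrative assumptions:

```python
import math

# Classical unobservable M/M/1 comparison: the equilibrium rate solves
# alpha = beta/(mu - x) (zero net reward), while the social optimum
# maximizes x*(alpha - beta/(mu - x)), giving x = mu - sqrt(beta*mu/alpha).
# Both rates are also capped by the total demand rate lam and by 0.

def joining_rates(mu, alpha, beta, lam):
    lam_eq = min(lam, max(0.0, mu - beta / alpha))
    lam_soc = min(lam, max(0.0, mu - math.sqrt(beta * mu / alpha)))
    return lam_eq, lam_soc

# Illustrative parameters: mu = 4, alpha = 3, beta = 2, ample demand.
lam_eq, lam_soc = joining_rates(mu=4.0, alpha=3.0, beta=2.0, lam=10.0)
```

Since √(βμ/α)⩾β/α whenever μ⩾β/α, the socially optimal rate never exceeds the equilibrium rate, mirroring the multi-facility result η*⩽η̃ above.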
The conclusion of this section is that, while the ‘containment’ property of observable systems proved in Section 3 does not have an exact analogue in the unobservable case, the general principle that selfish customers create ‘busier’ systems still persists (albeit in a slightly different guise).
5. Heterogeneous customers
An advantage of the anticipatory reward formulation in (3) is that it enables the results from Section 3 to be extended to a scenario involving heterogeneous customers without requiring a redescription of the state space S. Suppose we have M⩾2 customer classes, and customers of the ith class arrive in the system via their own independent Poisson process with demand rate λ_{ i } (i=1, 2, …, M). In this case we assume, without loss of generality, that ∑_{ i }λ_{ i }+∑_{ j }c_{ j }μ_{ j }=1. For convenience we will define λ:=∑_{ i }λ_{ i } as the total demand rate. We allow the holding costs and fixed rewards in our model (but not the service rates) to depend on these customer classes; that is, the fixed reward for serving a customer of class i at facility j is now α_{ ij }, and the holding cost (per unit time) is β_{ ij }. Various physical interpretations of this model are possible; for example, suppose we have a healthcare system in which patients arrive from various different geographical locations. Then the parameters α_{ ij } and β_{ ij } may be configured according to the distance of service provider j from region i (among other factors), so that patients’ commuting costs are taken into account.
for i=1, 2, …, M. We note that expanding the action set in this manner is not the only possible way of formulating our new model (with heterogeneous customers) as an MDP, but it is the natural extension of the formulation adopted in the previous section. An alternative approach would be to augment the state space so that information about the class of the most recent customer to arrive is included in the state description; actions would then need to be chosen only at arrival epochs, and these actions would simply be integers in the set {0, 1, …, N} as opposed to vectors (see Puterman, 1994, p 568, for an example involving admission control in an M/M/1 queue). By keeping the state space S unchanged, however, we are able to show that the results of Section 3 can be generalized very easily.
Note that the maximization in (28) can be carried out in a componentwise fashion, so that instead of having to find the maximizer among all vectors a in A (of which the total number is (N+1)^{ M }), we can simply find, for each customer class i∈{1, 2, …, M}, the ‘marginal’ action a_{ i } which maximizes the class-i component of the objective in (28). This can be exploited in the implementation of dynamic programming algorithms (eg, value iteration), so that computation times increase only in proportion to the number of customer classes M.
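The componentwise reduction can be sketched as follows: for an objective that separates across classes, a per-class argmax agrees with brute-force enumeration of all (N+1)^{ M } action vectors (the per-class values below are arbitrary illustrative numbers, not quantities from (28)):

```python
import itertools

# Componentwise maximization: with M classes and actions
# a = (a_1, ..., a_M) in {0, ..., N}^M, a separable objective
# sum_i value[i][a_i] can be maximized one component at a time.

N, M = 2, 3
# value[i][a] = hypothetical marginal value of action a for class i
value = [[0.0, 1.5, 0.7],
         [0.2, -0.3, 1.1],
         [0.9, 0.4, 0.8]]

# Brute force over all (N+1)**M action vectors.
brute = max(itertools.product(range(N + 1), repeat=M),
            key=lambda a: sum(value[i][a[i]] for i in range(M)))

# Componentwise: independent argmax for each class (M*(N+1) work).
component = tuple(max(range(N + 1), key=lambda a: value[i][a])
                  for i in range(M))

assert brute == component  # both give (1, 2, 0) here
```

The work per dynamic-programming update thus falls from (N+1)^{ M } evaluations to M(N+1), which is what makes value iteration tractable as M grows.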
Example 2

We modify Example 1 from earlier so that there are now two classes of customer, with demand rates λ_{ 1 }=12 and λ_{ 2 }=10 respectively. The first class has the same cost and reward parameters as in Example 1; that is, β_{ 11 }=3, α_{ 11 }=1 (for the first facility) and β_{ 12 }=3, α_{ 12 }=3 (for the second facility). The second class of customer has steeper holding costs and a much greater value of service at the second facility: β_{ 21 }=β_{ 22 }=5, α_{ 21 }=1, α_{ 22 }=12. Both facilities have two service channels and the service rates μ_{ 1 }=5 and μ_{ 2 }=1 remain independent of customer class. We take Δ=(∑_{ i }λ_{ i }+∑_{ j }c_{ j }μ_{ j })^{ −1 }=1/34 to uniformize the system.
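The uniformization constant quoted above can be verified directly from the parameters of Example 2:

```python
# Uniformization constant for Example 2: total event rate is the sum
# of the class demand rates plus the total service capacity,
# 12 + 10 + 2*5 + 2*1 = 34, giving Delta = 1/34.

lam = [12, 10]          # class demand rates lambda_1, lambda_2
c = [2, 2]              # servers at each facility
mu = [5, 1]             # service rates mu_1, mu_2
total_rate = sum(lam) + sum(ci * mi for ci, mi in zip(c, mu))
Delta = 1 / total_rate
assert total_rate == 34
```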
We have previously seen that customers of the first class acting selfishly will cause the system state to remain within a set of 12 states under steady-state conditions, with x_{1}⩽3 and x_{2}⩽2 at all times. The incorporation of a second class of customer has no effect on the selfish decisions made by the first class of customer, so (as shown in Figure 5) these decisions remain the same as shown in Figure 3 previously. The first table in Figure 5 shows that selfish customers of the second class are unwilling to join the first facility when x_{1}⩾2; however, under certain states they will choose to join the second facility when x_{2}=3 (but never when x_{2}>3). As a result, the selfish state space S̃ is expanded from 12 states to 20. Figure 5 shows that the new selfish state space S̃ may be represented diagrammatically as the smallest rectangle which encompasses both S̃_{1} and S̃_{2}, where (for i=1, 2) we have defined:
It is somewhat interesting to observe that S̃ includes states in the ‘intersection of complements’ (S̃_{1})^{c}∩(S̃_{2})^{c}. These states would not occur (under steady-state conditions) if the system operated with only a single class of customer of either type, but they do occur with both customer types present.
The policy θ* depicted in the second table in Figure 5 has been obtained using value iteration, and illustrates the containment property for systems with heterogeneous customer classes. It may easily be seen that the socially optimal state space S* consists of only 9 states; under steady-state conditions, the system will remain within this smaller class of states. We also observe that, unlike the selfish decisions, the socially optimal decisions for a particular class of customer are affected by the decisions made by the other class of customer (as can be seen, in the case of the first customer class, by direct comparison with Figure 3 from Example 1). Indeed, under θ*, customers of the first class never join Facility 2, and customers of the second class never join Facility 1.
with the resulting inequality contradicting the result of (the modified) Lemma 3. The sample path argument in Lemma 6 can be applied to a customer of any class, with only trivial adjustments needed.
6. Conclusions
The principle that selfish users create busier systems is well-observed in the literature on behavioural queueing theory. While this principle is interesting in itself, we also believe that it has the potential to be utilized much more widely in applications. As we have demonstrated, the search for a socially optimal policy may be greatly simplified by reducing the search space according to the bounds of the corresponding ‘selfish’ policy, so that the methods of dynamic programming can be more easily employed.
Our results in this paper hold for an arbitrary number of facilities N, and (in addition) the results in Section 5 hold for an arbitrary number of customer classes M. This lack of restriction makes the results powerful from a theoretical point of view, but we must also point out that in practice, the ‘curse of dimensionality’ often prohibits the exact computation of optimal policies in large-scale systems, even when the state space can be assumed finite. This problem could be partially addressed if certain structural properties (eg, monotonicity properties) of socially optimal policies could be proved with the same level of generality as our ‘containment’ results. It can be shown trivially that selfish policies are monotonic in various respects (eg, balking at the state x implies balking at the state x^{j+}, for any facility j) and, indeed, the optimality of monotone policies is a popular theme in the literature, although in our experience such properties are usually far from trivial to prove for an arbitrary number of facilities. In future work, we intend to investigate how the search for socially optimal policies can be further simplified by exploiting their theoretical structure.
References
 Argon NT, Ding L, Glazebrook KD and Ziya S (2009). Dynamic routing of customers with general delay costs in a multiserver queuing system. Probability in the Engineering and Informational Sciences 23(2): 175–203.
 Bell CE and Stidham S (1983). Individual versus social optimization in the allocation of customers to alternative servers. Management Science 29(7): 831–839.
 Bellman RE (1957). Dynamic Programming. Princeton University Press: Princeton, NJ.
 Cavazos-Cadena R (1989). Weak conditions for the existence of optimal stationary policies in average Markov decision chains with unbounded costs. Kybernetika 25(3): 145–156.
 Cinlar E (1975). Introduction to Stochastic Processes. Prentice-Hall: Englewood Cliffs, NJ.
 Economou A and Kanta S (2008). Optimal balking strategies and pricing for the single server Markovian queue with compartmented waiting space. Queueing Systems 59(3): 237–269.
 Edelson NM and Hildebrand DK (1975). Congestion tolls for Poisson queuing processes. Econometrica 43(1): 81–92.
 Glazebrook KD, Kirkbride C and Ouenniche J (2009). Index policies for the admission control and routing of impatient customers to heterogeneous service stations. Operations Research 57(4): 975–989.
 Grassmann W (1983). The convexity of the mean queue size of the M/M/c queue with respect to the traffic intensity. Journal of Applied Probability 20(4): 916–919.
 Gross D and Harris C (1998). Fundamentals of Queueing Theory. John Wiley & Sons: New York.
 Guo P and Li Q (2013). Strategic behavior and social optimization in partially-observable Markovian vacation queues. Operations Research Letters 41(3): 277–284.
 Ha A (1997). Optimal dynamic scheduling policy for a make-to-stock production system. Operations Research 45(1): 42–53.
 Haviv M and Roughgarden T (2007). The price of anarchy in an exponential multi-server. Operations Research Letters 35(4): 421–426.
 Knight VA and Harper PR (2013). Selfish routing in public services. European Journal of Operational Research 230(1): 122–132.
 Knight VA, Williams JE and Reynolds I (2012). Modelling patient choice in healthcare systems: Development and application of a discrete event simulation with agent-based decision making. Journal of Simulation 6(2): 92–102.
 Knudsen NC (1972). Individual and social optimization in a multiserver queue with a general cost-benefit structure. Econometrica 40(3): 515–528.
 Lee HL and Cohen AM (1983). A note on the convexity of performance measures of M/M/c queueing systems. Journal of Applied Probability 20(4): 920–923.
 Lippman SA (1975). Applying a new device in the optimisation of exponential queueing systems. Operations Research 23(4): 687–710.
 Lippman SA and Stidham S (1977). Individual versus social optimization in exponential congestion systems. Operations Research 25(2): 233–247.
 Littlechild SC (1974). Optimal arrival rate in a simple queueing system. International Journal of Production Research 12(3): 391–397.
 Naor P (1969). The regulation of queue size by levying tolls. Econometrica 37(1): 15–24.
 Puterman ML (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley & Sons: New York.
 Ross SM (1983). Introduction to Stochastic Dynamic Programming. Academic Press: New York.
 Roughgarden T (2005). Selfish Routing and the Price of Anarchy. MIT Press: Cambridge, MA.
 Sennott LI (1989). Average cost optimal stationary policies in infinite state Markov decision processes with unbounded costs. Operations Research 37(4): 626–633.
 Sennott LI (1991). Value iteration in countable state average cost Markov decision processes with unbounded costs. Annals of Operations Research 28(1): 261–272.
 Serfozo R (1979). An equivalence between continuous and discrete time Markov decision processes. Operations Research 27(3): 616–620.
 Shone R, Knight VA and Williams JE (2013). Comparisons between observable and unobservable M/M/1 queues with respect to optimal customer behavior. European Journal of Operational Research 227(1): 133–141.
 Stidham S (1978). Socially and individually optimal control of arrivals to a GI/M/1 queue. Management Science 24(15): 1598–1610.
 Stidham S and Weber RR (1993). A survey of Markov decision models for control of networks of queues. Queueing Systems 13(1): 291–314.
 Sun W, Guo P and Tian N (2010). Equilibrium threshold strategies in observable queueing systems with setup/closedown times. Central European Journal of Operations Research 18(3): 241–268.
 Wang J and Zhang F (2011). Equilibrium analysis of the observable queues with balking and delayed repairs. Applied Mathematics and Computation 218(6): 2716–2729.
 Yechiali U (1971). On optimal balking rules and toll charges in the GI/M/1 queuing process. Operations Research 19(2): 349–370.
 Yechiali U (1972). Customers’ optimal joining rules for the GI/M/s queue. Management Science 18(7): 434–443.
 Zijm H (1985). The optimality equations in multichain denumerable state Markov decision processes with the average cost criterion: The bounded cost case. Statistics and Decisions 3(1): 143–165.
Copyright information
This work is licensed under a Creative Commons Attribution 3.0 Unported License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/