Redundancy scheduling with scaled Bernoulli service requirements

Redundancy scheduling has emerged as a powerful strategy for improving response times in parallel-server systems. The key feature in redundancy scheduling is replication of a job upon arrival by dispatching replicas to different servers. Redundant copies are abandoned as soon as the first of these replicas finishes service. By creating multiple service opportunities, redundancy scheduling increases the chance of a fast response from a server that is quick to provide service, and mitigates the risk of a long delay incurred when a single selected server turns out to be slow. The diversity enabled by redundant requests has been found to strongly improve the response time performance, especially in case of highly variable service requirements. Analytical results for redundancy scheduling are unfortunately scarce however, and even the stability condition has largely remained elusive so far, except for exponentially distributed service requirements. In order to gain further insight into the role of the service requirement distribution, we explore the behavior of redundancy scheduling for scaled Bernoulli service requirements. We establish a sufficient stability condition for generally distributed service requirements and we show that, for scaled Bernoulli service requirements, this condition is also asymptotically nearly necessary. This stability condition differs drastically from the exponential case, indicating that the stability condition depends on the service requirements in a sensitive and intricate manner.


Introduction
Redundancy scheduling has recently attracted strong interest as a strategy for significantly reducing response times in parallel-server systems [1,2,4,5,6,7,8,11,12,13,14]. The key feature in redundancy scheduling is replication of a job upon arrival allowing replicas to be assigned to, say, d different servers, chosen uniformly at random (without replacement). Redundant replicas are abandoned as soon as the first of these replicas either starts service ('cancel-on-start' or c.o.s.) or completes service ('cancel-on-completion' or c.o.c.). By creating multiple service opportunities, redundancy scheduling boosts the chance of a fast response from a server that is swift to provide service, and alleviates the risk of a long delay incurred when a job is assigned to a single server that may be slow. Note that the c.o.c. and c.o.s. policies both ensure that the first replica starts service at the server with the smallest workload among the d selected servers. The possibly concurrent service of multiple replicas under the c.o.c. policy provides a further hedge against potentially slow execution of the first replica in case replicas are independent (although it may also result in wastage of service effort).
The diversity offered by redundant requests has been shown to strongly improve the response time performance, especially in case of highly variable service requirements. Analytical results for redundancy scheduling are unfortunately scarce however, and have largely remained limited to exponentially distributed service requirements. Specifically, Gardner et al. [6] extensively analyzed the c.o.c. redundancy policy with exponentially distributed service requirements. They established the stability condition and showed that it does not depend on the number of replicas d, and thus coincides with the nominal condition without any redundancy. This may be explained from the fact that even with concurrent service the expected aggregate amount of time invested in the service of a job remains equal to the mean service requirement of a single instance due to the memoryless property of the exponential distribution. Gardner et al. [6] also derived an explicit expression for the expected latency, and proved that the latency is decreasing in the number of replicas d. Simulation experiments additionally demonstrated greater improvements in the latency in case of highly variable service requirements, particularly heavy-tailed distributions.
We are not aware of any analytical results for the c.o.c. redundancy policy with independent replicas and nonexponential service requirements. Hellemans & Van Houdt [10] consider the c.o.c. policy with identical replicas, and derive a differential equation for the marginal workload distribution at each of the servers in a limiting regime where the number of servers grows large. While the differential equation implicitly captures the stability condition, it does not yield any analytical expression, and the derivations for identical replicas rely on highly specific arguments that do not extend to independent replicas. It is also worth observing that the c.o.s. redundancy policy is equivalent to a power-of-d version of the Join-the-Smallest-Workload policy. While the workload and waiting-time distributions for these policies do not appear analytically tractable, the stability condition is simple and coincides with the nominal condition without any redundancy since no concurrent service takes place.
In order to gain further insight into the role of the service requirement distribution, we focus in the present paper on the behavior of the c.o.c. redundancy policy for scaled Bernoulli service requirements. While this is admittedly a rather special case, it provides a typical instance of highly variable service requirements for which redundancy scheduling is particularly relevant, and is also of intrinsic merit given the paucity of analytical results for general service requirement distributions.
First of all, we establish a simple sufficient stability condition in terms of a lower bound for the system capacity, i.e., the maximum aggregate load that can be supported. The lower bound is obtained from a stochastic coupling between the maximum workload across all the servers and the workload in a related single-server queue with the same arrival process and a service requirement that corresponds to the minimum service requirement across d replicas. The lower bound for the system capacity grows without bound with (a) the 'scale' of the service requirement and (b) the number of replicas d, but remarkably enough (c) does not depend on the number of servers at all (assuming that number to be at least equal to the number of replicas d). The 'scale' of the service requirement here refers to its non-zero value relative to its mean and provides a proxy for the degree of variability. The growth in the system capacity with (a) and (b) reflects the huge benefits provided by redundancy scheduling for highly variable service requirements.
In view of (c), the lower bound may at first sight seem loose for a larger number of servers, but we will use a further stochastic comparison argument to prove that it is in fact asymptotically tight when the scale of the service requirement grows suitably large. This implies that increasing the number of replicas significantly increases the system capacity, while adding servers does not asymptotically. Or stated differently, given the number of replicas d, redundancy scheduling ensures that asymptotically just d servers suffice to achieve the capacity achievable with any number of servers, which further highlights the great gains provided by redundancy scheduling for highly variable service requirements.
The remainder of the paper is organized as follows. In Section 2 we present a detailed model description and state a sufficient stability condition for generally distributed service requirements. In Section 3 we prove that this condition is also asymptotically nearly necessary for scaled Bernoulli service requirements. An upper bound for the expected waiting time is derived in Section 4 and in Section 5 we provide a conclusion.

Workload model and sufficient stability condition
We consider a system with N parallel servers. Jobs arrive according to a Poisson process with rate λ. Each arriving job is replicated and immediately allocated to d servers chosen uniformly at random (without replacement). The replicas at each server are served in order of arrival (FCFS) and the job is completed as soon as the first replica finishes service, whereafter the other d − 1 replicas are instantaneously abandoned. The service requirements of the d replicas are assumed to be independent and identically distributed (i.i.d.) copies of some random variable B. Note that this model corresponds to the independent runtime (IR) model described in [6].
Let $\omega = (\omega_1, \dots, \omega_N)$ denote the workload of the system, where $\omega_i$ is the workload at server $i$, for $i = 1, \dots, N$. Here we define workload as the real amount of work, i.e., the amount of work a server needs to complete to become idle in the absence of any arrivals. This may be smaller than the sum of the service requirements of all the replicas at the server since some replicas may get partly or entirely abandoned, see Example 1. Let $s_j$ and $b_j$ denote the sampled server and the realized service requirement of the $j$-th replica, respectively, for $j = 1, \dots, d$. The first replica will finish service on server $s_{j^*}$, where $j^* = \arg\min_{j \in \{1,\dots,d\}} (\omega_{s_j} + b_j)$. The workload of server $s_j$ is then $\max\{\omega_{s_{j^*}} + b_{j^*}, \omega_{s_j}\}$, for $j = 1, \dots, d$.
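The update rule above can be sketched in a few lines of Python. This is only an illustration under the stated model; the function name `coc_arrival` and the example numbers are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def coc_arrival(workloads: Sequence[float],
                servers: Sequence[int],
                requirements: Sequence[float]) -> List[float]:
    """Return the workload vector just after one arrival under c.o.c."""
    # Workload level at which the first replica completes: min_j (w[s_j] + b_j).
    finish = min(workloads[s] + b for s, b in zip(servers, requirements))
    new = list(workloads)
    for s in servers:
        # A sampled server is raised to `finish` unless it already holds more work.
        new[s] = max(finish, workloads[s])
    return new


# Hypothetical example: N = 4 servers, d = 2 replicas sent to servers 0 and 2.
after = coc_arrival([3.0, 1.0, 0.5, 2.0], servers=[0, 2], requirements=[4.0, 1.0])
print(after)
```

In this example the replica at server 2 finishes first (at workload level 1.5), so server 0 keeps its workload of 3.0 and gains nothing from the abandoned replica.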
Let $\omega_{(\cdot)}$ denote the workloads arranged in descending order, thus $\omega_{(1)} \ge \omega_{(2)} \ge \dots \ge \omega_{(N)}$. Throughout this paper we refer to synchronicity as the situation in which all workloads are equal, i.e., $\omega_1 = \dots = \omega_N$. Moreover, let $S_{trun}$ denote the truncated state space of the ordered workload vectors with $\omega_{(1)} = \dots = \omega_{(d)}$. The next property states that the d largest workloads will always be equal from some point onward. We will later see that under certain conditions the system will in fact be in full synchronicity nearly all the time.

Property 1
If ω ∈ Strun then ωnew ∈ Strun, where ωnew is any future workload. In other words, once the largest d workloads are equal, they will always remain equal.
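Property 1 can be checked numerically on a sample path. The sketch below is a sanity check, not a proof; it assumes scaled Bernoulli requirements with X ≡ 1 and P(B = 0) = 1 − 1/K, and all function names and parameter values are hypothetical.

```python
import random


def step(workloads, elapsed, servers, requirements):
    """Drain every server for `elapsed` time units, then apply one c.o.c. arrival."""
    drained = [max(0.0, w - elapsed) for w in workloads]
    finish = min(drained[s] + b for s, b in zip(servers, requirements))
    for s in servers:
        drained[s] = max(finish, drained[s])
    return drained


def top_d_equal(workloads, d):
    """True if the d largest workloads coincide, i.e. the state lies in S_trun."""
    ordered = sorted(workloads, reverse=True)
    return ordered[0] == ordered[d - 1]


def check_property_1(n=5, d=2, k=4.0, lam=2.0, arrivals=20_000, seed=1):
    rng = random.Random(seed)
    p = 1.0 - 1.0 / k                      # assumed probability of a zero requirement
    workloads = [0.0] * n                  # the empty state lies in S_trun
    for _ in range(arrivals):
        elapsed = rng.expovariate(lam)     # Poisson arrivals with rate lam
        servers = rng.sample(range(n), d)  # d distinct servers, uniformly at random
        requirements = [k if rng.random() >= p else 0.0 for _ in range(d)]
        workloads = step(workloads, elapsed, servers, requirements)
        if not top_d_equal(workloads, d):
            return False
    return True


print(check_property_1())
```

Exact floating-point equality is safe here because equal workloads are always updated by identical operations.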
Before stating and proving a sufficient stability condition, we prove the following lemma for generally distributed service requirements.

Lemma 1
The sequence of maximum workloads $\omega_{(1)}$ at arbitrary epochs is stochastically upper bounded by the sequence of workloads $\omega_{M/G/1}$ in a corresponding M/G/1 queue with arrival rate $\lambda_{M/G/1} = \lambda$ and generic service requirement $B_{M/G/1} = \min\{B_1, \dots, B_d\}$, provided that the initial maximum workload $\omega_{(1)}$ is smaller than the initial workload in the M/G/1 queue.
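The bound in Lemma 1 can be illustrated by driving both systems with the same arrivals and the same realized requirements. The sketch below assumes scaled Bernoulli requirements with X ≡ 1 and P(B = 0) = 1 − 1/K, starts both systems empty, and compares the two workloads just after each arrival; the function name and default parameters are hypothetical.

```python
import random


def coupled_run(n=6, d=2, k=5.0, lam=3.0, arrivals=50_000, seed=0):
    """Return True if max workload <= M/G/1 workload after every arrival."""
    rng = random.Random(seed)
    p = 1.0 - 1.0 / k                  # assumed probability of a zero requirement
    workloads = [0.0] * n              # redundancy system, initially empty
    v = 0.0                            # coupled M/G/1 workload, initially empty
    for _ in range(arrivals):
        elapsed = rng.expovariate(lam)
        # Both systems drain at unit rate between arrivals.
        workloads = [max(0.0, w - elapsed) for w in workloads]
        v = max(0.0, v - elapsed)
        servers = rng.sample(range(n), d)
        b = [k if rng.random() >= p else 0.0 for _ in range(d)]
        finish = min(workloads[s] + bj for s, bj in zip(servers, b))
        for s in servers:
            workloads[s] = max(finish, workloads[s])
        v += min(b)                    # M/G/1 job of size min{B_1, ..., B_d}
        if max(workloads) > v:
            return False
    return True


print(coupled_run())
```

Since both workloads drain at the same rate between arrivals, checking the ordering at arrival epochs is enough for this sample-path comparison.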

Remark 1
Observe that in synchronicity, in which all servers have the maximum workload, the bound min j∈{1,...,d} bj is tight, since here every arrival adds exactly min j∈{1,...,d} bj work to each of the d sampled servers.

Proposition 1 A sufficient stability condition is
$$\lambda \, E[\min\{B_1, \dots, B_d\}] < 1. \qquad (1)$$
Proof: By Lemma 1 we know that the maximum workload in the system is bounded by the workload in a corresponding M/G/1 queue with arrival rate $\lambda_{M/G/1} = \lambda$ and generic service requirement $B_{M/G/1} = \min\{B_1, \dots, B_d\}$. The (necessary and sufficient) stability condition for the latter M/G/1 queue is given by $\lambda_{M/G/1} E[B_{M/G/1}] = \lambda E[\min\{B_1, \dots, B_d\}] < 1$.
In case N = d, the above condition is not only sufficient but in fact also necessary since the system behaves exactly as the corresponding M/G/1 queue, see also [11]. In case N > d the above condition is no longer strictly necessary. However, we will show that, surprisingly, it is asymptotically nearly necessary for independent scaled Bernoulli service requirements, which are defined as
$$B = \begin{cases} X \cdot K & \text{w.p. } 1 - p, \\ 0 & \text{w.p. } p, \end{cases}$$
where K is a fixed positive real number, and X is a general strictly positive random variable with E[X] = 1. Moreover, we assume that E[B] = 1, which implies that $(1 - p) K = 1$, i.e., $p = 1 - 1/K$. From Proposition 1 it follows that for independent scaled Bernoulli service requirements the sufficient stability condition reduces to
$$\lambda (1 - p)^d K \, E[\min\{X_1, \dots, X_d\}] < 1, \quad \text{i.e.,} \quad \lambda < \frac{K^{d-1}}{E[\min\{X_1, \dots, X_d\}]}, \qquad (2)$$
since all jobs, other than type-A jobs (jobs for which all d replicas have a non-zero service requirement), which have arrival rate $(1 - p)^d \lambda$ and service requirement $\min\{X_1 K, \dots, X_d K\}$, have service requirements for which $\min\{B_1, \dots, B_d\} = 0$.
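To make the scaling of the sufficient condition concrete, the sketch below evaluates the resulting lower bound on the system capacity. It assumes E[B] = E[X] = 1, so that a replica is non-zero with probability 1/K; the function name and the parameter `mean_min_x` (standing for E[min{X1, . . . , Xd}]) are hypothetical.

```python
def capacity_lower_bound(k: float, d: int, mean_min_x: float = 1.0) -> float:
    """Supremum of arrival rates satisfying the sufficient stability condition.

    The condition lam * (1/K)**d * K * E[min X] < 1 rearranges to
    lam < K**(d - 1) / E[min X].
    """
    return k ** (d - 1) / mean_min_x


# X = 1 (constant): the bound grows as K**(d - 1) and does not involve N.
for d in (1, 2, 3):
    print(d, capacity_lower_bound(k=10.0, d=d))

# If X were exponential with mean 1, then E[min{X_1, ..., X_d}] = 1/d.
print(capacity_lower_bound(k=4.0, d=2, mean_min_x=0.5))
```

Note that the number of servers N does not appear as an argument, in line with the observation that the bound is independent of N.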

Asymptotically necessary stability condition
In this section we shall prove that the sufficient stability condition (2) is in fact also asymptotically nearly necessary. The proof relies on the property that the system is most of the time in synchronicity as K grows large.
In preparation for the proof let us first define a measure for synchronicity. Let the surplus workload, denoted by $\omega^+$, be the sum of the (element-wise) differences between the maximum workload and the workload at server $i$ for $i = 1, \dots, N$, i.e., $\omega^+ = \sum_{i=1}^{N} (\omega_{(1)} - \omega_i)$; see Figure 1 for a visual representation. Note that $\omega^+ = 0$ if and only if the system is in synchronicity.
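As a small illustration of this definition, the surplus workload of a given workload vector can be computed as follows; the function name and the example vectors are hypothetical.

```python
def surplus_workload(workloads):
    """Sum over all servers of (maximum workload - workload at that server)."""
    top = max(workloads)
    return sum(top - w for w in workloads)


print(surplus_workload([3.0, 1.0, 1.5, 2.0]))  # gaps 0 + 2 + 1.5 + 1
print(surplus_workload([2.0, 2.0, 2.0]))       # synchronicity: no surplus
```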
In order to prove that the system is in synchronicity nearly all the time, we introduce an auxiliary system which is the same as our system except for three differences. In the auxiliary system (i) the workload at each server only decreases over time when in synchronicity, (ii) all type-A jobs are allocated to the first d ordered servers and (iii) only specific type-B jobs, so-called type-B1 jobs, are considered and the other type-B jobs are omitted. We define type-B1 jobs as ones for which d − 1 replicas, with at least one replica with service requirement equal to 0, are allocated to the first d − 1 ordered servers and one replica with service requirement $X_d K$ to the N-th ordered server, i.e., the server with the lowest current workload.
Below we comment on the properties of the surplus workload $\tilde\omega^+$ in the auxiliary system.

Property 2
The surplus workload in the auxiliary system $\tilde\omega^+$ experiences downward jumps at the instants of a Poisson process of rate $\frac{(N-d)!}{N!} (1 - p) p^{d-1} \lambda$, which is exactly the arrival rate of type-B1 jobs. The sizes of the downward jumps are equal to $\min\{\tilde\omega_{(1)} - \tilde\omega_{(N)}, X_d K\}$.
Note that the surplus workload in the original system $\omega^+$ experiences downward jumps at a higher rate than $\tilde\omega^+$, since not only type-B1 jobs decrease the surplus workload. Moreover, the sizes of the downward jumps in the surplus workload and in the surplus workload in the auxiliary system can differ, since these depend on the workloads in both systems (which are not necessarily equal).

Property 3
The surplus workload in the auxiliary system $\tilde\omega^+$ experiences upward jumps of size exactly $(N - d) \min\{X_1, \dots, X_d\} K$ at the instants of a Poisson process of rate $(1 - p)^d \lambda$, which is the arrival rate of type-A jobs.
Note that the surplus workload in the original system $\omega^+$ experiences upward jumps of smaller or equal size, since type-A jobs add at most $\min\{X_1, \dots, X_d\} K$ work to the current maximum workload, see Remark 1.
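For a feel of the magnitudes involved, the two jump rates stated in Properties 2 and 3 can be evaluated numerically. The sketch below simply transcribes those two expressions and assumes p = 1 − 1/K (from E[B] = E[X] = 1); the function name and the parameter values are hypothetical.

```python
from math import factorial


def jump_rates(n: int, d: int, k: float, lam: float):
    """Return (upward rate, downward rate) of the auxiliary surplus workload."""
    p = 1.0 - 1.0 / k
    # Property 3: type-A jobs, all d replicas non-zero.
    up = (1.0 - p) ** d * lam
    # Property 2: type-B1 jobs, rate (N-d)!/N! * (1-p) * p**(d-1) * lam.
    down = factorial(n - d) / factorial(n) * (1.0 - p) * p ** (d - 1) * lam
    return up, down


up, down = jump_rates(n=4, d=2, k=10.0, lam=5.0)
print(up, down)
```

The upward rate decays like K^(-d), whereas the downward rate decays only like 1/K for large K, which is consistent with upward jumps becoming rare relative to downward jumps as K grows.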
The number of jumps, denoted by Z, to reach synchronicity in the auxiliary system when only considering downward jumps is equal to the total number of type-B1 jobs that are needed at each server to bridge the difference between the maximum workload and the workload at this server. Thus, the expectation of the number of jumps to reach synchronicity, when only considering type-B1 jobs and starting in the initial workload state $\tilde\omega$, where $\tilde\omega \in S_{trun}$, is where $S_n$ =
For proving an asymptotically necessary stability condition, we first need to prove the following two lemmas. Lemma 2 states that the surplus workload in the auxiliary system stochastically dominates the surplus workload in the original system and Lemma 3 states that the surplus workload in the auxiliary system is a high fraction of the time equal to 0 in the long term as K grows large. Together Lemmas 2 and 3 imply that the original system will also be in synchronicity a high fraction of the time in the long term as K grows large. This in turn implies that almost every arriving job will add $B_{M/G/1} = \min\{B_1, \dots, B_d\}$ to the maximum workload. Observe that this is exactly the upper bound, see Lemma 1, which resulted in the sufficient stability condition.
• When no arrivals occur, by definition of both systems, only the value of $\omega^+(t)$ can decrease over time. Thus, it follows that $\omega_{(1)}$
• In case of an arrival of a type-A job the value of $\tilde\omega^+(t)$ increases with exactly $(N - d) \min\{X_1, \dots, X_d\} K$, whereas the value of $\omega^+(t)$ increases with at most $(N - d) \min\{X_1, \dots, X_d\} K$, see the proof of Lemma 1 and Property 3. Also, note that a type-A job in the auxiliary system is always allocated to the first d ordered servers, instead of d servers sampled uniformly at random. Thus, it follows that $\min\{X_1, \dots, X_d\} K = \tilde\omega_{(1)}$ Combining the latter two inequalities yields $\tilde\omega_{(1)}$
• In case of an arrival of a type-B job, excluding a type-B1 job, only the value of $\omega^+(t)$ can decrease. Thus, it follows that $\omega_{(1)}$ (which is a strict inequality in case of a type-B job that adds workload to server i), whereas $\tilde\omega_{(1)}$
• In case of an arrival of a type-B1 job the value of $\tilde\omega^+(t)$ decreases with $\min\{\tilde\omega_{(1)} - \tilde\omega_{(N)}, X_d K\}$. Observe that the decrement in the value of $\tilde\omega^+(t)$ can be greater than the decrement in the value of $\omega^+(t)$, see Property 2, but only if $\omega_{(1)}(t_2) - \omega_{N^*}(t_2) = 0$, where $N^*$ is the server at time $t_2$ that had the minimum workload at time $t_1$ (which is not necessarily the server with minimum workload at time $t_2$). Therefore, it follows that $\tilde\omega_{(1)}$
We conclude that in all scenarios it still holds that $\tilde\omega_{(1)}$
Lemma 2 implies that the surplus workload in the auxiliary system $\tilde\omega^+(t)$ stochastically dominates the surplus workload $\omega^+(t)$ when starting in the same initial workload state, see Figure 2. Now we prove that the surplus workload in the auxiliary system, $\tilde\omega^+(t)$, is a high fraction of the time equal to 0 in the long term as K grows large.
Proof: First denote $\tau_1 := \inf\{t \ge 0 \mid \tilde\omega^+(t) > 0\}$ as the time that the value of $\tilde\omega^+(t)$ remains equal to 0, when starting in synchronicity. Note that $\tau_1$ is the time until the next upward jump, see Property 3.
Therefore the expectation of $\tau_1$ is given by $E[\tau_1] = 1 / ((1 - p)^d \lambda)$. Denote the time that the workload in the auxiliary system remains in non-synchronicity, i.e., the time that $\tilde\omega^+(t) > 0$ when starting in initial workload state $\tilde\omega(0) = \tilde\omega$, where $\tilde\omega \in S_{trun}$, by $\tau_2 := \inf\{t \ge 0 \mid \tilde\omega^+(t) = 0\}$. Moreover, let $\{Y \mid \tilde\omega(0) = \tilde\omega\}$ denote the number of increments in the value of $\tilde\omega^+(t)$ before reaching synchronicity when starting in $\tilde\omega \in S_{trun}$, then the expectation of $\tau_2$ is with $E[\min\{X_1, \dots, X_d\}] \le E[X] = 1$. The second equality results from Wald's equation, i.e., the equality between the expected time to reach synchronicity (given the number of upward jumps) and the expected number of downward jumps (given the number of upward jumps) multiplied with the expected time between such downward jumps. The inequality in the next step results from the proof of Lemma 1, which implies that the surplus workload increases with at most $(N - d) \min\{X_1, \dots, X_d\} K$ per upward jump, and using the bound on the expected number of downward jumps (given the number of upward jumps), i.e., Equation (3). Together with Wald's equation we can bound the expected time in non-synchronicity, namely . Moreover $m(\tilde\omega^+ / K) \downarrow 0$ as K grows large and by renewal theory (cf. [9]) we know that This completes the proof that the auxiliary surplus workload is at least a fraction $(1 - \epsilon)$ of the time equal to 0 in the long term.

Remark 2
So far it has been assumed that in the initial state the first d ordered workloads are equal, but this assumption is not necessary. One can show via an approach analogous to Lemma 3, but with bound $E[Z] < N (m(\tilde\omega^+ / K) + 1)$, that the expected time to reach synchronicity when starting in an arbitrary initial workload state is still finite. Note that after reaching synchronicity the assumption is valid and that directly after synchronicity $\tilde\omega$
Now we are ready to prove the main theorem of the paper.

Theorem 1
For every $\epsilon > 0$ there exists a $K_\epsilon(d, N)$ such that for all $K > K_\epsilon(d, N)$ a necessary stability condition for independent scaled Bernoulli service requirements is
$$\lambda < \frac{K^{d-1}}{(1 - \epsilon) \, E[\min\{X_1, \dots, X_d\}]}.$$
Proof: From Lemma 2 we know that $\tilde\omega^+(t)$ stochastically dominates $\omega^+(t)$ and Lemma 3 states that for every $\epsilon > 0$ there exists a $K_\epsilon(d, N)$ such that for all $K > K_\epsilon(d, N)$ the value of $\tilde\omega^+(t)$ is at least a fraction $(1 - \epsilon)$ of the time equal to 0 in the long term. Hence this latter statement also holds for the value of $\omega^+(t)$. Moreover, by definition, if $\omega^+(t) = 0$ then the system is in synchronicity. In synchronicity, type-A jobs add exactly $\min\{X_1, \dots, X_d\} K$ work to the sampled servers. We conclude that, independent of the behavior in non-synchronicity, in the long term at least a fraction $(1 - \epsilon)$ of the type-A jobs adds exactly $\min\{X_1, \dots, X_d\} K$ work to the current maximum workload. Thus, for the system to be stable it should at least be able to handle these latter type-A jobs.

Remark 3
The expected time in non-synchronicity depends on the renewal function m(t), see Lemma 3. This function in turn depends on the distribution of the X component in the service requirement distribution B. For some distributions an explicit expression for m(t) is known (cf. [9]):

Numerical results
In Section 3 it is proven that the system, for scaled Bernoulli distributed service requirements and K large enough, is a high fraction of the time in synchronicity in the long term. In this section we will use simulation to quantify this statement for various values of N and K, where d = 2 is fixed. In Figure 3 the long-term fraction of time in synchronicity is depicted as a function of K for various values of N, where we allow λ to depend on K and write λ(K) to reflect that. It can be seen that the system with N = d is always in synchronicity, which follows from Property 1. Moreover, the long-term fraction of time in synchronicity is higher for lower values of $\lambda(K)/K$. The reason is that the empty state is included in the definition of synchronicity. Another observation is that for fixed λ(K) and K, increasing N decreases the long-term fraction of time in synchronicity. This is related to the fact that $K_\epsilon(d, N)$ defined in Theorem 1 depends on N.
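An experiment of this kind could be set up along the following lines. This is a minimal sketch and not the simulation code behind Figure 3: it assumes X ≡ 1 and p = 1 − 1/K, starts from the empty system, and the function name and parameter values are hypothetical.

```python
import random


def sync_fraction(n, d, k, lam, arrivals=200_000, seed=0):
    """Estimate the long-term fraction of time with all workloads equal."""
    rng = random.Random(seed)
    p = 1.0 - 1.0 / k
    w = [0.0] * n
    total = in_sync = 0.0
    for _ in range(arrivals):
        elapsed = rng.expovariate(lam)
        top = max(w)
        if min(w) == top:
            in_sync += elapsed                  # equal workloads stay equal while draining
        else:
            in_sync += max(0.0, elapsed - top)  # synchronous again once all servers are empty
        total += elapsed
        w = [max(0.0, x - elapsed) for x in w]
        servers = rng.sample(range(n), d)
        b = [k if rng.random() >= p else 0.0 for _ in range(d)]
        finish = min(w[s] + bj for s, bj in zip(servers, b))
        for s in servers:
            w[s] = max(finish, w[s])
    return in_sync / total


for n in (2, 3, 5):
    print(n, sync_fraction(n=n, d=2, k=20.0, lam=10.0, arrivals=50_000))
```

For N = d the estimate equals 1 by construction, in line with Property 1; for N > d the values depend on the seed and run length and should be read as rough estimates only.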
In Lemma 1 we proved that the maximum workload is bounded by the workload in a corresponding M/G/1 queue. From this lemma it follows that for independent scaled Bernoulli service requirements, the maximum workload is bounded by the workload in a corresponding M/G/1 queue with arrival rate $\lambda_{M/G/1}(K) = (1 - p)^d \lambda$ and service requirement $B_{M/G/1}(K) = \min\{X_1, \dots, X_d\} K$ since all arrivals, other than the arrivals of type-A jobs, have service requirements for which $\min\{B_1, \dots, B_d\} = 0$. This bound can be used to find an upper bound on the expected waiting time, denoted here by $E[W]$, since an arriving job needs to wait at most for the current maximum workload, which is bounded by the workload $V_{M/G/1}$ in the corresponding M/G/1 queue. From the Pollaczek-Khinchine formula for the M/G/1 queue it follows that
$$E[W] \le E[V_{M/G/1}] = \frac{\lambda_{M/G/1}(K) \, E[B_{M/G/1}(K)^2]}{2 \left(1 - \lambda_{M/G/1}(K) \, E[B_{M/G/1}(K)]\right)}.$$
Note that this bound is tight for N = d since the system behaves exactly as the corresponding M/G/1 queue and asymptotically tight in K for N > d. For X ≡ 1 constant and d = 2 we get $E[W] \le \frac{\lambda}{2 (1 - \lambda / K)}$, which is linear in K if we assume that $\lambda / K$ is fixed. Notice that this upper bound does not depend on the number of servers. Figure 4 shows the expected waiting time as a function of K for various values of N; again we allow λ to depend on K and write λ(K) to reflect that. When comparing both figures we can conclude that for finite K the number of servers N influences the expected waiting time more than the value of the fraction $\lambda(K)/K$. Moreover, for N large, also a larger K is needed for the upper bound to be accurate.
Observe that with the upper bound for the expected waiting time we also have an upper bound for the expected latency. In Figure 5 the difference between the expected latency and the expected waiting time is depicted as function of K for various values of N. Indeed, it can be seen that this difference vanishes as K grows large.
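The waiting-time bound can be evaluated with the standard Pollaczek-Khinchine expression for the mean workload of an M/G/1 queue. The sketch below assumes p = 1 − 1/K, so that type-A jobs arrive at rate λ/K^d; the function name and the moment parameters `mean_min_x` and `second_moment_min_x` (for the first and second moment of min{X1, . . . , Xd}) are hypothetical.

```python
def waiting_time_bound(lam, k, d, mean_min_x=1.0, second_moment_min_x=1.0):
    """Mean workload of the bounding M/G/1 queue (upper bound on mean waiting time)."""
    rate = lam / k ** d                    # arrival rate of type-A jobs
    rho = rate * k * mean_min_x            # load of the bounding M/G/1 queue
    if rho >= 1.0:
        raise ValueError("sufficient stability condition is violated")
    # Pollaczek-Khinchine: E[V] = rate * E[B^2] / (2 * (1 - rho)), with B = K * min X.
    return rate * k ** 2 * second_moment_min_x / (2.0 * (1.0 - rho))


# X = 1 and d = 2: the bound reduces to lam / (2 * (1 - lam / K)).
lam, k = 5.0, 10.0
print(waiting_time_bound(lam, k, d=2), lam / (2.0 * (1.0 - lam / k)))

# Keeping lam / K fixed at 0.5, the bound scales linearly in K.
for k in (10.0, 20.0, 40.0):
    print(k, waiting_time_bound(0.5 * k, k, d=2))
```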

Conclusion
In this paper we have proven that the maximum workload in a parallel-server system with c.o.c. redundancy is upper bounded by the workload in a related M/G/1 queue. This directly yields a sufficient stability condition. Moreover, we proved that in the case of independent scaled Bernoulli service requirements the system is asymptotically a high fraction of the time in so-called synchronicity in the long term. In synchronicity the upper bound of the related M/G/1 queue is in fact tight, and this resulted in an asymptotically necessary stability condition. Interestingly, both the sufficient and the asymptotically nearly necessary condition are independent of the number of servers, but do depend on the number of replicas d. In contrast, in the case of exponentially distributed service requirements the stability condition depends linearly on the number of servers and not on the number of replicas. This indicates that the stability condition in a c.o.c. redundancy system with i.i.d. service requirements is highly sensitive to the distribution of these service requirements.
The bound on the maximum workload also resulted in an upper bound for the expected waiting time, which is again (asymptotically) tight (as the scale of the service requirement grows large). This bound directly resulted in an upper bound for the expected latency.
We assumed that jobs arrive according to a Poisson process, but it might be possible to relax this assumption. In particular, the proof of the sufficient stability condition does not rely on Poisson arrivals and could be extended to a general arrival process. The extension of the proof of the asymptotically necessary condition is more involved. Another interesting topic for further research is to extend the developed framework to obtain the stability condition for more general service requirements.