Controlling variability in split–merge systems and its impact on performance

We consider split–merge systems with heterogeneous subtask service times and limited output buffer space in which to hold completed but as yet unmerged subtasks. An important practical problem in such systems is to limit utilisation of the output buffer. This can be achieved by judiciously delaying the processing of subtasks in order to cluster subtask completion times. In this paper we present a methodology to find those deterministic subtask processing delays which minimise any given percentile of the difference in times of appearance of the first and the last subtasks in the output buffer. Technically this is achieved in three main steps: firstly, we define an expression for the distribution of the range of samples drawn from n\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document} independent heterogeneous service time distributions. This is a generalisation of the well-known order statistic result for the distribution of the range of n\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document} samples taken from the same distribution. Secondly, we extend our model to incorporate deterministic delays applied to the processing of subtasks. Finally, we present an optimisation scheme to find that vector of delays which minimises a given percentile of the range of arrival times of subtasks in the output buffer. We show the impact of applying the optimal delays on system stability and task response time. Two case studies illustrate the applicability of our approach.


Introduction
Numerous physical systems of practical interest feature a queue of incoming tasks which split into synchronised subtasks that are processed in parallel at a set of (potentially heterogeneous) servers.Subtasks that complete service are held in an output buffer until all of its siblings have completed service.Examples of such systems include the processing of logical I/O requests by a RAID enclosure housing several physical disk drives (Lebrecht et al. 2011), parallel job processing in MapReduce environments comprising several compute nodes (Zaharia 2010), and the assembly of customer orders made up of multiple items in the highly-automated warehouses of large online retailers (Serfozo 2009).
The importance of performance prediction in such systems has been long appreciated by performance modellers who have devised abstractions for their representation, most notably split-merge queueing systems and their less synchronised-but analytically much less tractable-counterparts, fork-join queueing systems (Bolch 2006).
Understandably, for both kinds of model, the primary focus of research work to date has been on the computation of the stationary distribution of the number of subtasks queued at each server and on moments of task response time, most especially the mean.Flatto et al. (Flatto 1985;Flatto and Hahn 1984) derive exact analytical solutions for the stationary distribution of the number of subtasks in each queue in a two-node fork-join queueing systems with exponential task arrivals and heterogeneous exponential service time distributions.Heidelberger and Trivedi (1983) develop two highly accurate approximation methods for mean queue length and mean response time prediction in closed parallel queueing systems based on M/M/1-queues where primary tasks fork into two or more secondary subtasks.For fork-join systems with homogeneous exponential service time distributions, Nelson and Tantawi (1988) describe a technique which yields approximate lower and upper bounds on mean task response time as a function of the number of servers.Kim and Agrawala (1989) derive an approach which approximates mean task response time and state-occupancy probabilities in multi-server fork-join systems with Erlang service time distributions.Baccelli et al. (1989Baccelli et al. ( , 1989) ) derive bounds on various transient and steady-state performance measures for (predominantly homogeneous) fork-join queueing systems by stochastically comparing a given system with constructed queueing systems with simpler structure but identical stability characteristics.Towsley et al. (1990) develop mathematical models for mean task response time of fork-join parallel programs executing on a shared memory multiprocessor under Poisson tasks arrivals and two different scheduling policies.Under the task scheduling policy, the authors derive lower and upper bounds for mean task response time, while under the job scheduling policy, standard birth-death theory leads to an exact expression for mean task response time.Varma and Makowski (1994) use interpolation between light and heavy traffic modes to approximate the mean response time for a homogeneous fork-join system of M/M/1 queues.The same fork-join system was considered in Lebrecht and Knottenbelt (2007), where a maximum order statistic provides an easily-computable upper bound on response time.Varki et al. (unpublished) present bounds on mean response time in a forkjoin queueing system with Poisson arrivals and exponential subtask service time distributions.Harrison and Zertal (2007) present an approximation for moments of the maximum of response times in a split-merge queueing system with Poisson task arrivals and general heterogeneous subtask service times; this gives an exact result in the case of exponential subtask service time distributions.Another recent approach by Sun and Peterson (2012) presented in the context of parallel program execution time analysis-but with ready application to the analysis of split-merge systems-approximates the expectation of the maximum value of a set of random variables drawn from certain distribution classes by solving for the domain value at which the inverse cdf of the maximum order statistic is equal to a constant (0.570376002).
By contrast, the focus of the present paper is not response time computation; rather it concerns ways to control the variability of subtask completion time (that is the difference in time between the arrival of the first and last subtasks of a task in the output buffer) in split-merge systems.The idea is to try to cluster the arrival of subtasks in the output buffer by applying judiciously-chosen deterministic delays to subtasks before they are dispatched to the parallel servers.This has especial relevance for systems that involve the retrieval of orders comprising multiple items from automated warehouses (Serfozo 2009), since partially completed subtasks must be held in a physical buffer space that is often limited and highly utilised; consequently it is difficult to manage.Despite this, to the best of our knowledge, this problem has not received significant attention in the literature.Our previous work (Tsimashenka and Knottenbelt 2011) presented a simple mean-based methodology for computing the vector of deterministic subtask delays that minimises a cost function given by the difference between the expected maximum and expected minimum subtask completion times (across all subtasks arising from a particular task).However, an expected value does not always satisfy service level objectives; in addition there is a dependence between the maximum and minimum subtask completion times which must be taken into account for any distributional analysis.The methodology we present here yields the set of subtask delays which minimises any given percentile of the distribution of the difference in the time of appearance of the first and last subtasks in the output buffer.
The present paper is an extended version of our work presented in Tsimashenka et al. (2012).The technical contribution of this work begins with a generalisation of the wellknown order statistics result for the distribution of the range when n samples are taken from a given distribution FðtÞ.In particular, we obtain the distribution of the range of n samples taken from heterogeneous distributions F i ðtÞ (i ¼ 1; . ..; n).Having extended this theory to incorporate deterministic subtask processing delays, we show how an optimisation procedure can be applied to a split-merge system to find that vector of subtask delays which minimises a given percentile of the range of subtask completion times.Our further contributions over Tsimashenka et al. (2012) include: (a) quantification of the adverse impact of minimising subtask variability on system stability and expected task sojourn time in the system, (b) lowering of the computational complexity of our optimisation procedure through the use of Brent's method rather than the Bisection method, (c) implementing a split-merge system simulator and performing mutual validation between the analysis and simulation, and (d) an additional case study.
The rest of the paper is organised as follows.Section 2 describes essential preliminaries including a definition of split-merge systems and selected results from the theory of order statistics.Section 3 presents various heterogeneous order statistic results, including the distribution of the range.Section 4 shows how the basic split-merge model can be enhanced to support deterministic delays, defines an appropriate objective function, and presents a related optimisation procedure.Section 5 describes the impact of applying subtask delays on system stability and expected task response time.Section 6 presents two case studies which demonstrate the applicability of our work.The first case study considers a split-merge system with just three parallel servers, which allows for convenient visual representation of the optimisation landscape.The second case study considers a larger system of eight parallel servers.Section 7 concludes and considers avenues for future work.

Parallel systems
A split-merge system (see Fig. 1) is a composition of a queue of waiting tasks (assumed to arrive according to a Poisson process with mean rate k), a split point, several heterogeneous servers (which serve their allocated subtask with general service time distribution with mean service rate 1=l i ), buffers for completed subtasks (merge buffers) and a merge point (Bolch 2006).We note that in practice in physical systems it is not uncommon for the merge buffers to share the same physical space which is managed as a single logical output buffer.When the queue of waiting tasks is not empty and the parallel servers are idle, a task is injected into the system from the head of the queue.The task is split into n subtasks at the split point and the subtasks arrive simultaneously at the n parallel servers to receive service.Completed subtasks join a merge buffer.Only after all subtasks (belonging to a particular task) are present in the merge buffers does the original task depart the system via the merge point.We note that this split-merge system is a more synchronised type of fork-join system.In split-merge systems parallel servers are blocked after they have served a subtask while the original task is in the system, whereas in fork-join systems there is no queue of waiting tasks, but there is a queue of subtasks at each parallel server.Task response time in a splitmerge system yields an upper bound on task response time in a fork-join system having the same set of parallel servers and task arrival rate (Lebrecht and Knottenbelt 2007).
2.2 Theory of order statistics (David 1980) Definition: Let the increasing sequence X ð1Þ ; X ð2Þ ; . ..; X ðnÞ be a permutation of the real valued random variables X 1 ; X 2 ; . ..; X n , i.e. the X i arranged in ascending order X ð1Þ 6 X ð2Þ 6 . . .6 X ðnÞ .Then X ðiÞ is called the ith order statistic, for i ¼ 1; 2; . ..; n.The first and last order statistics, X ð1Þ and X ðnÞ , are the minimum and maximum respectively, which are also called the extremes.T ¼ X ðnÞ À X ð1Þ is the range.
We assume initially that the random variables X i are identically distributed as well as independent (iid), but of course the X ðiÞ are dependent because of the ordering.

Distribution of the kth-order statistic (iid case)
If X 1 ; X 2 ; . ..; X n are n independent random variables, the cumulative distribution function (cdf) of the maximum order statistic (the maximum) is simply given by Likewise, the cdf of the minimum order statistic is: These are special cases of the general cdf of the rth order statistic, F r ðxÞ, which can be expressed as: The pdf of X r ; f r ðxÞ ¼ F 0 r ðxÞ, where the prime denotes the derivative with respect to x, when it exists, is then: Multiplying both sides by ''small'' , this result follows intuitively from noting that we require one of the X i to take a value in the interval ðx; x þ , exactly r À 1 of the X i to be less than or equal to x and exactly n À r of them to be greater than x.The coefficient n!=ððr À 1Þ!1!ðn À rÞ!Þ is the number of ways of doing this, given that the X i are stochastically indistinguishable.
The joint density function of the rth and sth order statistics X ðrÞ ; X ðsÞ , where for 1 6 r\s 6 n and x y, is: where S rs ¼ n! ðrÀ1Þ!ðsÀrÀ1Þ!ðnÀsÞ!, by similar reasoning.The corresponding joint cdf F rs ðx; yÞ of X ðrÞ and X ðsÞ may be obtained by integration of the pdf or, alternatively, following the same conditions, we have: Finally, the joint pdf for the k order statistics X ðn1Þ ; . ..; X ðnkÞ ; 1 6 n 1 \. ..\n k 6 n, is similarly, for x 1 6 . . .6 x k : where S n1;...;nk ¼ n! ðn1À1Þ!ðn2Àn1À1Þ!...ðnkÀnkÀ1À1Þ!ðnÀnkÞ! .

Distribution of the range
The pdf f Trs ðxÞ of the interval T rs ¼ X ðsÞ À X ðrÞ follows from the joint pdf of the rth and sth order statistics in Eq. 2 by setting y ¼ x þ t rs and integrating over x, giving: In the special case when r ¼ 1 and s ¼ n; T rs is the range T ¼ X ðnÞ À X ð1Þ and the pdf simplifies to: The cdf of T then follows by integrating inside the integral with respect to x, giving: As noted in David (1980), this equation follows intuitively by noting that the integrand (multiplied by an infinitesimal quantity dx) is the probability that X i falls into the interval ðx; x þ dx (for some i) and the remaining n À 1 of the X j ; j 6 ¼ i fall into ðx; x þ t.There are n ways of choosing i, giving the factor n.

Heterogeneous order statistics
We now consider n independent, real-valued random variables X 1 ; . ..; X n where each X i has an arbitrary probability distribution F i ðxÞ and probability density function f i ðxÞ ¼ F 0 i ðxÞ.In this case of ''heterogeneous'' (or independent, but not necessarily identically distributed) random variables, we call the order statistics heterogeneous order statistics to distinguish them from the better known results where the random variables are implicitly assumed to be identically distributed.
Recent decades have seen increasing consideration given to the heterogeneous case in the literature.Key theoretical results for the distribution and density functions of heterogeneous order statistics are summarised in David and Nagaraja (2005).This includes the work of Sen (1970), who derived bounds on the median and the tails of the distribution of heterogeneous order statistics.Practical issues related to the numerical computation of the ith heterogeneous order statistic are considered in Cao and West (1997), with special consideration of recurrence relations among distribution functions of order statistics.

Distribution of the rth heterogeneous order statistic
The rth heterogeneous order statistic, derived similarly to Eq. 1, has the following cdf: where P i is the set of all two-set partitions fD; Eg of f1; 2; . ..; ng with jDj ¼ i and jEj ¼ n À i, and ' hk is the kth component of the vector ' h for h ¼ 1; 2.
Similarly to the homogeneous case, the minimum and maximum order statistics are respectively given by: and Differentiating Eq. 4 and simplifying yields the pdf: where I is the indicator function and P hÀ i is the set of all 2-set partitions of {1, 2,…,n}\{h} with i elements in the first set and 1 h n.In fact this result also follows from an intuitive argument using the infinitesimal interval ðx; x þ , as in the homogeneous case.
The joint density function f rs ðx; yÞ of two order statistics, X ðrÞ and X ðsÞ , for 1 6 r\s 6 n and x y, follows similarly as: where P h1À;h2À i1;i2 is the set of all 3-set partitions of {1, 2,…,n}\{h 1 , h 2 } with i 1 elements in the first set, i 2 elements in the second set, and so n À i 1 À i 2 À 2 in the third, and 1 h 1 6 ¼ h 2 n.

Distribution of the range for heterogeneous order statistics
From the joint pdf of two heterogeneous order statistics in Eq. 5, we obtain the pdf of the interval T rs ¼ X ðrÞ À X ðsÞ by setting t rs ¼ y À x: For the range, we want the special case in which r ¼ 1; s ¼ n and T ¼ X ðnÞ À X ð1Þ , giving the pdf: The cdf now follows by integration (inside the sum and integral with respect to x): In fact, the same result can be obtained by noting that Eq. 3 generalises using the argument given immediately following it.This is that, given a particular choice of i ¼ 1; 2; . ..; n, the integrand (multiplied by an infinitesimal quantity dx) is the probability that X i falls into the interval ðx; x þ dx and the other X j ; j 6 ¼ i fall into ðx; x þ t.Of course there are n ways of choosing i, and so we have to sum over n terms; in the homogeneous case, all these terms are the same, which gave the factor n.For heterogeneous order statistics, we therefore obtain: This is a useful result, which requires a sum of only n terms.It can be directly applied to determine the distribution of the variability of subtask completion times in any split-merge system with heterogeneous service time distributions, and will form the basis for rangeoptimisation as considered in the next section.
4 Controlling variability in split-merge systems

Introducing subtask delays
Our aim is to control the variability of subtask completion (equivalently output buffer arrival) times by introducing a vector of delays: Here, d i denotes the deterministic delay that will be applied before a subtask is sent to server i for processing.From X i $ F i ðtÞ we define the random variables of parallel service time with applied delays as The order statistics of parallel service time with applied delays are X d ð1Þ ; X d ð2Þ ; . ..; X d ðnÞ .After applying the delays from Eq. 10, the distribution of the range from Eq. 9 becomes: We assume that, 8i; f i ðt À d i Þ ¼ 0; 8t\d i .Similarly, 8j; F j ðt À d j Þ ¼ 0, 8t\d j .We note that in order to avoid unnecessarily delaying all subtasks we require that the subtask delay for at least one server (the ''bottleneck'' server) be set to 0.

Optimisation procedure
In this section we move away from our previous mean-based technique (Tsimashenka and Knottenbelt 2011) towards a more sophisticated framework for finding delay vectors which provide soft (probabilistic) guarantees on variability.More specifically, for a given probability a, a 2 ð0; 1 we aim to minimise the 100ath percentile of variability with respect to d.That is, we aim to solve for d in: Put another way, we aim to find that vector d a which yields the lowest value for the 100ath percentile of the difference in the completion times of the first and the last subtasks (belonging to each task).
Practically, we developed a numerical optimisation procedure by prototyping it in Mathematica and subsequently implementing a full version of it in C?? for efficiency reasons.Evaluation of Eq. 11 for a given a and d is performed by means of numerical integration using the trapezoidal rule.While this is adequate for almost all continuous service time density functions, complications arise in the case of the pdf of deterministic service time density functions because of its infinitely thin, infinitely high impulse.We resolve this by replacing the deterministic pdf with delay parameter a by the Gaussian approximation: which becomes exact as c ! 0; in practice we set c ¼ 0:01.In order to invert Eq. 11 for a given a and d, we have applied the Bisection method (Burden and Burden 2006) in our previous work (Tsimashenka et al. 2012) because of its excellent robustness characteristics, but it might be not computationally optimal for all classes of functions, particularly convex functions.In the present work, we apply Brent's method (1971Brent's method ( , 2002) ) which combines bisection, secant and inverse quadratic interpolation.On each iteration the proposed algorithm chooses which method to use based on (a) the method used for the previous iterate and (b) the trends observed in the most recent iterates.For ''well-behaved'' functions, this has the potential to deliver a considerably higher convergence rate while guaranteeing robustness (Brent 1971).
Finally, we explore the optimisation surface of F À1 range ða; dÞ with the initial d ¼ f0; . ..; 0g using a numerical optimisation procedure.We constrain the search such that d i !0 for all i and Q i d i ¼ 0 (that is, the ''bottleneck'' server(s) should have no unnecessary additional delay).In our implementation, we have used a simple Nelder-Mead optimisation technique (Nelder and Mead 1965), which is based on the simplex method.We note that a range of more sophisticated (and correspondingly considerably more complex to implement) gradient-free optimisation techniques are also available e.g.Ali and Gabere (2010), Lewis et al. (2007).
5 Impact of applied delays on system capacity and performance In Tsimashenka et al. (2012) our concern was solely the minimisation of a given percentile of subtask dispersion time by introducing delays at the parallel servers.However, the introduction of such delays has negative implications for system stability and task response time, as quantified below.

Impact on system stability
We observe first that, at a high-level, any given split-merge system with a set of parallel servers is conceptually equivalent to a single M/G/1 queue whose service time is equal to the maximum of the service times of the replaced parallel servers, i.e. maxfX 1 ; X 2 ; :::; X n g ¼ X ðnÞ .This is of course applicable for a split-merge system with delays as well, i.e. maxfX d 1 ; X d 2 ; :::; X d n g ¼ X d ðnÞ .Secondly, due to the fact that cdfs are monotonically increasing functions: F i ðxÞ dx: Thus: which proves the intuition that applying any delay to any subtask maintains or increases the mean task service time.A similar argument can be used to show that the same property applies to any percentile of task service time.
Exploiting the well-known stability condition for an M/G/1 queue, and denoting the maximum arrival rate at which the system with and without delays remains stable as k d max and k max respectively, we have: Thus, adding any delay to any subtask adversely impacts the maximum arrival rate at which the system remains stable.

Impact on response time
In order to quantify the impact of subtask delays on mean task response time we use the following metric: where E½R d¼d a ;k and E½R d¼0;k correspond for a given arrival rate k to mean task response time with optimal delays and without any delays respectively.Response Penalty corresponds to the percentage by which task response time increases after application of optimal delays.The expected task response time in a split-merge system is conceptually equivalent to the expected task response time in a single M/G/1 queue, whose service time is given by the maximum of the service times of the parallel servers.Consequently, we utilise the Pollaczek-Khinchine formula for mean task response time in an M/G/1 queue: where k is arrival rate, 1=l is the mean service time (with l ¼ E½X ðnÞ ), and Var½X ðnÞ is the variance of service time.The latter can be calculated as: Straightforward modification of the above formulae yields the expected task response time with applied delays (E½R d¼da;k ).

Numerical results
Below we consider two case studies, which are analysed using a combination of numerical results from the methodology above and simulation results, with mutual validation as appropriate.The simulator is written in C?? and performs an event-driven simulation of a split-merge queueing system using component classes for task queues, split/merge points, parallel servers, merge/output buffers and links.For split-merge systems with up to 10 parallel servers, the simulator processes approximately 300,000 tasks per second on a 3.5GHz Intel Core-i5 workstation with 8GB RAM, and permits computation and validation of various quantities, both in scenarios with and without application of optimal delays, including (a) distribution of the range (cf.Eq. 11) of subtask processing times, (b) expected task response time (cf.Eq. 15) and (c) mean output buffer population.In the following, each simulation is replicated three times.Each replication starts with a warm-up phase involving the processing of 1,000,000 tasks; this is followed by a measurement phase involving the processing of 10,000,000 tasks.Because of the very large number of processed tasks, the resulting confidence intervals are extremely narrow and are consequently not reported.Results, whether derived from numerical analysis or simulation, are reported to three significant figures.

Case study 1
Consider a split-merge system with task arrival rate k ¼ 0:1 (tasks/time unit) and three parallel servers having heterogeneous service time density functions: Without adding any extra delays, it is straightforward to apply Eq. 9 in a root-finding algorithm (e.g.Brent's method) to compute the 50th (a ¼ 0:5) and 90th (a ¼ 0:9) percentiles of the range of subtask completion times as t 0:5 ¼ 3:62 and t 0:9 ¼ 5:21 time units respectively.Simulations of the split-merge system show that the mean output buffer population is 0:560 subtasks.
Incorporating delays into the distribution of the range of subtask output buffer arrival times as per Eq.11, and executing a Nelder-Mead optimisation (suitably constrained so that Q i d i ¼ 0) to solve Eq. 12 given a ¼ 0:5 yields as shown in Fig. 2. The improved optimisation procedure's run time (using Brent's method) is 13 s in comparison with 46 s for the previous optimisation procedure from Tsimashenka et al. (2012).We note that in this case the ''bottleneck'' server is server 3.With the incorporation of the optimal delays, the 50th percentile of the range of subtask arrival times becomes t 0:5 ¼ 1:29 time units, representing an improvement of 64 % over the original system configuration without delays.Mean output buffer population is 0:372, a 33 % reduction.For a ¼ 0:9 we obtain d 0:9 ¼ ð0; 2:82; 1:16Þ as shown in Fig. 3.We note that for this percentile the ''bottleneck'' switches from server 3 to server 1, despite the fact that server 3 has a higher mean service time than server 1.With the incorporation of the optimal delays, the 90th percentile of the range of subtask arrival times becomes t 0:9 ¼ 3:35 time units, representing an improvement of 36 % over the original system configuration without delays.Mean output buffer population is 0:392 subtasks, a 29 % reduction.
Figure 4 shows how the distribution of the range of subtask output buffer arrival times changes according to the value of a.We note that a change of a can have a significant impact on the quantiles of Frangeðt; dÞ, and may result in the shifting of the ''bottleneck'' server.
Without delays, the expected task response time is E½R d¼0;0:1 ¼ 8:93 time units.After introducing optimal subtask delays, the expected task response time becomes E½R d¼d 0:5 ;0:1 ¼ 11:7 time units for a ¼ 0:5 and E½R d¼d 0:9 ;0:1 ¼ 12:8 time units for a ¼ 0:9.The percentage increases in expected task response time from Eq. 14 are 31 and 43 % respectively.Fig. 5 presents the corresponding distributions of task response time.For the system without delays the maximum sustainable arrival rate from Eq. 13 is k max \0:182 tasks/time unit.After introducing subtask delays to minimise the 50th and 90th percentiles of the range of subtask completion times, this drops to 0:160 and 0:153 tasks/time unit respectively.This is illustrated in Fig. 6.We consider a split-merge system with arrival rate k ¼ 0:1 tasks/time unit and eight service nodes with following heterogeneous service time distributions:  Applying Eq. 9 in a root-finding algorithm, we compute the 50th (a ¼ 0:5) and 90th (a ¼ 0:9) percentiles of the range of subtask completion times as t 0:5 ¼ 5:43 time units and t 0:9 ¼ 8:26 time units respectively.Simulations show that the mean output buffer population for the system is 3:20 subtasks.Applying our methodology for determining optimal subtasks delays under a ¼ 0:5 yields: d 0:5 ¼ ð3:84; 4:13; 2:32; 0; 4:71; 3:55; 4:88; 0:279Þ For this case the improved optimisation procedure (using Brent's method) takes 2 min 5 s, whereas the procedure from Tsimashenka et al. (2012) takes 29 min.We see the ''bottleneck'' server is server 4. With the incorporation of the optimal delays, the 50th percentile of the range of subtask arrival times becomes t 0:5 ¼ 2:02 time units, representing an improvement of 63 % over the original system configuration without delays.Mean output buffer population is 1:28 subtasks, a 60 % reduction.
Figure 7 shows how the distribution of the range of subtask output buffer arrival times changes according to the value of a.
Without delays, the expected task response time is E½R d¼0;k ¼ 10:3 time units.After introducing optimal subtask delays, the expected task response time becomes E½R d0:5;k ¼ 12:4 time units for a ¼ 0:5 and E½R d 0:9 ;k ¼ 19:2 time units for a ¼ 0:9.The percentage increases in expected task response time are 20 and 86 % respectively.Figure 8 presents the corresponding distributions of task response time.For the system without delays the maximum sustainable arrival rate from Eq. 13 is k max \0:171 tasks/time unit.After introducing subtask delays to minimise 50th and 90th percentile of the range of subtask completion times this drops to 0.156 and 0.133 tasks/time unit respectively.This is illustrated in Fig. 9.This paper has presented a methodology for controlling variability in split-merge systems.
Here variability is defined in terms of a given percentile of the range of arrival times of subtasks in the output buffer, and is controlled through the application of judiciously chosen deterministic delays to subtask service times.The methodology has three main building blocks.The first is an exact analytical expression for the distribution of the range of subtask output buffer arrival times over n heterogeneous servers in a split-merge system.This is a natural generalisation of the well-known order statistics result for the distribution of the range taken over n homogeneous servers.The second is the introduction of deterministic subtask delays into the aforementioned expression.The third is an optimisation procedure which yields the vector of subtask delays which minimises a given percentile of the range of subtask output buffer arrival times.We also quantified the impact of our optimisation on the system stability and mean task response time.We presented two case studies which showed that the choice of percentile can have a significant impact on the optimal delay vector and the ''bottleneck'' server.
As previously mentioned fork-join systems are significantly less analytically tractable than split-merge systems.However, they are more realistic abstractions of many real world systems on account of their less-constrained task synchronisation.Consequently a natural future direction of this work is to try and generalise our results to fork-join systems.In line with previous research we believe we are unlikely to find an exact analytical expression for the distribution of the range of join buffer arrival times.However, a numerical approach and/or an analytical approximation may be possible.
Finally, the scalability of our methodology to very large split-merge systems with 100? service nodes is currently an open question.However, large-scale problems are sometimes encountered when modelling real-life systems.Consequently we will conduct experiments to assess the scaling behaviour of our methodology.It may be beneficial to devise an approach that makes use of parallel computations using for example Message Passing Interface (MPI).

Ann Oper Res
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Fig. 7 Fig. 8
Fig. 7 Distributions of the range of subtask output buffer arrival times without any delays (red line), with delays optimised under a ¼ 0:5 (green line) and delays optimised under a ¼ 0:9 (blue line).(Color figure online)

Fig. 9
Fig. 9 Expected response time of split-merge system for various customer arrival rates without any delays (red line), with delays optimised under a ¼ 0:5 (green line) and delays optimised under a ¼ 0:9 (blue line).(Color figure online)