Abstract
Parallel execution of computations is essential to achieve the performance required by the big data processing algorithms applied in several domains, such as ML, AI, neural networks, and approximate computing. In this chapter the use of Fork and Join stations, and of their respective generation and synchronization policies, is described [12]. Three elementary models implementing different synchronization policies are analyzed. The first synchronizes the executions of n parallel tasks with similar characteristics, the second investigates the impact of the variability of the parallel task execution times on the synchronization times, and the third synchronizes on the fastest task. In spite of their simplicity, these three models can be combined with other components to build very complex models.
5.1 Synchronization of All Parallel Tasks
tags: open, single class, Source/Fork/Queue/Join/Sink, JSIMg.
5.1.1 Problem Description
The focus of this problem is on the use of Fork and Join stations for parallel computing and synchronization. The model is open and the workload consists of a single class of jobs.
We will consider the problem of modeling the execution of a job that at some point (i.e., at the Fork station) splits into several parts, referred to as tasks, that will be executed in parallel. The tasks may be instances of the same code processing different data sets, or parts of the code performing different computations. Each task can follow a different path through the resources between the Fork and the Join. When all tasks complete their executions, they are merged in the Join station and then, according to the synchronization policy, the job that generated them can continue its execution. This type of behavior is typical of many current applications, such as Map/Reduce, that alternate phases in which various instances of the code are generated and executed in parallel with phases that require their synchronization.
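The fork-and-join behavior described above can be sketched in ordinary code. The following minimal illustration uses Python's standard concurrent.futures module; the `process` function and the data are hypothetical, since the pattern, not the workload, is the point:

```python
from concurrent.futures import ThreadPoolExecutor

def process(chunk):
    # each task is an instance of the same code applied to a different data set
    return sum(chunk)

# hypothetical job, split by the "Fork" into four parts
data = [[1, 2], [3, 4], [5, 6], [7, 8]]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Fork: the four tasks start their parallel execution
    futures = [pool.submit(process, chunk) for chunk in data]
    # Join: result() blocks until each task has completed,
    # so the job continues only after all four are done
    results = [f.result() for f in futures]

total = sum(results)  # the job resumes here, after the synchronization
```

This corresponds to the Standard Join strategy of the model: the job leaves the Join only when all four tasks have terminated.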
5.1.2 Model Implementation
We use a Fork station that, when a job arrives, generates four identical tasks that will be executed in parallel, and a Join station to synchronize their executions. When all the executions are completed, the Join releases, i.e., fires, the job. The layout of the model is shown in Fig. 5.1. A Source station generates the flow of jobs with exponentially distributed Interarrival times. The Service times \(S_i\) of the four queueing stations Queue1\(\div \)4 are exponentially distributed.
The Arrival rate of the jobs is \(\lambda =1\,\)j/s and the mean Service times of the four queue stations are \(S_1=S_2=S_3=S_4=0.5\, \)s.
In the Editing Fork Properties window (Fig. 5.2) we do not flag the check box for enabling the Advanced Forking Strategies so the Standard Strategy is applied.
For each arriving job, this strategy sends n tasks on each output link of the Fork. For n we left the default value \(n=1\). We require that all four tasks generated by a job, one per output link, complete their execution before the job exits the Join station. To synchronize all the tasks of a job at the Join station, the Standard Join Strategy (see Fig. 5.3) is selected.
Initially we execute a model with Arrival rate \(\lambda =1\, \)j/s. The performance indexes to be collected, together with the requested precision (in terms of confidence level and max relative error) of their values, are shown in Fig. 5.4.
After this single simulation run, we investigate the behavior of performance indexes for different values of arrival rate \(\lambda \). To this end we use the What-if analysis feature (Fig. 5.5). We check the box Enable what-if analysis, and we select Arrival rate as control parameter. Five executions are required with arrival rates \(\lambda =\) 1, 1.2, 1.4, 1.6, 1.8 j/s.
5.1.3 Results
In this section we show some of the results obtained from the simulations and we compare their values with the corresponding exact values computed analytically, when these are available.
The simulation with \(\lambda =1\,\)j/s provided the values of all the measured performance indexes with the precision required in Fig. 5.4.
In Fig. 5.6, the mean Response times of the Queue1 (\(R_{Q1}=1.01\,\)s) and Join1 (\(R_{J1}=0.938\,\)s) stations are shown. The Response time of Queue1 is the mean time of a visit to Queue1 (queue + service). The Response time of Join1 is the synchronization time of the four tasks: it represents the mean time that the tasks whose executions have already terminated must wait until the last one ends, before the job can be fired. The Fork/Join Response time (the mean time spent within the Fork/Join section) provided by the simulation is \(R_{FJ}=1.92\, \)s; in the model considered, it is obtained by adding the mean of the four Response times of the queue stations to the Synchronization time of Join1.
Response times of Queue1 and Join1 stations (mean time for a visit to Queue1 and mean synchronization time at the Join1, respectively) for the model of Fig. 5.1 with \(\lambda =1\,\)j/s
The validation of the results of the individual queue stations, considered in isolation from the rest of the model, can be done by comparison with the corresponding exact values computed analytically. Indeed, each queue can be modeled as an M/M/1 station since both its Interarrival times and Service times are exponentially distributed. Thus, its Utilization is \(U_i=\lambda S_i=0.5\), its Response time (mean time for one visit, queue plus service) is \(R_i= S_i/(1-U_i)= 1\, \)s, and its mean Number of customers (tasks) in the station is \(N_i= U_i/(1-U_i)= 1\) task.
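These M/M/1 values can be reproduced with a few lines, a sketch applying the formulas above to the parameters of the model:

```python
# M/M/1 station in isolation: U = lam*S, R = S/(1-U), N = U/(1-U)
lam, S = 1.0, 0.5        # Arrival rate (j/s) and mean Service time (s)

U = lam * S              # Utilization
R = S / (1 - U)          # mean Response time (queue + service)
N = U / (1 - U)          # mean number of tasks in the station

print(U, R, N)           # 0.5 1.0 1.0
```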
The results obtained from the simulation are very close to those computed analytically: the Response time of Queue1 is \(R_{Q1}=1.01\,\)s (Fig. 5.6), the Utilization is \(U_{Q1}=0.505\), and the mean Number of tasks is \(N_{Q1}=1.01\) tasks. Similar values have been obtained for the other three stations Queue2, Queue3, and Queue4.
To study the behavior of the Fork/Join Response time, which includes the Synchronization time of the tasks at Join1, we use a What-if analysis (Fig. 5.5) requiring the simulation of five models with Arrival rates \(\lambda =1\div 1.8\,\)j/s. The results are plotted in Fig. 5.7.
Unfortunately, the exact formula for the Fork/Join Response time is known only for particular models; in more general cases various approximate solutions are available.
The exact Fork/Join Response time can be computed only when there are two parallel paths at the output of the Fork and the two servers are M/M/1 queue stations with the same service rate.
In this model, the exact mean Fork/Join Response time, see [17], is given by:

\[ R_{FJ} = \frac{12-U}{8}\;\frac{S}{1-U} \qquad (5.1) \]

where \(U=\lambda S\) is the utilization of each of the two stations,
and the exact mean Synchronization time at the Join (referred to as \(R_J\)), i.e., \(R_{FJ}\) minus the M/M/1 Response time of one branch, is

\[ R_J = R_{FJ} - \frac{S}{1-U} = \frac{4-U}{8}\;\frac{S}{1-U} \qquad (5.2) \]
The results obtained with simulation are validated considering the model of Fig. 5.8, whose exact Fork/Join Response time and Synchronization time are given by Eqs. 5.1 and 5.2, respectively. The parameters used are \(\lambda =1\div 1.8\,\)j/s and \(S_{Q1}=S_{Q2}=0.5\,\)s, and all the distributions are exponential.
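The exact two-branch values can be tabulated directly. The sketch below assumes the classic homogeneous two-server result \(R_{FJ}=\frac{12-U}{8}\,\frac{S}{1-U}\), with the Synchronization time obtained by subtracting the M/M/1 Response time of one branch:

```python
def fork_join_2(lam, S):
    """Exact mean Fork/Join Response time and Join Synchronization time
    for two homogeneous M/M/1 branches (two-server exact result)."""
    U = lam * S
    R = S / (1 - U)             # M/M/1 Response time of one branch
    R_fj = (12 - U) / 8 * R     # mean time spent in the Fork/Join section
    R_j = R_fj - R              # mean Synchronization time at the Join
    return R_fj, R_j

for lam in (1.0, 1.2, 1.4, 1.6, 1.8):
    R_fj, R_j = fork_join_2(lam, 0.5)
    print(f"{lam:.1f}  {R_fj:.3f}  {R_j:.3f}")
```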
Table 5.1 shows the results obtained with JSIMg and the corresponding exact values. As can be seen, the exact values are within the 99% confidence intervals, as required (see Fig. 5.4).
For Fork/Join structures with more than two parallel paths, heterogeneous queue stations, or general distributions, there are no exact formulas to compute the performance indexes. However, several approximations, some quite precise but complex to compute, are available in the literature (see, e.g., [28]).
An estimate, rather coarse but simple to compute, can be obtained considering the model typically adopted to study the reliability of parallel infrastructures. A system consisting of n parallel components fails when all the n components fail. Consider the instants at which the tasks complete their executions as events corresponding to the failures of the components of the reliability model. The two models (the Fork/Join and the reliability one) are similar since both seek the mean time required for the end of all the n tasks, or for the failures of all the n components, respectively. In the reliability model several assumptions are typically made (that are not completely satisfied in the Fork/Join model): the n components are independent, identical (with exponentially distributed interarrival times of failures with the same mean), and non-repairable, and no interference is possible between consecutive events (no queues of events are possible for the same component). The events, i.e., the failures, can thus be regarded as generated by n independent Poisson streams with the same mean. Denoting with MTTF the mean time to failure of a single component, and with \(MTTF_n\) the mean time until all the n components have failed (its derivation is summarized in Appendix A.3), it is:

\[ MTTF_n = MTTF \sum_{i=1}^{n} \frac{1}{i} \qquad (5.3) \]
The MTTF of a component represents the mean Response time R of a queue station of our model, whose values are exponentially distributed since each station is modeled as a M/M/1 queue. The \(MTTF_n\) represents the mean time required to have the executions of all the n parallel tasks completed, i.e., the Fork/Join Response time.
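The harmonic-sum result of Eq. 5.3 — the mean of the maximum of n independent exponential lifetimes is MTTF \(\cdot (1 + 1/2 + \dots + 1/n)\) — can be checked with a quick Monte Carlo sketch:

```python
import random

def mean_time_all_end(mttf, n, runs=200_000, seed=1):
    """Monte Carlo estimate of the mean time until all n independent
    exponential activities (mean mttf each) have completed."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(runs):
        acc += max(rng.expovariate(1 / mttf) for _ in range(n))
    return acc / runs

harmonic = sum(1 / i for i in range(1, 5))    # H_4 = 2.0833...
estimate = mean_time_all_end(1.0, 4)          # close to 2.0833
```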
Unfortunately, our original Fork/Join model violates several assumptions of the reliability model: the events on the four queue stations are not independent (the Fork generates the n parallel tasks of a job simultaneously), the tasks may be queued at a station waiting until the server is idle, and a task of a job may start its execution on a station even if the tasks of a previous job are still in execution on the other stations. However, in spite of these violations, the values given by Eq. 5.3 are not very far from the results of the simulation.
To verify these results, consider the Fork/Join Response times shown in Fig. 5.7. With \(\lambda =1\,\)j/s the result of the simulation is \(R_{FJ}=1.922\,\)s while Eq. 5.3 gives 2.08 s. With \(\lambda =1.4\,\)j/s, \(R_{FJ}\) is 3.096 s and the approximate value is 3.47 s. With \(\lambda =1.8\,\)j/s the simulation provides \(R_{FJ}=9.036\,\)s and the approximate value is 10.4 s. If we consider a very low arrival rate, e.g., \(\lambda =0.1\,\)j/s, the utilization is 0.05, queues are very unlikely to form, and the simulation provides \(R_{FJ}=1.086\,\)s while Eq. 5.3 gives 1.096 s: very close! Clearly, the errors increase with the queue lengths, i.e., with the arrival rate, and thus with the Utilization of the stations.
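The approximate values quoted above follow from Eq. 5.3 by taking the M/M/1 Response time R as the MTTF and n = 4; a sketch:

```python
def approx_fj_response(lam, S, n=4):
    # Eq. 5.3 applied to the Fork/Join model: R_FJ ~= R * (1 + 1/2 + ... + 1/n)
    R = S / (1 - lam * S)                    # M/M/1 mean Response time
    return R * sum(1 / i for i in range(1, n + 1))

for lam in (0.1, 1.0, 1.4, 1.8):
    print(lam, round(approx_fj_response(lam, 0.5), 2))   # 1.1, 2.08, 3.47, 10.42
```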
5.1.4 Limitations and Improvements
- Servers with different mean Service times: we assumed that the Service times of the four Queue stations have the same mean and are exponentially distributed. For a generalization it is sufficient to select Queue stations with different mean Service times, see the case study of Sect. 5.2.
- Different number of tasks on each output link of a Fork: the number of tasks generated by a job on each output link is the same. Generalizations are easy to implement by selecting the Advanced Forking Strategies, see Fig. 5.2.
5.2 Impact of Variance on Synchronization
tags: open, single class, Source/Fork/Queue/Join/Sink, JSIMg.
5.2.1 Problem Description
As in the previous problem, we consider the parallel execution of four tasks and their synchronization. The layout of the model is shown in Fig. 5.9. The only difference with respect to the problem of Fig. 5.1 lies in the variance of the Service times of one of the four queue stations: Queue1 has a higher variance than the other three stations (all the mean values are still \(S_i=0.5\,\)s, as in the previous model). We want to investigate the impact of this high-variance station on the synchronization time of the four executions.
The high variability of Service times is typical of many current computing infrastructures, since the applications are frequently executed by very different systems. For example, the Virtual Machines that are dynamically allocated to applications have different computational power and their workloads are often unbalanced. What is surprising is that even a relatively small increase in the variance of the Service times of one station out of four (all with the same mean) has a deep impact on the Fork/Join Response time.
5.2.2 Model Implementation
The mean Service times of the four queue stations are the same used in the model of Fig. 5.1: \(S_1 = S_2 = S_3 = S_4 = 0.5\,\)s. In this model, we assume that the coefficient of variation cv (standard deviation/mean) of the Queue1 Service times is cv = 3 instead of 1 (as it was in the previous model, where we assumed exponential distributions). Thus, the standard deviation of the Service times is 1.5 s, and the variance is 2.25 s\(^2\). Since cv > 1, to simulate the Service times of Queue1 we use the Hyperexponential distribution with parameters cv = 3 and mean value \(S_{Q1}=0.5\,\)s (see Fig. 5.10). From these two parameters JSIMg automatically derives the other parameters needed to generate a hyperexponential distribution with the given mean and variance.
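JSIMg performs this derivation internally; its exact method is not shown here, but a common choice is the two-phase hyperexponential with balanced means, which matches any mean and cv > 1. A sketch of that fit (an assumption for illustration, not necessarily JSIMg's algorithm):

```python
import math, random

def hyperexp_balanced(mean, cv):
    """Two-phase hyperexponential (balanced means) matching mean and cv > 1:
    phase 1 with probability p1 and rate mu1, phase 2 with 1-p1 and mu2."""
    p1 = 0.5 * (1 + math.sqrt((cv**2 - 1) / (cv**2 + 1)))
    mu1 = 2 * p1 / mean
    mu2 = 2 * (1 - p1) / mean
    return p1, mu1, mu2

def draw(p1, mu1, mu2, rng):
    # mix the two exponential phases
    return rng.expovariate(mu1 if rng.random() < p1 else mu2)

p1, mu1, mu2 = hyperexp_balanced(0.5, 3.0)
rng = random.Random(7)
xs = [draw(p1, mu1, mu2, rng) for _ in range(200_000)]
m = sum(xs) / len(xs)                                    # ~0.5 s
sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))  # ~1.5 s
```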
Initially a model with Arrival rate \(\lambda =1\,\)j/s is executed. The impact of the variability of the Service times of one of the stations (Queue1) on the Synchronization time of the four tasks is then investigated. Comparisons with the performance of a single M/G/1 station are also made.
5.2.3 Results
A single simulation run is executed with \(\lambda =1\, \)j/s. The Response times of Queue1 and Join1 stations are shown in Fig. 5.11. The latter represents the Synchronization time of the executions of the four tasks.
Response times of Queue1 and Join1 stations of the model of Fig. 5.9 with \(\lambda =1\,\)j/s. The latter represents the Synchronization time of the four tasks
We evaluated the behavior of the Fork/Join Response time for different values of Arrival rate \(\lambda \) using a What-If (Fig. 5.5). Five models with \(\lambda =\) 1, 1.2, 1.4, 1.6, 1.8 j/s have been executed and the corresponding Fork/Join Response times are shown in Fig. 5.12.
In Table 5.2 we report, for comparison purposes, the Fork/Join Response times and the Synchronization times of the two models of Fig. 5.1 (column Exp) and Fig. 5.9 (column Hyper), respectively, obtained with JSIMg. The impact of the variability of the Service times of one station on the global Fork/Join Response time is evident. For example, with stations utilized at 90% the Fork/Join Response time increases from 9 s (when the standard deviation of the Service times is 0.5 s) to 25 s (when the standard deviation is 1.5 s).
Let us remark that it has been sufficient to triple the standard deviation of the Service times of only one of the four servers to generate a comparable increase in the Response time of the global Fork/Join structure.
To analyze the impact of the high-variance Queue1 station on the Fork/Join Response time, we study it in isolation. According to the assumptions, the Interarrival times of the tasks are exponentially distributed and its Service times follow a hyperexponential distribution. Thus, Queue1 can be modeled analytically as an M/G/1 queue. Its Response time is given by (see, e.g., [36]):

\[ R = S + \frac{U\,S\,(1+cv^2)}{2\,(1-U)} \qquad (5.4) \]
where \(U=\lambda S\) is the Utilization of the station, S = 0.5 s is the mean of the Service times, and cv = 3 is their coefficient of variation. In the last two columns of Table 5.2 the Response times of two stations, M/G/1 and M/M/1, considered in isolation, are reported. The contribution of the M/G/1 station to the global Fork/Join performance is evident if we consider, for example, that with \(\lambda =1.8\,\)j/s its Response time (23 s) represents 92% of the Fork/Join Response time (25 s) with the four queue stations. The huge difference between the Response times of the two types of queues, M/G/1 and M/M/1, must also be pointed out: for example, with U = 0.9 the two values are 23 and 5 s, respectively (last two columns of Table 5.2).
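Eq. 5.4 reproduces these numbers directly; a sketch:

```python
def mg1_response(lam, S, cv):
    """Mean Response time of an M/G/1 queue:
    R = S + U*S*(1 + cv^2) / (2*(1 - U)), with U = lam*S."""
    U = lam * S
    return S + U * S * (1 + cv**2) / (2 * (1 - U))

lam, S = 1.8, 0.5
print(round(mg1_response(lam, S, 3.0), 3))   # 23.0  (hyperexponential, cv = 3)
print(round(mg1_response(lam, S, 1.0), 3))   # 5.0   (M/M/1 as the cv = 1 case)
```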
5.2.4 Limitations and Improvements
- High variability of Service times: all the servers considered in the models have the same mean. It is easy to generalize these models considering heterogeneous servers with different mean Service times and distributions.
- Impact of high variability of the Service times of one server: let us remark that an increase in the variance of the Service times of only one server out of four has been enough to generate dramatic effects on system performance. You may imagine how frequently this condition occurs in real-world data centers, given the high degree of heterogeneity of current workloads! It is therefore very important to keep the variability of the Service times of all the servers under control.
5.3 Synchronization on the Fastest Task
tags: open, single class, Source/Fork/Queue/Join/Sink, JSIMg.
5.3.1 Problem Description
In this section we analyze the effects on the Fork/Join Response time of a Join Strategy different from the Standard one (which synchronizes the executions of all the tasks). According to the Quorum strategy, a Join station releases a job, i.e., fires it, when a subset of the parallel tasks generated by the Fork for that job has completed its execution. In this problem we assume that as soon as one task of a job completes its execution, the Join releases the job.
This problem is typical of several current digital infrastructures, like CEPH, the object storage used by OpenStack, or RAID 1, the mirroring storage architecture, that use data replication as a technique to improve the performance and reliability of systems. The requests for an object (data, file, or other item) are split into several tasks that are sent in parallel to all the devices containing the replicated data. In our case, the object is sent back when the first task (the fastest) finishes. The results show that the impact of the replication technique on the performance and reliability of digital infrastructures is significant.
5.3.2 Model Implementation
We consider the parallel executions of four tasks generated by a job at the Fork on four Queue stations having the same characteristics.
The service requests of the four tasks have the same mean, \(S_1=S_2=S_3=S_4=0.5\, \)s, and are exponentially distributed. The arrival rate of the jobs is \(\lambda =1\,\)j/s, and the Interarrival times are exponentially distributed. The layout of the model is shown in Fig. 5.13. The difference of this model with respect to the one considered in Sect. 5.1 is that the Join does not require that all four executions be completed before releasing the job: it is sufficient that only one of them (i.e., the fastest) completes. We will use the Join Strategy with Quorum = 1 (see Fig. 5.14).
5.3.3 Results
The behavior of performance indexes for different values of arrival rate \(\lambda \) is investigated using the What-if (see, e.g., Fig. 5.5). The Arrival rate is selected as control parameter and the solution of seven models is requested with \(\lambda =\) 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75 j/s, respectively.
The mean Fork/Join Response times are shown in Fig. 5.15. We want to emphasize the differences between the mean values of this index obtained with Quorum=1 (the Join releases the job when the fastest task completes its execution) and the ones obtained with Quorum=4 (the Join waits until the executions of all four tasks are completed before releasing the job). Table 5.3 shows the values of these indexes in the first two columns for Arrival rates \(\lambda \) ranging from 0.25 to 1.75 j/s.
As can be seen, the differences between the values obtained with Quorum=1 and Quorum=4 are remarkable. For example, for \(\lambda =1\;\)j/s the value with Quorum=1 is about 6 times less than the value with Quorum=4 (0.320 s vs. 1.915 s)!
To highlight the impact of the parallelism and synchronization policies, we also show in Table 5.3 the Response time and Utilization of one of the Queue stations considered in isolation. Recall that the four queues are identical. Since with Quorum=1 the Fork/Join Response time is the Response time of only one task, it may seem correct to consider a single queue in isolation to compute its value (see the Queue1 Resp.time column). This assumption is wrong. Indeed, the Response times with Quorum=1 are considerably lower than those obtained with a single queue station (e.g., with \(\lambda =1\,\)j/s it is 0.320 s vs. 1.006 s!). The difference arises because with Quorum=1 the minimum of four sequences of exponentially distributed execution times (with the same mean) is considered, while with the single queue only the average of a single sequence of exponentially distributed Service times is considered.
An estimate very easy to compute (referred to as Optimal Approximation) of the Fork/Join Response time with Quorum=1 can be obtained considering the end of each task as an event generated by one of n independent Poisson generators with the same rate 1/R. This modeling approach is often used to study the reliability of a system consisting of n components, e.g., devices, connected in series. This type of system fails when any one of the n components fails. The events considered are the failures of the devices. The time between two consecutive failures of a device is referred to as MTTF, mean time to failure. The assumptions considered are: the n identical components are independent, the failures are exponentially distributed in time, and the components are non-repairable. Thus, we may consider the model as consisting of n identical independent Poisson arrival streams of events (the failures) with exponentially distributed interarrival times, and no queues are possible among consecutive events. The mean time \(MTTF_1(n)\) to the first failure of such a system (see Appendix A.3 for its derivation) is given by:

\[ MTTF_1(n) = \frac{MTTF}{n} \qquad (5.5) \]
where MTTF represents in our model the mean Response time R of a queue station, which is exponentially distributed. Indeed, according to the assumptions, each queue is of M/M/1 type, and thus \(R=S/(1-\lambda S)\) with exponential distribution. Considering that the number n of parallel stations is 4, Eq. 5.5 provides the values reported in the last column, Optim.Approx., of Table 5.3.
As can be seen, these values are not very far from the corresponding Fork/Join Response times with Quorum=1, and the differences increase with \(\lambda \). This is due to the assumptions made in the failure model that are violated in the simulated Fork/Join model. Indeed, the parallel tasks of a job are generated by the Fork simultaneously on the n queue stations, so they are not independent, and furthermore interferences are possible among consecutive tasks at any of the n queues.
If we consider a very low Arrival rate, e.g., \(\lambda =0.05 \,\)j/s, the Fork/Join Response time given by Eq. 5.5 is 0.128 s and the value obtained with JSIMg is 0.130 s (Fig. 5.16). These values are so close because in this case the Utilization of the queues is very low, \(U=\lambda S = 0.025\), and thus the interferences among consecutive tasks are negligible. Indeed, the Response time of a Queue station given by the simulation is 0.514 s while the mean Service time S is 0.5 s, very close (practically, queues of tasks waiting for the server almost never form). Clearly, as the arrival rate increases, the approximation becomes increasingly loose.
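The minimum-of-exponentials argument behind Eq. 5.5 can be sketched numerically, under the independence assumption (which, as discussed, ignores the queueing interferences of the real Fork/Join model):

```python
import random

lam, S, n = 1.0, 0.5, 4
R = S / (1 - lam * S)            # M/M/1 mean Response time of one queue: 1 s

approx = R / n                   # Eq. 5.5: mean of the minimum of n Exp(1/R)

rng = random.Random(3)
runs = 200_000
mc = sum(min(rng.expovariate(1 / R) for _ in range(n))
         for _ in range(runs)) / runs   # Monte Carlo estimate, close to R/n
```

With \(\lambda =1\,\)j/s both values are about 0.25 s, below the 0.320 s measured with Quorum=1: the gap is exactly the effect of the violated independence assumptions.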
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
Cite this chapter
Serazzi, G. (2024). Parallel Computing. In: Performance Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-36763-2_5
Print ISBN: 978-3-031-36762-5
Online ISBN: 978-3-031-36763-2