Abstract
In this chapter we will examine the gap between the expected execution time of a parallel algorithm and the actual running time achieved by executing its encoding as a ParC program on a real shared memory machine. Although such a gap exists for sequential programs as well, it is more problematic for parallel programs, where users often encounter programs that fail to run as fast as expected. In particular, a parallel program that runs on a parallel machine with P processors is expected to run about P times faster than its sequential version. There are two issues involved with this problem:
- Determining the execution time of a parallel program and comparing it with a desired execution time.
- If there is a significant gap between the two, identifying which factors in the parallel program should be corrected so that the performance is improved.
5.1 Introduction
In this chapter we will examine the gap between the expected execution time of a parallel algorithm and the actual running time achieved by executing its encoding as a ParC program on a real shared memory machine. Although such a gap exists for sequential programs as well, it is more problematic for parallel programs, where users often encounter programs that fail to run as fast as expected. In particular, a parallel program that runs on a parallel machine with P processors is expected to run about P times faster than its sequential version. There are two issues involved with this problem:
- Determining the execution time of a parallel program and comparing it with a desired execution time.
- If there is a significant gap between the two, identifying which factors in the parallel program should be corrected so that the performance is improved.
In general, there are two approaches to the problem of closing the gap between parallel algorithms and their actual runs as parallel programs on real machines:
- Simulation: This is the common method of diagnosing efficiency gaps. It is accomplished by executing the program in a simulation environment and collecting the events and statistics relating to the execution. The events and the statistics allow the user to locate the inefficient parts of his or her program. For example, a simulator can display the idle times of each processor, allowing the user to visually observe that some processors are relatively overloaded compared to others. The main problem with this method is the lengthy and often impractical simulation times.
- Performance models: This is the approach discussed in this chapter. It is based on the development of a formula that can predict the expected execution time of a parallel program when executed by a specific parallel machine. By analyzing the time formula we can determine if the expected time is sufficient. If it is not, we can identify the factors that limit the performance. This approach is faster than simulations but is likely to be less accurate. The formula is used to define the "speedup" of a program, namely, how well the program uses the parallelism of the machine in comparison to a sequential execution.
The approach rests on three elements. The first, developed in this section, is a model of a virtual parallel machine (VPM), through which the user can evaluate the effect of changes made to the program; this model includes not only the structure of a virtual parallel machine, but also a speedup equation and the characteristics of efficient programs. The second element is a set of high level constructs of the programming language relating to different aspects of efficiency (such as the distinction between local and non-local memory references). The third element is a set of transformations that the user applies to the code of an initial program, resulting in an efficient program that exploits different aspects of the underlying machine.
5.2 A Simple Model of a Virtual Machine
The virtual parallel machine (VPM) is a model of a schematic parallel machine that describes the parallel execution of a program in that language. The VPM model is therefore a generalization of all of the practical aspects involved in the execution of parallel programs. The model we present here reflects two practical aspects in the realization of PRAM-like programs:
- The overhead involved in the creation of new threads.
- The different access times to local memory and to remote (global) memory, which is accessed through some sort of network.
The virtual machine hides minor aspects of the physical hardware and software that executes ParC . Thus, the programmer can ignore questions such as, how exactly does the operating system execute the program, or does the machine overcome barriers using busy-waits. Moreover, the virtual machine connects important aspects of the hardware (such as the number of processors) with the program’s parameters and syntax (such as the number of threads spawned by the program). This connection results in a formula and a model through which the execution time of any program can be evaluated.
A “true” model will be able to fully predict the execution time of a given program. However, such a model might be too complicated as a program development tool. The model is, therefore, a compromise between the need to include as many low-level factors as possible, and the need to create a simple tool that a programmer can use to develop applications.
The realization that the model cannot provide an exact prediction of the execution time leads to the consideration of weaker requirements. The first deals with an important and desirable property of the model. We say that a VPM model is useful if, for any change of a program R→R′ such that R′ uses fewer resources than R, the model predicts a shorter time for R′. For example, if R′ uses fewer network accesses than R, we require that VPM(R′)≤VPM(R) (where VPM(R) is the prediction of the amount of time it will take the virtual machine model to execute R). In other words, a model is useful if the user can predict and evaluate the effect (or usefulness) of any possible modification to his or her program.
The second consideration deals with a desirable property that we would like any realization of the VPM (such as an implementation of the language) to have. We say that the realization of a parallel programming language (ParC in our case) on physical hardware is fair if for any two programs R1,R2, if VPM(R1)≤VPM(R2) then RM(R1)≤RM(R2) (where RM(R) is the actual time it takes to execute R on the above hardware). Clearly, the validation that a given model satisfies such weaker requirements is obtained via experiments and actual usage (i.e., it cannot be proven theoretically because we argued that a complete model is impossible to construct).
These two definitions imply that, once a VPM model has been accepted as the standard model of a parallel language, a minimal set of requirements from any possible realization of that language is devised as well. Note that the above definitions are the weakest possible, because they demand only that any improvement in the program should yield an improvement in its execution time. A stronger requirement might require a constant relation between improvements in the model and the execution time: \(c_{1} \frac{\mathit{VPM}(R1)}{\mathit{VPM}(R2)} \leq\frac{\mathit {RM}(R1)}{\mathit{RM}(R2)} \leq c_{2} \frac{\mathit{VPM}(R1)}{\mathit{VPM}(R2)}\).
Finally, we will discuss the type of equations to use for the proposed model. This is a meta-discussion whose goal is to motivate the mathematical formulation of the proposed model. Assume that we are to estimate the value of a function f(x) without really knowing the exact formula of f(x). The idea is to find a sequence of lower bounds \(\Omega_1(x), \Omega_2(x), \ldots\), each bounding the value of f(x) by considering a different aspect of f(x). For example, if \(f(x)=3x^{2.5}+2x+3\), we may a priori know that \(f(x)>x^2\), thereby obtaining \(\Omega_1(x)=x^2\). Thus, if we are able to find a sequence of two lower bounds, we can write that \(f(x)>\max(\Omega_1(x),\Omega_2(x))\). The proposed model will then argue that \(f(x)\approx\Omega_1(x)+\Omega_2(x)\). Note that it is always true that
$$\max\bigl(\Omega_1(x),\Omega_2(x)\bigr) \;\leq\; \Omega_1(x)+\Omega_2(x) \;\leq\; 2\cdot\max\bigl(\Omega_1(x),\Omega_2(x)\bigr) \;\leq\; 2\cdot f(x), $$
so we have some justification for assuming that \(f(x)\approx\Omega_1(x)+\Omega_2(x)\). A stronger justification would be to say that if \(f(x)>\Omega_1(x)+\Omega_2(x)\) still holds, then we can always find another lower bound on a different aspect of f(x) (e.g., \(\Omega_3(x)=2x\)) and add it to the equation, obtaining \(f(x)\approx\Omega_1(x)+\Omega_2(x)+\Omega_3(x)\). Eventually, we will sum enough lower bounds to overestimate the value of f(x). Note that because we have used only lower bounds, we know that in the worst case our estimation is at most k times the actual size of f(x), where k is the number of lower bounds we used. For example, if \(f(x)<\Omega_1(x)+\Omega_2(x)+\Omega_3(x)\), then since \(\Omega_1(x)+\Omega_2(x)+\Omega_3(x)<3\cdot\max(\Omega_1(x),\Omega_2(x),\Omega_3(x)) \leq 3\cdot f(x)\), we get that \(3\cdot f(x)>\Omega_1(x)+\Omega_2(x)+\Omega_3(x)\).
5.3 The ParC Virtual Machine Model
The previous section presented the notion of the VPM model and its goals. In this section we introduce a specific VPM model for the execution of ParC programs. This model is also suitable for other parallel programming languages that spawn explicit threads, and use shared memory.
The execution of a parallel program is a dynamic process in which new threads are created and terminated. Moreover, the execution of one thread may depend on the execution of another thread (e.g. one thread waits for a value to be computed by another thread). This implies that threads cannot always be run to completion; rather, threads must be executed alternately. In addition, a real parallel machine should be able to manipulate more threads than the number of physical processors. This manipulation is accomplished through one or more queues of thread records. Every processor picks a thread from a queue and executes it for a while, then returns it to the queue and picks another one (see Fig. 5.1; a code sketch of this loop follows the rules below). The model observes the following rules:
- When a thread needs to spawn new threads, it puts one representative record with a description of these threads in the global queue.
- The spawning thread can resume its operation only when all of its children have terminated (the last child should wake the parent).
- We assume parallel access to the global queue. Moreover, many processors can access the same representative record simultaneously (e.g. using the faa() instruction). Thus, initializing parfor(int i=0;i<1000;i++) requires one step, not 1000 steps.
- Processors are never idle as long as the global queue is not empty.
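To make these rules concrete, here is a minimal C sketch (ours, not ParC's actual runtime) of the loop each processor runs against the global queue; thread_rec, queue_peek() and wake_parent() are hypothetical names:

#include <stdatomic.h>
#include <stddef.h>

typedef struct thread_rec {
    void (*body)(int index);   /* code shared by all sibling threads       */
    atomic_int next_index;     /* next child index to hand out (via faa)   */
    int total;                 /* number of threads this record represents */
    atomic_int remaining;      /* live children; initialized to total      */
    struct thread_rec *parent; /* record of the blocked spawning thread    */
} thread_rec;

extern thread_rec *queue_peek(void);          /* hypothetical global-queue API  */
extern void wake_parent(thread_rec *parent);  /* hypothetical: unblock a parent */

/* Each processor runs this loop, so it is never idle while the global queue
   is non-empty; a parfor of n iterations costs one representative record. */
void worker(void) {
    for (;;) {
        thread_rec *r = queue_peek();
        if (r == NULL)
            continue;                                 /* busy-wait for work      */
        int i = atomic_fetch_add(&r->next_index, 1);  /* the faa() of the rules  */
        if (i >= r->total)
            continue;                                 /* record already exhausted */
        r->body(i);                                   /* run one child thread    */
        if (atomic_fetch_sub(&r->remaining, 1) == 1)
            wake_parent(r->parent);                   /* last child wakes parent */
    }
}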
Although our primary goal is to examine multicore machines, the proposed model can be used for other types of shared memory machines. In general, the distinction between the local and global memories deserves some elaboration. Some shared memory machines put the processors on one side of the communication network, and the memories on the other side. However, we claim that at least some local memory is essential to ensure optimal performance. Given that access to local memory is always faster than access to memory through the network, the program, at least, should be stored locally. Note that this dichotomy is not equivalent to the distinction between private and shared memory. Indeed, the local memories may be globally accessible to the rest of the processors. We indicate only that it is reasonable to assume that any shared memory machine will allow the processor to access a portion of the memory as its local memory. The “global memory” in the model is a conceptual entity, capturing the concept of memory that has to be accessed through the network and therefore degrades performance. It does not necessarily match any specific hardware component.
The VPM model connects these two sides (machine and program) and calculates the execution time to be the total work and overhead (of a program) divided by the number of processors. For a given program R, a specific input (which, for reasons of convenience, is omitted) and a fair realization, the execution time on the VPM model is defined using the following parameters. Some of the parameters depend on the program R, while others are constants.
- \(P\): the number of physical processors that the machine has.
- \(C\): the overhead needed to spawn a new thread and then to delete it, including its share in the coordination required to wake the parent thread.
- \(c\): the overhead or time needed to synchronize N threads using the sync statement. Note that \(C>c\), because C contains the allocation of thread resources besides synchronization.
- \(N(R)\): the total number of threads spawned by the program.
- \(D(R)\): the longest path in the execution graph (see Fig. 5.2), which is defined recursively on R:
$$\renewcommand{\arraystretch}{1.3} D(R) = \left\{ \begin{array}{l@{\quad}l} 1 & R~\mathrm{is~an~atomic~statement} \\ \sum_1^k D(S_i) & R = \{ S_1 ; \ldots; S_k \} \\ \sum_{i=1}^k (D(S_{i}) + 2) & R = \textsf {for}~(i=0;i<k;i++)~S ; \\ 1 + \max\{ D(S_1), D(S_2) \} & R = \textsf {if}~(exp)~S_1;~\textsf {else}~S_2;\\ 1 + D(S_{f(x)}) & R = f(x); \\ C + \max_1^k D(S_i) & R = \textsf {parfor}~\textsf {int}~i; 1; k; 1;~S_i;~\textbf {epar}\\ C + \max_1^k D(S_i) & R = \textsf {parblock}~S_1 : \ldots: S_k~\textbf {epar}\\ \end{array} \right. $$
where \(S_{f(x)}\) is the body of the function f after substituting the parameters. Note that the i<k; and the i++ instructions are counted in the case of the sequential for statement.
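As a small worked example (ours, not the text's), consider the one-line program R = parfor int i; 1; k; 1; a[i]=a[i]+1; epar. Each thread body is a single atomic statement, so by the recursion above \(D(R) = C + \max_{1}^{k} 1 = C+1\), while replacing \(C+\max\) by a sum gives \(W(R) = \sum_{1}^{k} 1 = k\) (see the definition of W(R) below).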
In the case of a sync instruction, the longest path should include the waiting times that result from waiting for the sync instruction. Sync waiting times are inserted into the graph by adding edges between any sync instruction and all of the sync instructions executed by the threads spawned by the current parallel construct. Consider, for example, the recursive parallel program of Fig. 5.2. Before adding the extra sync edges, the longest path in the execution graph is 65, containing only the loop of the first thread. However, by adding the sync edges between suitable sync instructions, the longest path now contains all of the loops and its value is 71 (as depicted in Fig. 5.3’s broken line).
- \(W(R)\): the total number of instructions executed by R. W(R) is computed in the same way as D(R), except that \(C+\max\) is replaced by \(\sum\) in the case of the parallel constructs. W(R) is the time it would take us to simulate the sequential execution of R.
- \(S_s(R)\): the sequential part of R. This counts all of the instructions that are not within the scope of any parallel construct (including indirect scopes through function calls).
- \(S_l(R), S_g(R)\): the total number of accesses to local and global memory, respectively. \(S_l\) is computed in the same way as W(R), except that for every atomic instruction that does not belong to \(S_s\), we count the number of accesses to local variables. A local access is any reference to a variable declared in the same block wherein the access occurred (e.g. accessing parameters and local variables in a function body). \(S_g\) is defined in a similar way, except that we count the accesses to non-local variables. For example, in the code of Fig. 5.4, \(S_l=3\) and \(S_g=6\).
- \(F_m(P,S_g,S_l)\): \(F_m\) is used to estimate the delay caused by each global access in the execution of R. \(F_m\) takes into account the number of processors (the maximal number of references through the communication network) and the relationship between the local and global accesses. Let \(Z = \frac{S_{g}}{S_{g}+S_{l}}\) be the relative weight of the global accesses. At any given time, there are at most P accesses to the memory. When Z is small compared to \(\frac{1}{P}\), we estimate that most of the accesses are local and the network is not loaded. This is a simplified model but it captures the intuition that
if for every global access through the underlying communication network there are P−1 local references that can be executed concurrently, then no two global accesses occur at the same time. Hence, the network is not a bottleneck.
Thus, \(F_m\) is a lower bound because it assumes the best possible scheduling order of global memory references, minimizing the number of global memory references that attempt to use the communication network at any given time. This is, of course, a very optimistic assumption: it may be that all of the global accesses are executed first, saturating the communication network, and only later are all of the local accesses executed. When \(Z \geq\frac{1}{P}\), we estimate that for a fraction Z of the time the network is loaded and \(\frac{1}{Z}\) of the accesses are delayed by the network. For multicore machines, the communication network is a bus. Assuming a bus can service one processor at a time, we get that
$$F_{bus}(P, S_{g}, S_{l}) = \max\biggl\{ \frac{S_{g} \cdot P}{S_{g}+S_{l}}, 1 \biggr\} . $$
Multicore machines include a caching mechanism; therefore, some of the global read accesses of ParC become local accesses to the cache. The cache efficiency is represented by a factor \(0\leq\alpha\leq 1\) (the "hit ratio"), such that \(S_{l}^{eff} = S_{l} + \alpha S_{g}\) and \(S_{g}^{eff} = (1-\alpha) S_{g}\). Note that α should not be confused with the cache efficiency in a single core machine, which is usually high (around 0.95). The reason for this difference is that when a ParC program is executed, the underlying MESI/MOESI coherence protocol will invalidate more cache lines than in the case of a sequential program executed by a single core. The expected cache efficiency α is reduced when the shared memory is used extensively and is subject to experimental evaluation for each program separately. Since we model the effect of bus transactions by a constant that is experimentally determined, we can, for the time being, ignore the cache, assuming that α=1.
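As a numeric illustration (our numbers, not the text's): for a program with \(S_g=10^3\), \(S_l=9\cdot 10^3\) and P=8, we get \(Z=0.1\) and \(F_{bus}=\max\{0.8, 1\}=1\), so the bus is not a bottleneck. Doubling \(S_g\) to \(2\cdot 10^3\) gives \(F_{bus}=\frac{2000\cdot 8}{11000}\approx 1.45\), and global accesses start to be delayed. With a hit ratio of \(\alpha=0.5\), that second program would have \(S_g^{eff}=10^3\) and \(S_l^{eff}=10^4\), restoring \(F_{bus}\) to 1.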
Using the above parameter set, the execution time of a program R on a given input is calculated in the VPM model by the following expression; this expression is, in a sense, a more elaborate version of the formula used to express Amdahl's Law (see related exercise):
$$T(R) = D(R) + \frac{S_{s}\cdot(P-1) + N\cdot C + W + S_{g}\cdot F_{m}(P,S_{g},S_{l})}{P} . $$
Two lower bounds are used as the basic motivation to construct T(R). Clearly, the parallel time is bounded from below by D(R), because it is the longest path in the execution graph: no matter how many processors the machine has, D(R) is a lower bound. On the other hand, the total amount of work divided by the number of processors, \(\frac{W(R)}{P}\), is another lower bound on the execution time of R (otherwise, there would have been a better sequential algorithm). The overhead of a parallel construct with explicit sync instructions is C+c, and the total time should be computed accordingly.
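The prediction can be packaged as a small C routine (a sketch based on the formula above, with our own function and parameter names):

double vpm_time(double D, double W, double Ss, double N,
                double Sg, double Sl, double C, double P) {
    double Z = Sg / (Sg + Sl);                 /* relative weight of global accesses */
    double Fbus = (Z * P > 1.0) ? Z * P : 1.0; /* F_bus = max(Sg*P/(Sg+Sl), 1)       */
    return D + (Ss * (P - 1) + N * C + W + Sg * Fbus) / P;
}

For instance, the Fig. 5.6 parameters of Sect. 5.5 (D=303, W=603, \(S_s\)=2, N=100, \(S_g\)=500, \(S_l\)=501, C=10, P=10) yield \(F_{bus}=5.0\) and a predicted time of about 715 steps.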
In many practical cases, the user can express \(D, S_s, N, W, S_g\) and \(S_l\) as functions of the input size. In this case the user can derive an explicit formula that describes the parallel execution time of the program for any given input. For example, consider the recursive program in Fig. 5.5. Let n be the number that is given as an argument to f. Then
- \(D(n)=(C+2)\log n\): the depth of a binary tree wherein each node takes C instructions to be created and then executes one if statement and a function call.
- \(S_s(n)=1\): there is only one sequential statement, the first if.
- \(N(n)=2n\): the number of nodes in a binary tree.
- \(W(n)=6n\): there are 2n nodes, each executing an if statement and two function calls.
- \(S_g(n)=2n\): each function call f(x/2) accesses x, which is declared as a parameter of the outer function.
- \(S_l(n)=2n\): in the if (x > 1) the x is a local variable (a parameter).
- On a multicore bus machine
$$F_{bus}(P) = \max\biggl\{ \frac{2n \cdot P}{2n+2n}, \; 1 \biggr\} = \max \biggl\{ \frac{P}{2}, \; 1 \biggr\} $$
The final expression (assuming P>1) is then
$$T(n) = (C+2)\log n + \frac{(P-1) + 2n\cdot C + 6n + 2n\cdot\frac{P}{2}}{P} . $$
Consider, for example, some specific values of n=1000, C=100, P=10 for the execution of the above program. It follows that the most dominant factor is \(\frac{ 2n \cdot C}{P} = 20{,}000\). The sequential execution time for n=1000 is 2000. Thus, using the above formula allows us to determine that this is an extremely inefficient program that runs ten times slower than a naive sequential version. Moreover, it is clear that unless we use P>100 processors, the above program will never be sped up to the point that its execution time compares with that of a sequential execution. Note that this analysis took place before any running or debugging, thereby saving considerable coding and debugging effort.
5.4 Speedup Notion for Programs
Using the VPM model, we can define the notion of speedup for parallel programs in a similar manner to the well known theoretical speedup for parallel algorithms:
$$\mathit{SP}(R) = \frac{W(R)}{T(R)} . $$
Both W(R) and T(R) are defined only for a specific input to R. However, a speedup notion that varies from one input to the other is not useful, because it does not characterize the program. For example, consider a program that for odd input sizes has a speedup of P, while for even input sizes it has a speedup of 1. The speedup notion is therefore valid only if it is defined independently of the program’s specific inputs. Thus, we assume that the user is able to express all of the factors of W(R) and T(R) by symbolic expressions of the input size n.
The importance of the speedup notion is that it allows us to discuss the efficiency of parallel programs. For example, we may say that a parallel program R is considered efficient (in the sense that it exploits the parallelism of the machine) if there is an input size N 0, such that for any n>N 0 the speedup is \(\frac{P}{2} \leq\mathit{SP}(R)\).
Note that this definition is different than the theoretical one. The theoretical definition compares the execution time of a parallel algorithm with P processors to the time of the best-known sequential algorithm. Had we attempted to use this notion for parallel programs, we would have had to determine the best sequential program, which is not practical. Therefore, we compare the sequential execution time of a parallel program to its parallel execution, where the sequential execution time of a parallel program is the total number of instructions executed by that program. Hence, the speedup definition for programs is made to be less than P only as a result of non-optimal coding and execution, not the incorrect selection of the algorithmic aspects.
The ratio between sequential and parallel execution actually gives the efficiency in terms of the amount of time that the processors are active and executing useful code. The difference between the notion of a program speedup and an algorithm speedup is that in the case of the former, the algorithm has already been determined. The remaining question is how well the algorithm’s implementation exploits the parallel machine.
The proposed definition identifies efficient parallel programs as those that have a linear speedup with a constant between \(\frac{1}{2}\) and 1. This definition can be refined further by dividing the speedup equation into “speedup factors,” which can be considered necessary conditions for efficiency. If any one of these conditions is violated, the program cannot be efficient. The equations lead to the identification of four such conditions:
- Large optimal speedup: \(\mathit{SP}^{o}(R) \equiv\frac{W}{D} \geq\frac{P}{2}\). In terms of the execution graph, a program R should be as "wide" as possible (i.e. each parallel construct should spawn as many threads as possible). In order to increase \(\mathit{SP}^o(R)\), the critical path D must be shortened, possibly by executing parts of it in parallel. This goal can be accomplished by, for example, joining consecutive independent parallel constructs into one construct. Clearly, \(\mathit{SP}^o\) is an optimal speedup because no matter how many processors we add, the speedup of a program cannot exceed \(\mathit{SP}^o\).
- Short sequential code factor: \(\mathit{seq}(R) \equiv\frac{S_{s}}{W} \leq\frac{2}{P}\). This is actually a restatement of Amdahl's Law.
- Large average size of a thread: \(\mathit{grain}(R) \equiv\frac{W}{N} \geq\frac{C}{2}\). \(\frac{W}{N}\) is the average size of a thread. This implies that the threads should be made large enough, i.e., that not all threads can be fine-grained.
- Small global access factor: \(\mathit{glob}(R) \equiv\frac{ S_{g} \cdot F_{m}( P, S_{g}, S_{l} )}{W} \leq 2\). Usually, \(W\approx S_{g}+S_{l}\) (the number of instructions corresponds to the number of memory references). Thus, the condition may be rewritten as \(\frac{S_{g} (F_{m} - 2)}{2} \leq S_{l}\), leading to the intuitive coarse approximation \(S_{g}\cdot F_{m} \leq S_{l}\). This condition indicates that, on average, a global or external access occurs at most once every \(F_m\) local accesses, so that there are no delays caused by the network (see the definition of \(F_m\)).
Note that the fifth factor of the speedup equation, \(\frac{W-S_{s}}{W \cdot P}\), is always \(\leq\frac{2}{P}\); therefore, it does not limit efficiency.
The most important factor in the above list is \(\mathit{SP}^o\) because it is the only factor that does not relate to the hardware parameters. Thus, \(\mathit{SP}^o\) characterizes the program, while the other factors characterize its execution by the parallel hardware. Recall that \(\mathit{SP}^o\) provides an upper bound on the speedup. Thus, in cases where \(\mathit{SP}^o>P\), we can define the effective speedup to be \(\mathit{SP}^{e}(R) = \frac{W \cdot P}{ S_{s} \cdot(P-1) + N \cdot C + W + S_{g} \cdot F_{m} }\), i.e. the other parts (besides \(\mathit{SP}^o\)) of the speedup definition. In these cases, \(\mathit{SP}(R)=\mathit{SP}^{e}(R)\).
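The factor checks themselves are mechanical; the following C sketch (our names, using the bus formula for \(F_m\)) evaluates the four conditions and the effective speedup:

#include <stdio.h>

void check_factors(double D, double W, double Ss, double N,
                   double Sg, double Sl, double C, double P) {
    double Fm = (Sg * P / (Sg + Sl) > 1.0) ? Sg * P / (Sg + Sl) : 1.0;
    double SPo = W / D;        /* optimal speedup: want >= P/2     */
    double seq = Ss / W;       /* sequential factor: want <= 2/P   */
    double grain = W / N;      /* average thread size: want >= C/2 */
    double glob = Sg * Fm / W; /* global access factor: want <= 2  */
    double SPe = W * P / (Ss * (P - 1) + N * C + W + Sg * Fm);
    printf("SPo=%.2f seq=%.3f grain=%.1f glob=%.2f SPe=%.2f\n",
           SPo, seq, grain, glob, SPe);
}

Running it on the Fig. 5.6 example of the next section approximately reproduces the values quoted there.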
The discussion so far has assumed that the hardware characteristics are fixed and the user should adapt his or her program to the hardware in order to reach optimal performance. However, the opposite situation is also worth considering: the user has a specific program or application to which parallel hardware should be adapted. More specifically, for a given program and input, what is the optimal number of processors that the user needs? Simple algebraic manipulation shows that the optimal number of processors \(P_{opt}\) is obtained when \(\mathit{SP}^o=\mathit{SP}^e\), and is given by:
$$P_{opt} = \frac{W + N\cdot C + S_g\cdot F_m - S_s}{D - S_s} . $$
But \(F_m\) is also a function of P. Taking \(F_{m}=\frac{S_{g}\cdot P}{S_{g}+S_{l}}\) yields:
$$P_{opt} = \frac{W + N\cdot C - S_s + \frac{S_g^2\cdot P_{opt}}{S_g+S_l}}{D - S_s} . $$
Thus, after simplifications (neglecting the small \(S_s\) terms) we get that
$$P_{opt} = \frac{W + N\cdot C}{D\cdot\bigl(1 - \frac{S_g^2}{D(S_g+S_l)}\bigr)} . \qquad(5.1) $$
In many cases, it is reasonable to assume that \(D\geq S_{g}\), and since \(S_{l}+S_{g}>S_{g}\), then \(\frac{S_{g}^{2} }{D(S_{g}+S_{l})} < 1\) and its corresponding factor in Eq. (5.1) can be ignored. Hence:
$$P_{opt} \approx \frac{W + N\cdot C}{D} . $$
Taking \(S_{l}>2\cdot S_{g}\) makes the above condition become \(D\geq \frac{S_{g}}{3}\). Otherwise, \(D\leq\frac{S_{g}}{3}\) and we get that:
$$P_{opt} = \frac{(W + N\cdot C)\cdot(S_g+S_l)}{D\cdot(S_g+S_l) - S_g^2} . $$
Note that the optimal number of processors is usually (assuming \(D>\frac{S_{g}}{3}\)) not dependent on \(F_m\). This lack of dependence reflects the fact that changing the number of processors does not help the delay caused by external accesses. As in the case of the speedup, \(P_{opt}\) can be expressed as a function of the input size.
5.5 Using the Speedup Factors
The speedup factors can be used to analyze the program, determine its efficiency, and then decide which factors should be improved and how. The programmer is free to modify his or her program and optimize it as long as he or she does not change the semantics of the underlying program, as expressed in the following definition:
Definition 5.1
Let E(R,I) denote all possible execution orders of a program R on input I (as described earlier). R′ is a legal, efficient version of R if \(E(R',I)\subseteq E(R,I)\) for every I, and SP(R′)>SP(R).
Typically, the optimization of a given program is an iterative process, wherein a sequence of legal, efficient versions of the original program is created. The programmer attempts to isolate a group of execution orders that have a better speedup than the original program.
For example, consider the code segment in Fig. 5.6, executed on a 10 processor bus machine with C=10. In this program \(N=100\), \(D=303\), \(W=603\), \(S_g=500\), \(S_l=501\) and \(S_s=2\). Using the expression for \(F_m\) we obtain that \(F_{bus}=5.0\). The speedup factors for R are:
- \(\mathit{SP}^o(R)=1.93\), which should be at least 5
- \(\mathit{seq}(R)=0.003\), which is indeed less than 0.2
- \(\mathit{grain}(R)=6\), which is above 5, as it should be
- \(\mathit{glob}(R)=4.15\), which is higher than the required 2
The speedup equation is
$$\mathit{SP}(R) = \frac{W\cdot P}{S_{s}\cdot(P-1) + N\cdot C + W + S_{g}\cdot F_{bus}} = \frac{603\cdot 10}{18 + 1000 + 603 + 2500} \approx 1.46 , $$
and \(P_{opt}(R)=13\).
The above program is clearly inefficient, because instead of a speedup between 5 and 10, it achieves a speedup of 1.46. If the user wants to improve the speedup, he or she can add three additional processors to reach the optimal number of 13. However, the speedup will still not exceed 1.93.
Using the speedup factors, the user can evaluate the program as follows. It is not balanced, as there is one thread that is longer than the total length of the rest of the threads. This lack of balance is the dominant factor that limits the optimal speedup. In addition, there are too many global references. A reasonable approach for working on this program is to try to improve SP o(R) and glob(R), and hope for a speedup of 5.
In order to improve the performance, the new version in Fig. 5.7 makes the following changes (a schematic sketch of the chunking idea follows this list):
- The number of threads has been reduced from 100 to 12. The first 10 threads simulate the 99 previous threads.
- The instruction x=g/2 is executed outside the parallel construct. This change does not affect the results of this program because updating a global variable 99 times in parallel is usually equivalent to updating it once. Note that in this case the value of x has been determined to be 49 or 99, which are only two out of the 0…99 possible values for x in the program of Fig. 5.6.
- In order to balance the execution graph, the "long" thread i=100 is divided into two threads i=11, i=12, each executing 50 assignments out of the original 100.
- The number of processors has been reduced to 5, i.e., P=5.
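A hedged sketch of the chunking idea used in the first change above (ours, not the actual code of Fig. 5.7, which is not reproduced here); A, x, n and P are assumed to be declared elsewhere, and parfor is ParC's construct while the rest is plain C:

/* replace n fine-grained threads by P coarse threads:   */
/* N drops from n to P, so grain(R) = W/N grows by n/P   */
parfor (int t = 0; t < P; t++) {
    for (int i = t * (n / P); i < (t + 1) * (n / P); i++)
        A[i] = x;          /* body of one original thread */
}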
In this version \(N=12\), \(D=165\), \(W=415\), \(S_g=300\), \(S_l=731\) and \(S_s=3\). Using the expression for \(F_m\), we obtain that \(F_{bus}=1.16\). The speedup factors for R are:
- \(\mathit{SP}^o(R)=2.67\), which is above 2.5, as required
- \(\mathit{seq}(R)=0.007\), which is indeed less than 0.4
- \(\mathit{grain}(R)=34.5\), which is above 5, as it should be
- \(\mathit{glob}(R)=0.8\), which is less than 2, as it should be
The speedup equation is
$$\mathit{SP}(R) = \frac{W\cdot P}{S_{s}\cdot(P-1) + N\cdot C + W + S_{g}\cdot F_{bus}} = \frac{415\cdot 5}{12 + 120 + 415 + 348} \approx 2.32 , $$
and \(P_{opt}(R)=3.54\), which is close to the new choice P=5. Hence, the new version of Fig. 5.7 satisfies the efficiency criteria presented so far.
Another simpler approach is to observe that the instructions g=j; are independent and can be executed in parallel. Hence, every g=i instruction can be appended to A[i]=x;x=g/2; as described in Fig. 5.8, creating a balanced program that can be further optimized (extracting x=g/2; outside the parallel loop). The previous optimized version is more complicated, and is used to illustrate a broader set of optimization techniques.
Next, we consider the use of a mapped light parfor as a way to improve the speedup factors. For example, consider the code segment in Fig. 5.9, executed on a 10 processor bus machine with C=10 and \(n=10^6\). Due to the use of the light parfor, the number of threads is reduced to N=10. In this case \(D=n/P+P\approx 10^5\), the work is \(W\approx 10^6\), \(S_g=2\cdot 10^6\), \(S_l\approx 5\cdot 10^6\) and \(S_s=0\). For a bus or a multicore machine, \(F_{bus}=\frac{2}{7}\cdot 10\approx 2.85\).
The speedup factors for this version are:
- \(\mathit{SP}^o(R)=10\), which is indeed greater than P/2=5
- \(\mathit{seq}(R)=0\), which is indeed less than 0.2
- \(\mathit{grain}(R)=W/N=10^5\), which is above 5, as it should be
- \(\mathit{glob}(R)= \frac{2\cdot10^{6}\cdot2.85}{10^{6}}=5.7\), which is higher than the required 2
The speedup equation is
The program in Fig. 5.9 is clearly inefficient, because instead of a speedup between 5 and 10, it achieves a speedup of 2. The main limiting factor is the relatively large number of global memory references.
This problem can be corrected by using the mapped version of the light parfor, as depicted in Fig. 5.10. In this case \(D=n/P+P\approx 10^5\), the work is still \(W\approx 10^6\), \(S_g=50\), \(S_l\approx 7\cdot 10^6\) and \(S_s=0\). For such small values of \(S_g\) compared to \(S_l\), the speedup is bounded by \(\mathit{SP}^o=10\), which is optimal. It is evident that using more processors (P=100) will not affect this result.
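One plausible reading of this optimization (our sketch; the actual Fig. 5.10 code is not reproduced, and lparfor and faa() are used here with assumed syntax) is to let each mapped thread accumulate in the local memory of its core and touch shared memory only once:

int sum = 0;                      /* shared (global) variable                */
lparfor (int t = 0; t < P; t++) { /* mapped light parfor: thread t on core t */
    int local = 0;                /* resides in the core's local memory      */
    for (int i = t * (n / P); i < (t + 1) * (n / P); i++)
        local += B[i];
    faa(&sum, local);             /* one global access per thread            */
}

With P threads, the program performs O(P) rather than O(n) global accesses, which is why \(S_g\) drops to about 50.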
Note that thread management (i.e., access to the ready queue) should not be counted as a source for global memory references. Local variables should be allocated in the local memory of the processor that executes that thread wherein they are defined. As indicated earlier, this implies that a thread should always reside in the processor that has started to run it. Thus, ParC requires that threads should not migrate. Otherwise, the notion of a local variable as defined here has no meaning. Since threads cannot leave the processor, and one still needs to access the ready queue, a logical solution is to maintain a local queue of threads in every processor (see Fig. 5.11). Activities are created in the global queue, but when they are picked by a processor, they stay in the local queue of that processor.
To conclude, there are three speedup notions that are useful to a user developing a parallel application: SP(A), the speedup of the algorithm, which uses the best sequential solution as a reference point; \(\mathit{SP}^o(R)\), the maximum speedup of the program compared to a sequential execution of the program itself; and \(\mathit{SP}^e(R)\), the effective speedup that takes hardware aspects into account. The other speedup factors can be used to check the efficiency of a parallel program. For a given program, the user should attempt to improve each factor to the maximum. The factors also indicate which barrier limits performance the most, and hence where it is best to invest effort.
5.6 The Effect of Scheduling on T(R)
The time equation of the VPM model is actually a lower bound, in that it states the minimal execution time possible for all possible execution orders. It still remains to bound the execution time from above. Clearly, it is pointless to improve (reduce) the execution time of a program from below when the upper bound remains high. In other words, improving the lower bound of the time execution of all possible execution orders of a program does not exclude the possibility that some execution orders will still remain at the same execution time. Hence, in this section we will develop an upper bound for the execution time such that improving the speedup factors will improve both the lower and the upper bounds of the execution time.
Note that T(R) smooths out delicate factors such as the effect of memory access patterns on bus contention and the overhead of cache misses in the MESI algorithm. It also ignores the fact that the overhead C contains global access to the ready queue and may depend on variable synchronization costs. Similarly, we ignore the fact that the time needed to create n threads in a queue might be a function of n, where n is the number of threads in PF i=1…n [R] or PB [R1| … Rn] statements. For example, the operating system can use a representative record, indicating that n threads need to be spawned. Thus, the creation of the representative is fast, but an inherent delay is caused when several processors attempt to extract threads from the same representative. An upper bound for the execution time can be developed only if the effect of at least one of these factors is bounded from above. In this section, the effect of possible schedulings (different mappings of threads to processors) is used to bound the execution time from above.
An important aspect ignored by the VPM model is the change in time caused by different possible orders in which the system can assign processors to threads. For example, consider the program in Fig. 5.12, executed on a machine with 2 processors. Since the machine has only two processors, one thread will be executed alone after the first two have been terminated. One order of execution is to start with A and B, and when B has terminated, use its processor to execute C, yielding T=100. Another order of execution is to start with C and B, and when B or C has terminated, use one processor to execute A, yielding T=150. Thus, different orders of execution may lead to different time calculations.
The delays caused by poor scheduling may be improved if context switches are used by the operating system. The following discussion describes some of the properties of preemption or context switches.
- The system halts the execution of a thread T1 while it is being executed by a processor P1. P1 records T1's current state (program counter, stack-pointer and registers) and saves it in the ready queue. Another thread T2 that is not currently being executed by a processor is selected from the ready queue by P1. Finally, T2's state (program counter, stack-pointer and registers) is restored in P1 and T2's execution is resumed.
- Context switches are initiated by one of the following events:
  - A time interrupt (e.g., every 10 milliseconds).
  - Explicit instructions that have been inserted in T1's code by the compiler or manually inserted by the user.
  - A system call such as read/write operations that suspends the current thread until the operation (read/write) completes.
  - Synchronization operations between threads.
  - Spawning new threads.
  As will be explained later on, ParC favors inserted context switches over time-interrupt context switches.
- Note that preemption switches the processors among the ready threads. Thus, it is essential in guaranteeing the "fair" execution of n>p threads, where p is the number of processors. Fairness can be defined as simulating the execution of n threads by p<n processors obtaining the same results as if these threads were executed by p=n processors. If preemption is not used, thread T1 may loop forever, executing a "busy-wait" while(flag); and waiting for the reset operation flag=0 of another thread T2. This reset operation is never executed because there is no processor to execute T2.
- Since T1 and T2 use the same set of shared variables whose memory addresses are distinct, there is no need to save or invalidate the cache lines of T1 when switching to T2. Similarly, there is no need to restore the cache lines used by T2 when its execution is resumed. Hence, context switch operations need not involve the cache. However, context switches increase the probability that the cache lines used by T1 will be evicted from the local cache by T2's load/store operations.
- Context switches are expensive and may require about 100 clock cycles to complete. They need this amount of time because they include the operations of: (1) saving the state/context of a thread, (2) allocating a new/empty space in the ready queue, (3) saving the state of T1 in the ready queue, and (4) restoring T2 and switching the stack pointer between the two threads.
- Special care must be given to interrupts that are received while a context switch is being executed, because interrupts may be received on T1's stack but handled on T2's stack.
- Context switches are usually combined with a first-in-first-out (FIFO) policy of selecting threads from the ready queue. This is an important aspect because it ensures some sort of fairness and prevents the starvation of threads. ParC allows the user to change the FIFO policy and replace it with random selection, last-in-first-out, or any other selection rule.
We will now consider how scheduling and preemption interact. Consider the program in Fig. 5.12, executed with a quantum time of q=20 statements between every context switch. The operating system also uses a round robin (FIFO) policy for a fair selection of threads from the queue. Let \(A = a_1; a_2; a_3; a_4; a_5\) be a division of the first thread /∗A∗/ into a sequence of five units, where each unit corresponds to 20 instructions of the loop. Similarly, let \(B = b_1=20; b_2=20; b_3=10\) and \(C = c_1=20; c_2=20; c_3=10\) be divisions of threads /∗B∗/ and /∗C∗/ into the corresponding units. Then (for the program in Fig. 5.12) any fair scheduling will achieve an execution time of 120 instructions (see Table 5.1). This scheduling improves the previous scheduling of 150 instructions, but it is still longer than the optimal scheduling of 100 instructions. Note that this scheduling satisfies the first-in-first-out fairness policy, and processors select the next thread from the ready queue based on this criterion.
If the arbitrary selection of threads is allowed, unfair schedulings can result. Here, both the optimal scheduling of 100 instructions and the poor scheduling of 150 instructions are possible. These schedulings are demonstrated in Table 5.2 and Table 5.3. Note that the scheduling in Table 5.2 does not preserve the FIFO policy, nor does the optimal scheduling of Table 5.3.
Our model should hide this from the programmer because we do not expect that the programmer will calculate all of the possible orders of execution. Hence, we can no longer use an estimation but rather give upper and lower bounds to the execution time of a program. The execution time should be bounded below by the best execution order, and from above by the worst execution order.
Let a scheduling of a parallel program R be an assignment of time and processor values to every instruction executed by this program. Finding the optimal scheduling (i.e., the scheduling with the minimal execution time) with a fixed number of processors is an NP-complete problem. However, using a known result, an approximation scheduling can be defined such that T(R) is less than twice the optimal time possible:
Claim 5.1
Let T be a set of n threads executed by P<n processors such that the sequential execution time of each thread is arbitrarily determined by an adversary when that thread starts. The execution time of each thread is thus fixed and does not depend on other external events such as a busy-wait. In addition, a thread ends either by termination or when it spawns new threads. Assume that the scheduling used by the P processors is to place all threads in a ready queue, letting each processor that completed a thread immediately select another thread from the ready queue. For any choice of the execution times by the adversary, the scheduling time obtained by the non-idle-processor scheduling is at most twice as bad as the best possible scheduling of these threads.
Proof of this claim trivially follows from a stronger result that we will prove next.
Since the execution model of ParC explicitly demands that processors are never idle, it actually implements the above rule of scheduling and thus obtains at most twice the optimal execution time possible. Hence, T(R) can be estimated as follows:
$$T_{opt}(R) \;\leq\; T(R) \;\leq\; 2\cdot T_{opt}(R) , $$
where \(T_{opt}(R)\) is the execution time of the best possible scheduling.
This estimation can be improved by a closer examination of the effect of scheduling of a parallel program. Let \(S_i\), i=1…P, denote the number of instructions (including overheads) executed "with i processors" and let \(S= \sum_{i=1}^{P} S_{i}\) be the total number of instructions. An instruction is executed with i processors if there were i processors working at the time that this instruction was executed. Since in our model a processor will take work from the queue when it runs out of work, an instruction is executed with i<P processors only if the queue is empty and the rest of the processors are idle. In order to avoid referring to the execution order, we can say that the execution time T is equal to:
$$T = \sum_{i=1}^{P} \frac{S_{i}}{i} , $$
since the \(S_i\) instructions that are executed with i working processors contribute \(\frac{S_i}{i}\) time steps.
Let \(S_{M} = \max_{i=1}^{P-1} S_{i}\); then we can bound T as follows:
$$T \;=\; \frac{S_{P}}{P} + \sum_{i=1}^{P-1}\frac{S_{i}}{i} \;\leq\; \frac{S}{P} + S_{M}\cdot(\log P + 1) . $$
The \(\log P\) is obtained by a recursive process wherein we bound the sum of the fractions block by block, halving the range each time; each of the \(\log P\) dyadic blocks is bounded by \(S_M\):
$$\sum_{i=2^{k}}^{2^{k+1}-1} \frac{S_{i}}{i} \;\leq\; S_{M}\cdot\sum_{i=2^{k}}^{2^{k+1}-1}\frac{1}{i} \;\leq\; S_{M}\cdot 2^{k}\cdot\frac{1}{2^{k}} \;=\; S_{M} . $$
This method, however, does not solve the problem, because in several cases computing \(S_M\) requires that we compute all \(S_i\). Since the addition to \(\frac{S}{P}\) results from instructions that were executed with i<P processors, we might try to bound their direct effect on T rather than using the total sum of these instructions. Basically, we want to bound T by \(T < \frac{S}{P} + T^{\inf}\), where \(T^{\inf}\) is the time needed by a machine with an unbounded number of processors (actually \(T^{\inf}\) is equal to the longest path D from the previous section). The proof and the definitions of the above claim use the concept of the "execution graph" of a parallel program.
The execution graph G(R) of a program (defined earlier but presented here in a more convenient form) is defined by composing the graphs of the threads spawned by R to form G(R). Note that the execution graph can be determined if the programmer knows the exact number of threads spawned at any point and the number of iterations of every loop. Usually, these sizes are a function of the input, and may be known by the programmer in advance. Hence, the programmer can compute G(R) as a function of N, the input size. More formally and graphically, the execution graph is defined by structural induction on R, where '|c' denotes an overhead of 'c' instructions at every spawn.
Note that G is a DAG beginning and ending with one node. G also satisfies the rule that any "split" (PF or PB) must eventually be joined to one node. After establishing the notion of G(R), we can state the necessary definitions for the upper bound of T. For a given program R, an instance of its execution graph G(R) (now referred to as G) and a parallel machine with P processors, we use the following definitions (a small simulator of the deletion process defined below follows the list):
- S: the total number of nodes in G, i.e., the instructions executed.
- Deletion: a node can be removed from G if its in-degree is zero, meaning that all of its fathers have been removed. Note that the only case in which a node has more than one father is the node following G(PF) or G(PB). Every node that is removed corresponds to an instruction being executed by some processor.
- Candidate-Group: all of the nodes that can be removed. Initially, the candidate group contains only the initial node of G.
- Full-Step: remove P nodes from the candidate group. Such a step corresponds to a step in the parallel machine where each processor executes one instruction. One cannot remove a node that became a candidate during this step.
- Bounded-Step: the candidate group may contain fewer than P nodes (as in the case of a sequential program). In that case, some processors are idle and have no work to do.
- Execution Order: by removing the upper node of G(PF) or G(PB), many new nodes can join the candidate group. The order in which the nodes are removed by full or bounded steps until G becomes empty determines an execution order.
- T: the number of steps needed to delete all nodes in G, or the length of an execution order.
- \(G_t\): for a given execution order, \(G_t\) is the remaining graph that is left after the first t bounded steps.
- \(\mathit{MLP}(G_t)\): one of the longest paths in \(G_t\). Note that \(\mathit{MLP}(G_t)\) starts from a node in the candidate group and ends at the final node of G.
- \(T^{\inf}\): the length of \(\mathit{MLP}(G_0=G)\). If \(P=\infty\), then there is an execution order of size \(T^{\inf}\). Hence, \(T^{\inf}\) is the fastest execution time for a given program and its execution graph.
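The deletion process above is easy to simulate; the following self-contained C sketch (ours) performs greedy full/bounded steps on a small DAG and reports T, illustrating the bound of Theorem 5.1 below:

#include <stdio.h>
#define MAXN 1024

int indeg[MAXN];                 /* remaining fathers of each node      */
int adj[MAXN][4], deg[MAXN];     /* children lists (small fan-out here) */

int schedule(int S, int P) {     /* returns T, the number of steps      */
    int removed = 0, T = 0;
    while (removed < S) {
        int cand[MAXN], nc = 0;  /* candidate group: in-degree zero     */
        for (int v = 0; v < S; v++)
            if (indeg[v] == 0) cand[nc++] = v;
        int k = nc < P ? nc : P; /* full step removes P, bounded fewer  */
        for (int j = 0; j < k; j++) {
            int v = cand[j];
            indeg[v] = -1;                    /* mark node as removed   */
            for (int e = 0; e < deg[v]; e++)
                indeg[adj[v][e]]--;           /* release its children   */
            removed++;
        }
        T++;                                  /* one parallel time step */
    }
    return T;
}

int main(void) {                 /* diamond DAG: 0 -> {1,2} -> 3        */
    deg[0] = 2; adj[0][0] = 1; adj[0][1] = 2;
    deg[1] = 1; adj[1][0] = 3;
    deg[2] = 1; adj[2][0] = 3;
    indeg[1] = indeg[2] = 1; indeg[3] = 2;
    printf("T = %d\n", schedule(4, 2));       /* prints T = 3           */
    return 0;
}

Note that nodes whose in-degree reaches zero during a step become candidates only in the next step, as required by the Full-Step rule: the candidate group is recomputed before each step.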
Theorem 5.1
For a given program R and its execution graph, the time (or the number of steps) needed by any execution order with P processors to execute R is bounded by:
$$\max\biggl(\frac{S}{P},\, T^{\inf}\biggr) \;\leq\; T \;\leq\; \frac{S}{P} + T^{\inf} . $$
Clearly, any execution order needs at least \(\frac{S}{P}\) steps to remove all nodes in G; however, bounded steps may prolong this process. Equally clearly, no program can be executed faster than \(T^{\inf}\), because any node in \(\mathit{MLP}(G)\) can be removed only in the step after its father has been removed.
Claim 5.2
Let \(M^{ex_{i}}\) and \(R^{ex_{i}}\) denote the number of full and bounded steps executed by an execution order \(ex_i\). Then, for any execution order \(ex_i\), \(M^{ex_{i}} \leq\frac{S}{P}\) and \(R^{ex_{i}} \leq T^{\inf}\).
Intuitively, this follows from the fact that each bounded step must remove one node from every path that belongs to the group of paths with the maximal length in G. Since \(T^{ex_{i}} = M^{ex_{i}}+R^{ex_{i}}\), Claim 5.2 yields Theorem 5.1. The first part of Claim 5.2 is thus trivial, but for the second part of the claim we use the following lemma:
Lemma 5.1
The length of \(\mathit{MLP}(G_t)\) is less than or equal to \(T^{\inf}-t\).
This lemma shows that the length of \(\mathit{MLP}(G_{T^{\inf}})\) is zero; hence, in the next step, all remaining nodes are removed, and the above claim follows. The proof of Lemma 5.1 results from the fact that there are fewer than P maximal-length paths \(\mathit{MLP}(G_t)\) in the current graph; hence, the first node of each of them is removed at the t'th bounded step.
This claim is tight, in the sense that there is a G(R) that cannot be scheduled with fewer than \(\frac{S}{P}+T^{\mathrm{inf}}\) steps. A simple improvement of the upper bound for T is also tight. Let the new bound be:
$$T \;\leq\; \biggl\lfloor \frac{S - S_s}{P} \biggr\rfloor + T^{\inf} , $$
where \(S_s\) is the number of nodes in G(R) that are not in the scope of any PF or PB. Clearly, all of the nodes in \(S_s\) belong to the longest path in G(R) and therefore should not be counted twice (i.e., in \(\frac{S}{P}\)). The ⌊…⌋ can be justified as follows: if the gap is one, then there is at least one bounded step that was counted both in the \(\frac{S-S_{s}}{P}\) term and in \(T^{\inf}\); this double counting is avoided when ⌊…⌋ is used.
For example, let G(R) be a graph with \(S-S_s=5-2=3\) and \(T^{\inf}=3\); then for P=2 the upper bound yields \(T\leq\lfloor\frac{3}{2}\rfloor+3=4\). Clearly, G(R) cannot be scheduled with fewer than 4 steps using two processors.
In the above formalism, processors were able to switch from one thread to another at the granularity of a single instruction. This is not practical. Common solutions allow context switching every pre-defined quantum of time.
Corollary 5.1
Let q be the quantum time for context switches, and c the overhead of a context switch. If we insert the overhead c in G every q nodes along any path from the root of G, then the new size of G is increased by a factor of at most \(( 1+ \frac{c}{q} )\), and so is the length of every path in G. The execution time for a given program and its graph G is bounded (regardless of any execution order) by:
$$T \;\leq\; \biggl(1+\frac{c}{q}\biggr)\cdot\biggl(\frac{S}{P} + T^{\inf}\biggr) . $$
The effect of this kind of context switch can be included by simply adding the overhead of the context switch as nodes in the graph. The nodes of the context switches should be added at a distance proportional to the quantum time.
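For instance (illustrative numbers of ours, not the text's): with a context-switch overhead of c=10 cycles and a quantum of q=100 instructions, G grows by at most a factor of \(1+\frac{10}{100}=1.1\), so the bound becomes \(T \leq 1.1\cdot(\frac{S}{P}+T^{\inf})\), i.e. a 10% penalty for preemption.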
Hence, the effect of scheduling and preemption (context switches) on T(R) can be stated as follows:
Theorem 5.2
Let q be the scheduling time quantum, and c′ the overhead of a single preemption. The execution time T(R) is bounded (regardless of any execution order) by:
$$T(R) \;\leq\; \biggl(1+\frac{c'}{q}\biggr)\cdot\biggl(\frac{S}{P} + T^{\inf}\biggr) . $$
Practically, this implies that the effect of non-optimal scheduling (with a reasonable choice of q) can only double the execution time.
5.7 Accounting for Memory References for Multicore Machines
In this section we consider how to expand the time formula of the previous sections to account for the bus transactions made in multicore machines. A closer look at the structure of the virtual machine in Fig. 5.1 shows that a processor has direct access to its local memory, as opposed to access through a network to the global memory (as noted in Sect. 5.3; in addition, parts of the global memory may be local to other processors). The time formula
$$T(R) = D(R) + \frac{S_{s}\cdot(P-1) + N\cdot C + W + S_{g}\cdot F_{bus}}{P} $$
accounts only for multiple accesses to the bus caused by non-local memory references. For multicore machines this is unsatisfactory because it does not account for the overhead generated by the MESI protocol (as described earlier in Sect. 4.5). In this section we attempt to extend the above time formula to account for the MESI's bus transactions.
Consider the following two programs executed over MESI. Which one is better? Which one minimizes the number of bus transactions generated by MESI? Assuming that x and y are allocated to two different cache lines, it is likely that the mixed case will generate more bus transactions than the homogeneous case. In the mixed case, after a true synchronous execution of both threads there will be two transitions of x/y between the threads, while in the homogeneous case there will be only one transition of x/y.
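The two program fragments were lost in extraction; a plausible minimal pair consistent with the discussion (our reconstruction, using the parblock syntax of Sect. 5.3) is:

/* mixed case: the two threads touch x and y in opposite orders */
parblock
    { x = x + 1; y = y + 1; }
:
    { y = y + 1; x = x + 1; }
epar

/* homogeneous case: both threads touch x first and then y */
parblock
    { x = x + 1; y = y + 1; }
:
    { x = x + 1; y = y + 1; }
epar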
The proposed model is called the transitions model. It counts the number of “transitions” of a shared variable between different threads. Each transition is a possible bus transaction (BusRdX or BusRd) of the MESI protocol where a cache line moves from one cache to another cache. The following assumptions are used in deriving the proposed model:
- It is hard to track transactions based on the CPU location of shared variables. We approximate them by considering the transactions between threads. By doing so, we assume that any two threads that are updating a shared variable are not executed by the same CPU, and hence will lead to some cache misses and a MESI bus transaction.
- The way that variables are allocated to the cache lines may affect the resulting number of transactions. For example, in the above program, if both x, y are allocated to the same cache line, then the number of bus transactions may increase. In order to avoid tracking the mapping of variables to cache lines, we assume that variables that are likely to share a cache line are given the same name (see the procedure below).
- All data dependencies must be exposed. For example, we should know whether or not ∗p and ∗q may point to the same variable in order to determine if the execution of ∗p followed by ∗q will result in a bus transaction. Given that tracking data dependencies through pointers is difficult, we will assume that concurrent access through pointers is not likely to occur and that data dependencies are due to shared variables and array references only.
- Different schedulings can affect the number of bus transactions that are generated.
The transactions are computed syntactically using the following procedure:
- Each PF i=l…n [\(S_i\)] is replaced by PB [\(S_i | S_{i+1} | \ldots | S_{i+k}\)], where k is a constant sufficiently large to expose array backward dependencies. An array backward dependency is formed when an array element used in iteration i was updated in a previous iteration, e.g., a recurrence of the form t1=A[i−1]; … A[i]+=t1;. Formally, let d be the maximal loop-carried dependency of a for-loop; then \(k=d^2\). The number of arrows that this PF contributes is \(\mathit{arrow}(\mathrm{PB}[S_i|S_{i+1}|\ldots|S_{i+k}])\cdot(n-l)/k\).
- Sequential for-loops for(i=l;i<n;i++) \(S_i\) are similarly transformed into block statements \(\{S_i ; S_{i+1}; \ldots ; S_{i+k}\}\).
- If two variables are likely to be in the same cache line, they are replaced with the same name.
- The execution graph G is constructed for the resulting program.
- Arrows are placed following some topological order of visiting the nodes in G.
5.8 On the Usage of a Simulator
Here we briefly discuss the use of simulations to optimize the execution of parallel programs. In this respect, optimizing speedups depends on isolating and understanding the effect of several parameters on the execution times as measured by a simulator for a given program. These parameters are the input size, the number of cores, the context-switch overhead, the context switch quantum time and the scheduling policy. By the term simulation of a parallel program we refer to a sequential execution of an instrumented version of the program, such that after each step of the program (execution of an assignment, evaluation of an expression or memory reference) control is transferred to a routine that accumulates different statistics. Controlled execution of an instrumented version allows the simulator to do the following operations:
- The simulator can control or determine which instruction will be executed next; thus it has full control over the execution order of the program's code. The simulator can therefore execute the scheduling model of Fig. 1.16, computing the theoretical execution time for the case of an infinite number of processors, and it can also compute the execution time for a fixed number of processors P by simulating the virtual machine model of Fig. 5.1, computing the execution times of processors/cores and other relevant statistics.
- In particular, since the simulator actually executes the program by selecting a possible scheduling of its threads, it can also measure the effect of executing the context switch operations that occurred during the selected scheduling. Thus it can compute execution times for different values of the context-switch overhead C and the time duration q between context-switch operations. This distinguishes the simulator considered here, which works at the level of program steps, from a cycle-accurate hardware simulator.
- The simulator can also simulate the MESI algorithm and measure the overhead of the resulting cache misses and their effect on the resulting scheduling and the execution time.
- We can use the ability of the simulator to measure execution times with different values of C, P, q, n (n is the input size) to evaluate how sensitive a given program is to the execution order. Following the approach in Ben-Asher and Haber (1996), we say that a program is "practically optimal" if, for a reasonable number of cores P and a reasonable context switch overhead C, there is a minimal input size \(n_0\) and a choice of a context switch quantum time q, such that for every \(n>n_0\) the gap between the measured execution time and the ideal execution time \(T^{\inf}\) is relatively small and stays fixed. By "small" we mean proportional to the ratio between the optimal number of cores (the maximal width of the execution graph) and P, the actual number of cores used in the simulations.
Consider the PRAM algorithm (following code) for computing the connected components of sparse graphs described in Jaja (1992), p. 213. This algorithm uses the fact that the graph is sparse (i.e., the number of edges is \(m < O(\frac{n^{2}}{\log n})\), where n is the number of nodes in the graph) to compute the connected components in \(O(\log n)\) parallel steps, compared to the \(\Omega(\log^{2} n)\) needed by a transitive-closure type of algorithm.
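The book’s code figure is not reproduced in this extract; as a stand-in, the following is a minimal sketch in plain C of the hooking and pointer-jumping steps that such a connectivity algorithm is built from. The names D, u, w, and connectivity_round are our own illustrative choices, not Jaja’s or the book’s, and the parallel loops are only noted in comments.

/* One round of hooking followed by pointer jumping for undirected
 * connectivity. D[v] is the current parent of v, initially D[v] = v;
 * the graph is an edge list of m pairs (u[i], w[i]). */
void connectivity_round(int D[], const int u[], const int w[],
                        int m, int n) {
    /* Hooking: attach the root of the larger-labeled tree to the
     * smaller label. In ParC this loop would be a parallel for (PF). */
    for (int i = 0; i < m; i++) {
        int a = D[u[i]], b = D[w[i]];
        if (a < b && D[b] == b) D[b] = a;
        else if (b < a && D[a] == a) D[a] = b;
    }
    /* Pointer jumping: roughly halves the depth of every tree,
     * which is the source of the O(log n) bound. Also a PF in ParC. */
    for (int v = 0; v < n; v++)
        D[v] = D[D[v]];
}

Repeating connectivity_round about \(O(\log n)\) times on chain-like inputs leaves each D[v] pointing at its component’s representative; the full algorithm in Jaja (1992) adds termination detection and the sparsity-based edge pruning that we omit here.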
The program was executed by the simulator on a graph of random chains with the following parameters: n, the size of the graph (which actually matches the width of the execution graph), and P, the number of cores. For each n and P there were two sets of experiments: in the first we measured \(T_{\infty}\) as a function of q (the context-switch quantum), and in the second we measured \(T_{P}\). Our results (Fig. 5.13) show that for n=80 and P=30, \(T_{\infty}\leq T_{P}\); however, for P=50 we get that \(T_{\infty}>T_{P}\). When we increased n (the width of the graph) to 200, we observed the same phenomenon, except that the change happens between P=50 and P=100.
Note that the execution times of this program can vary according to the actual scheduling that took place. This explains the minimum points of some curves (around \(\mathcal{Q}=80\) and \(\mathcal{Q}=100\)). Hence, based on the simulation results, a proper choice of \(\mathcal{Q}\) may yield a better scheduling of a given program. As for the gap between the effective time \(T_{P}\) and the ideal time \(T_{\infty}\), we see that there are optimal combinations of \(\mathcal{Q}\) and P where this gap is minimal. Thus, if P is fixed (the actual number of available cores), there are cases where we can close this gap by a proper choice of \(\mathcal{Q}\); otherwise (as is the case for n=200 and P=50) we cannot close this gap and should attempt to modify the program.
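As promised above, the following is a schematic sketch of what the instrumented execution might look like; the hook name sim_step, the statistics structure, and the field names are all our own assumptions, not the simulator of Ben-Asher and Haber (1996).

#include <stdbool.h>

/* Each step of the instrumented program calls sim_step(), which
 * advances the simulated clock, counts memory references, and pays
 * the context-switch overhead C whenever the quantum q expires. */
typedef struct {
    long clock;         /* simulated time                      */
    long local_refs;    /* local memory references             */
    long global_refs;   /* global (shared) memory references   */
    long switches;      /* context switches performed          */
    int  C;             /* context-switch overhead             */
    int  q;             /* quantum: steps between switches     */
    int  since_switch;  /* steps since the last switch         */
} sim_state;

void sim_step(sim_state *s, bool global_ref) {
    s->clock++;
    if (global_ref) s->global_refs++; else s->local_refs++;
    if (++s->since_switch >= s->q) {
        s->clock += s->C;   /* pay the switch overhead */
        s->switches++;
        s->since_switch = 0;
        /* here the scheduler would pick the next ready thread */
    }
}

Re-running the instrumented program while sweeping q, C, P, and n then produces curves such as those of Fig. 5.13.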
5.9 Conclusions
Efficient execution of parallel shared memory programs (such as those written in ParC) requires optimized scheduling, context switching, allocation of threads to processors, and load balancing. An important factor in efficiency is the ratio between local and global memory references. When this ratio is too small relative to the underlying network bandwidth, global memory references collide and cause delays. This ratio is particularly important when the network bandwidth is small, as in bus-based machines where only one memory reference can be carried out per step.
A simple model that bounds the execution of a parallel program on a shared memory machine has been presented. The model incorporates the overheads of creating threads, context switching, and scheduling, together with the local/global ratio of memory references, into a basic speedup formula. The formula predicts or measures the expected efficiency of executing a given program. Using a simple calculation, the user can determine how well his or her program exploits the hardware.
If the prediction is negative, the user can analyze the program R and determine the main cause of its lack of efficiency. Possible causes include one (or more) of the following factors (a schematic combined form is sketched after the list):
-
The optimal speedup factor (\(\mathit{SP}_{o}\)) bounds the amount of parallelism possible for this program, regardless of any hardware limitations.
-
The length of inherent sequential code segments (seq(R)).
-
The average size of the threads (grain(R)). If the average thread size is too small, the operating system’s overhead of creating and scheduling the threads can dominate the execution time.
-
The ratio between global and local memory references (glob(R)). If this ratio is greater than the bandwidth of the communication network, communication latency will slow down the execution. We believe that determining this ratio is also useful for multicore machines, where shared memory is implemented via a cache-coherency protocol.
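As a schematic illustration only (our own hedged rendering, not the chapter’s exact formula), the factors above can be thought of as independent caps on the achievable speedup:

$$\mathit{SP}(R)\ \lesssim\ \min\biggl(\mathit{SP}_{o},\ \frac{T_{1}}{\mathit{seq}(R)},\ \frac{P}{1+C/\mathit{grain}(R)},\ P\cdot\frac{B}{\mathit{glob}(R)}\biggr), $$

where \(T_{1}\) is the sequential time, C the thread-management overhead, and B the network bandwidth; whichever term is smallest identifies the factor to attack first.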
Given that the user cannot change the hardware, he or she has to modify the program so that it achieves better performance. Each of the above factors is matched with a transformation that reduces its effect whenever possible. This methodology can lead to efficient versions of parallel programs.
5.10 Exercises
5.10.1 Optimizing Speedup Factors
Consider the following program with P>5 and C=10:
-
1.
Compute the speedup and the speedup factors for this program (as functions of n and P).
-
2.
What are the limiting factors for this program?
-
3.
What is the optimal number of processors (as a function of n) for this program? (Check the condition for D.)
-
4.
What is the maximal value for P?
-
5.
Describe potential improvements for this program.
-
6.
What observations can be made regarding the use of faa()? Is it possible to use faa() to reduce the number of global references?
-
7.
Provide a new, final, optimized version of this program.
-
8.
Compute the new speedup and its factors and evaluate the results.
-
9.
What types of parallel programs inherently have efficiencies equal to one?
-
10.
For the matrix multiplication program in Fig. 3.9, find the minimal N for which the parallel execution becomes slower than a sequential one.
5.10.2 Amdahl’s Law
Amdahl’s Law states that the speedup of a program (be it a sequential or parallel one) is always restricted by the fraction of its inherent sequential part. Determine which of the following claims are incorrect. For each incorrect claim, write a small parallel program showing that the claim is really false.
-
1.
Let β (a rational number) denote the fraction of the inherent sequential part of a program. Thus, 1−β is the part of the program that is or may be subject to parallelism. Clearly, β characterizes every possible program.
-
2.
Let T(i) denote the execution time of a program executed by a parallel machine with i processors. The speedup of a program executed with n processors is \(\mathit{SP}=\frac{T(1)}{T(n)}\).
-
3.
The inherent sequential part of the program cannot be parallelized. Hence, \(T(n)= T(1)\cdot\beta+\frac{T(1)(1-\beta)}{n}\).
-
4.
The speedup for \(n>\frac{1}{\beta}\) is always
$$\mathit{SP}=\frac{1}{\beta+\frac{(1-\beta)}{n}} \approx \frac{1}{2\beta} $$

Thus, no matter how many processors we use, the speedup of a program will always be less than \(\frac{1}{2\beta}\). Given that every program must have some substantial fixed fraction of inherent sequential code, there is no point in using large parallel machines.
5.10.3 Scheduling
A parallel machine with three processors that does not perform context switches is provided. The machine executes the following “unbalanced” program:
Let the running time of a program be the maximal number of instructions executed by one of the three processors.
-
1.
Provide a mapping of the threads to processors (schedule) that achieves minimal running time, and justify its optimality.
-
2.
Provide a mapping of the threads to processors that achieves maximal running time.
-
3.
Is there a possibility that, for some program and a parallel machine, there is a mapping/schedule whose execution time is less than the average? (Here the average is the sum of all threads’ times divided by P.)
-
4.
Which of the two features (context switch or thread migration) is needed to guarantee optimal running time? (Give examples justifying your claims).
-
5.
Is there a possibility that, for some program and our standard machine model, there is a mapping/schedule where the gap between the most loaded processor and the least loaded processor is greater than the longest execution time of a thread?
-
6.
In general, can the difference between the minimal and maximal running time, over all possible schedulings, exceed the execution time of the longest thread? Discuss this issue for machines with different scheduling policies (i.e., context switch and migration).
5.10.4 Two-Processor Machines
A parallel machine with two processors that does not use migration (i.e., the threads cannot move from processor to processor during execution) is provided. The machine executes the following program:
-
1.
Draw the execution graph and number each thread, leaving some space between the instructions in the execution graph. Loops should be inserted as a sequence of instructions, as follows:
-
2.
What is the final value of g?
-
3.
What will happen if the assignment y=y is replaced by x=y?
-
4.
Find a scheduling for the execution graph by adding a time/processor index to every instruction in the execution graph (recall that threads are not supposed to migrate). What is the execution time for this program?
-
5.
Add a thread index i to all of the variables (\(x_{i}\)) in the execution graph (including parameters). Map the variables to processors such that \(x_{i}\) will be mapped to a different processor than \(y_{i}\). Mark accesses to variables in the execution graph as local or external. What is the ratio between local and external accesses in this program?
-
6.
Let E denote the maximal number of simultaneous external accesses that a parallel machine allows (e.g., 1 for a bus machine). What is the maximal value of E that this program needs in order to minimize the overhead due to multiple accesses? For a given program execution, let L/G be the maximal number of local/external memory accesses that a processor executes. What is the optimal ratio between L, G and P, such that the program execution is not necessarily slowed down? (Check the case where P=3,L=2,G=3).
5.10.5 Sync-Based Exercises
A parallel machine that performs neither context switches nor thread migration is provided. The machine executes the following sync-based program:
-
1.
Extend the scheduling rules to include sync instructions.
-
2.
Draw the execution graph of the program for N=8; use a node [50] to mark a loop of 50 instructions. Give the same index to every group of sync nodes that are executed together.
-
3.
What is the minimal and maximal value for g?
-
4.
Use the scheduling rules to mark the nodes of the execution graph for the case P=3; what is the execution time?
-
5.
Use the scheduling rules to mark the nodes of the execution graph for the case P=1; what is the execution time?
-
6.
What will be changed in the conclusions obtained so far if we replace 100 with 4, and 50 with 2 in the program?
-
7.
What is the minimal running time for the case P=2? (Justify your answer.)
-
8.
Compute T(R); note that D is affected by the execution of the sync.
5.10.6 The Effect of Sync on the Execution Time
The following program uses address passing in order to communicate between recursive parallel calls:
-
1.
Draw the execution graph G(f) for the execution of f(&t,4) (leave enough space for adding marks and instructions). Add g and t at the beginning of G(f) to indicate their declaration. Also, add to G(f) the variables (\(x_{i}\), \(z'\)) and \(y_{i}\) whenever a call to f(x,z) took place, where i is the thread index and \(z'\) is the value of z at the call.
-
2.
Use an arrow to show to which variable \(x_{i}\) points. Add all sync instructions to G(f), followed by \(*x_{i}=y_{i}+2;\) where needed. Use a broken line to join syncs that are synchronized together.
-
3.
Next to each \(y_{j}\), add its value after the execution of the \(*x_{i}=y_{i}+2;\) that modified it. Inside every loop in G(f), mark the number of iterations it executed. What are the values of g and t at the end? Explain why there is no need to replace g=g+1 by faa(&g,1) in order to avoid the overwriting effects caused by the fact that g=g+1 is not atomic.
-
4.
For a general execution of f(&t,n); with P=3 and C=50, compute all of the parameters \(N, D, W, S_{s}, S_{g}, S_{l}, F_{\mathit{bus}}\), and the speedup factors. Evaluate the results and give a short explanation of the connection between f()’s execution structure and the results.
-
5.
The following rules can be adopted as a policy by a parallel operating system:
-
(a)
An idle processor should take threads from the queue whenever possible.
-
(b)
The thread queue is managed as a FIFO queue.
-
(c)
The thread queue is managed as a LIFO queue.
-
(d)
Context switching (preemption).
-
(e)
Activity migration.
-
(f)
Local variables of a thread are mapped in the local memory of the processor that started to execute this thread.
-
(g)
The first son of a thread is mapped to the same processor that started to execute this thread.
-
(h)
The last son of a thread is mapped to the same processor that started to execute this thread.
-
(i)
A partial group of sons (greater than one) of a thread is mapped to the same processor that started to execute this thread.
Which combination of the above rules (or their negations) is best suited to execute f(&t,n)? (Explain why.) Which combination of the above rules (or their negations) is worst suited to execute f(&t,n)? (Explain why.) Which of the above rules contradict one another, in the sense that any one of them alone might speed up execution, but using both together will neutralize their intended effect?
5.11 Bibliographic Notes
The analytical study of the speedup of parallel programs has been pursued since the early days of parallel processing. Eager et al. (1989) establish relations between speedup and efficiency, showing that there is a bound on how poor both can simultaneously be. Kruskal et al. (1990) developed a theory of parallel algorithms that emphasizes speedup over sequential algorithms and efficiency, in the form of complexity classes. This differs from the method presented here, which focuses on the ability to bound the speedup. A reference to the NP-completeness of finding optimal schedulings for parallel programs can be found in Johnson (1983). Approximating the optimal scheduling can be done in polynomial time, as described in Papadimitriou and Yannakakis (1990). Williams and Bobrowicz (1985) obtained speedup predictions of scientific parallel programs for different numbers of processors using simulation techniques. Eager et al. (1989) showed that:
-
The speedup is bounded by the average parallelism.
-
The speedup and efficiency cannot both be bad (i.e. low efficiency guarantees high speedup).
-
For static allocation of processors, a number of processors that matches the average parallelism is near optimal.
An elementary result regarding the speedup of parallel programs is “Amdahl’s law” (Amdahl 1967), namely that the serial part of a program limits the speedup that may be achieved by parallelism. Kruskal (1985) argues against Amdahl’s law and claims that linear speedup is almost always possible for large enough problems. Karp and Flatt (1990) proposed a new metric measuring the serial fraction, calculated from the number of processors p and the measured speedup s as \(\frac{1/s - 1/p}{1 - 1/p}\). Flatt (1991) showed that scaling up the problem size may improve efficiency and defy Amdahl’s law, but may also increase the total execution time to such an extent that it would not be practical.
Sun and Gustafson (1991) show that the speedup notion is unfair in that it favors slow processors and poorly coded programs, and they propose two new performance metrics that are fairer. Several practical works considered measuring real speedups. For example, Wieland et al. (1992) expose a large difference between measuring speedup relative to the parallel code running on one processor and measuring it relative to an independent optimal sequential program. Sun and Ni (2002) proposed three new models of parallel speedup along with their speedup formulations. Zhang (1991) proposed a performance model to estimate the effect of factors such as sequential code, barriers, cache coherence, and virtual memory paging on the execution time.
Many parallel programming languages use a thread system that is similar to the VPM model described here, e.g., Gehani and Roome (1986), Kuehn and Siegel (1985), and Rose (1987). The use of fetch-and-add instructions is described in Gottlieb et al. (1983).
References
Amdahl, G.M.: Validity of the single processor approach to achieving large scale computer capabilities. In: AFIPS Spring Joint Comput. Conf., vol. 30, pp. 483–485 (1967)
Ben-Asher, Y., Haber, G.: On the usage of simulators to detect inefficiency of parallel programs caused by bad schedulings: the simparc approach. J. Syst. Softw. 33, 313–327 (1996)
Eager, D.L., Zahorjan, J., Lazowska, E.D.: Speedup versus efficiency in parallel systems. IEEE Trans. Comput. 38(3), 408–423 (1989)
Flatt, H.P.: Further results using the overhead model for parallel systems. IBM J. Res. Dev. 35(5/6), 721–726 (1991)
Gehani, N.H., Roome, W.D.: Concurrent C. Softw. Pract. Exp. 16(9), 821–844 (1986)
Gottlieb, A., Lubachevsky, B., Rudolph, L.: Basic techniques for the efficient coordination of very large numbers of cooperating sequential processes. ACM Trans. Program. Lang. Syst. 5(2), 164–189 (1983)
Jaja, J.: An Introduction to Parallel Algorithms. Addison-Wesley, Reading (1992)
Johnson, D.S.: The NP-completeness column: An ongoing guide. J. Algorithms 4(2), 189–203 (1983). (about parallel scheduling)
Karp, A.H., Flatt, H.P.: Measuring parallel processor performance. Commun. ACM 33(5), 539–543 (1990)
Kruskal, C.P.: Performance bound on parallel processors: An optimistic view. In: Broy, M. (ed.) Control Flow and Data Flow: Concepts of Distributed Programming. NATO ASI Series, vol. F-14, pp. 331–344. Springer, Berlin (1985)
Kruskal, C.P., Rudolph, L., Snir, M.: A complexity theory of efficient parallel algorithms. Theor. Comput. Sci. 71(1), 95–132 (1990)
Kuehn, J.T., Siegel, H.J.: Extensions to the C programming language for SIMD/MIMD parallelism. In: Intl. Conf. Parallel Processing, pp. 232–235 (1985)
Papadimitriou, C.H., Yannakakis, M.: Towards an architecture-independent analysis of parallel algorithms. SIAM J. Comput. 19(2), 322–328 (1990)
Rose, J.R.: C*: A C++-like language for data parallel computation. In: USENIX Proc. C++ Workshop, pp. 127–134 (1987)
Sun, X.H., Gustafson, J.L.: Toward a better parallel performance metric. Parallel Comput. 17(10–11), 1093–1109 (1991)
Sun, X.H., Ni, L.M.: Another view on parallel speedup. In: Proceedings of Supercomputing’90, pp. 324–333. IEEE, New York (2002). ISBN 0818620560
Wieland, F., Reiher, P., Jefferson, D.: Experiences in parallel performance measurement: The speedup bias. In: Symp. Experiences with Distributed & Multiprocessor Syst., pp. 205–215. USENIX, Berkeley (1992)
Williams, E., Bobrowicz, F.: Speedup predictions for large scientific parallel programs on Cray X-MP like architectures. In: Intl. Conf. Parallel Processing, pp. 541–543 (1985)
Zhang, X.: Performance measurement and modeling to evaluate various effects on a shared memory multiprocessor. IEEE Trans. Softw. Eng. 87–93 (1991)