
1 Introduction

Analytical performance models are powerful tools for understanding and predicting the performance of large-scale simulations. An analytical performance model is a closed-form expression that describes application performance, expressed in a metric of choice, as a combination of application components, application-specific parameters, and hardware parameters. Analytical models are human-readable and cost little to nothing to evaluate. Furthermore, they can provide many insights that are otherwise expensive to obtain, e.g., locating performance bottlenecks [10], or not obtainable at all, e.g., predicting how an application will perform on a next generation of supercomputers [7].

Analytical performance models are also useful in determining the impact of load-imbalance in large-scale parallel applications running on current parallel (distributed) systems. Load-imbalance occurs when the parallel processes of an application are not assigned an equal amount of work. This can cause significant inefficiency during the parallel execution of applications on large-scale systems. Several large-scale applications and libraries [5, 9, 15, 21] already use simple analytical performance models to predict load imbalance. Some of these analytical models have a per-process view of the application, thus allowing for detailed load-imbalance predictions [9].

Despite its advantages, analytical modeling remains challenging, as it requires both performance-modeling and application-specific expertise. Furthermore, because the resulting models are application-specific, most of the work needs to be redone when modeling a different application, or even the next version of the same application. While work on generalizing the process of building analytical performance models already exists [10, 14], most of these approaches aim to provide performance models that predict scalability and extrapolate to new, larger systems, and/or lack the fine granularity needed to support a better understanding of application inefficiency.

To address such limitations, we propose in this work a detailed methodology for building fine-grained, per-process analytical performance models for scientific simulations. By design, the per-process modeling gives us more detailed insights into the performance and load-balance of the modeled application. Our four-step modeling process is as follows: (1) we identify the code sections and input parameters with a relevant performance impact; (2) we build a symbolic analytical performance model that describes the application performance at process level; (3) we calibrate the model for a specific machine with the help of empirical performance data, and aggregate the per-process models into an application-wide performance prediction; and (4) we validate the model. The resulting model is outlined in Fig. 1. We note that the symbolic model remains constant for the target application; that is, migrating the model to a different system only requires re-calibration, i.e., collecting new empirical performance data.

Fig. 1. Overview of the per-process analytical model.

We demonstrate the feasibility of our approach on HemoCell, a high-performance framework for dense cellular suspension flows [19, 20]. Previous work, starting from L. Axner et al. [2], developed a model to predict runtime performance from fractional overheads and showed that, when an accurate estimation of these overheads is available, the model is an accurate tool (at most 5% error) for analyzing code execution, even in load-imbalanced scenarios. S. Alowayyed et al. [1] applied this model to HemoCell, where the fractional overhead caused by load imbalance was estimated under the assumption that it is entirely dependent on the local red blood cell count. In this work, we propose a novel methodology to build a model for function-level performance, which, after calibration, provides an estimation of the main sources of computational load (i.e., of the fractional loads). This method is demonstrated using HemoCell, similarly to [1]. However, the calibrated performance functions are defined using natural units of the simulation (red blood cell count and fluid node count). Specifically, we build a per-process, function-level symbolic model for HemoCell, and calibrate it for two different HPC platforms: Snellius (SURF, Netherlands) and DAS6 (ASCI, Netherlands) [3]. The model accuracy is evaluated on balanced and non-balanced simulations, using three scenarios that showcase different types of execution imbalance. Our results demonstrate good prediction accuracy for our models, indicating they can be useful tools for assessing load-imbalance impact in scientific simulations.

The remainder of this paper is structured as follows. We present our modeling approach in Sect. 2 and further show, in Sect. 3, how it is applied to build and calibrate an analytical performance model for HemoCell. We further evaluate the accuracy of the model in Sect. 4 on four scenarios with different degrees and types of load-imbalance. Finally, we provide a brief overview of related work in Sect. 5, and conclude the paper in Sect. 6.

2 Performance Modeling Methodology

In this section we present our methodology for building per-process analytical performance models for large-scale simulations. We assume a Single-Program, Multiple-Data (SPMD) model, where processes with different ranks are executed concurrently on different processing units (e.g., cores or nodes). Throughout the modeling process we also assume that at least function-level performance measurements of the application are available. The collected data depends on the desired model output, e.g., time (s), execution rate (Mflop/s), or energy (J).

Our methodology has four steps: (1) identifying relevant code sections and parameters, (2) building the model, (3) calibrating the model, and (4) validating the model. In this section we elaborate on each of these steps.

(1) Identify performance-relevant code sections and parameters: A code section can be any part of the application code that is monitored individually. In practice, such code sections are usually functions. The relevant code sections are those that contribute significantly to the performance breakdown. The performance of a code section changes based on external parameters, e.g., the size of the simulated domain. For each code section, we identify the relevant parameters, which are then selected as inputs for the model.

(2) Build the symbolic model: The model is built in a top-down manner: we start from a coarse symbolic model, and refine parts as needed, which allows for control over the level of detail incorporated into the model. The result is a symbolic analytical performance model that describes the performance of a single process in terms of the code sections and parameters selected in step (1).

The output of the per-process model is aggregated into a final prediction using operators that are application- and metric-specific (see Fig. 1). For example, when predicting execution time for fully concurrent applications, the performance is dominated by the longest process; however, when processes run sequentially, the aggregated execution time is the sum of the execution time of all processes.
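As a minimal illustration of these two aggregation operators, the Python sketch below aggregates hypothetical per-process time predictions; the values are illustrative only, not measurements.

```python
# Minimal sketch: aggregating per-process predictions into an
# application-wide prediction. The per-process times are hypothetical.
t_proc = [10.2, 11.8, 10.5, 14.1]  # predicted time (s) for each process

# Fully concurrent processes: the slowest process dominates.
t_concurrent = max(t_proc)  # 14.1 s

# Fully sequential processes: execution times accumulate.
t_sequential = sum(t_proc)  # 46.6 s
```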

(3) Calibrate the model: To calibrate the model, we replace the symbolic terms describing code-section performance with predictive functions. First, empirical data on code-section performance is collected. This data is used to fit a function for each individual code section; the degree of the function depends on the relationship between the code section and the input parameters, e.g., the output can scale linearly or exponentially with the parameters.

(4) Validate the model: To validate the model, we measure performance on relevant (unseen) datasets, and report the prediction error, calculated as \( e = |\textrm{predicted}-\textrm{measured}| \times 100/\textrm{measured}\). If needed, to increase prediction accuracy, the model can be further refined (i.e., functions can be split into smaller units). This, however, also increases the model complexity.
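For concreteness, a minimal sketch of this error computation (the example values are hypothetical):

```python
def prediction_error(predicted: float, measured: float) -> float:
    """Relative prediction error in percent: |predicted - measured| * 100 / measured."""
    return abs(predicted - measured) * 100 / measured

# Hypothetical example: predicted 9.5 s, measured 10.0 s -> 5.0% error.
print(prediction_error(9.5, 10.0))  # 5.0
```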

3 Modeling HemoCell

In this section, we build an analytical performance model describing the execution time of HemoCell. This model is calibrated on two different machines: Snellius (SURF, Netherlands) and DAS6 (ASCI, Netherlands) [3].

3.1 HemoCell

HemoCell is a coupled multi-scale simulation code used for modeling blood flow. The application simulates blood as a dense cellular suspension flow, modeling the evolution of particles, i.e., red blood cells (RBCs) and platelets, suspended in a solvent, i.e., the blood plasma, over multiple discrete time steps [19, 20]. The solvent is modeled as a fluid using the lattice Boltzmann method (LBM). The LBM calculations are handled by the Palabos library [12]. The movement, deformation, and interaction of particles are modeled separately from the LBM calculation. The two models are coupled intermittently to simulate the full blood flow system.

For parallelization, HemoCell uses multi-processing: each process receives a section of the simulated domain, i.e., a subdomain, and is responsible for computing the fluid and particles within that subdomain. During the simulation, the processes communicate with each other using MPI. The edges of the fluid field, as well as parts of particles that may span multiple subdomains, must be communicated to ensure correct results.

Previous research focused on improving HemoCell’s overall performance [16], as well as improving its scaling behavior through better load balancing [1].

3.2 Performance-Relevant Functions and Parameters

A HemoCell simulation consists of three phases: (1) setup, (2) computation, and (3) data output. Our work focuses on the most expensive of these phases, the computation. In turn, the HemoCell computation phase has three components: (i) fluid computation, (ii) particle computation, and (iii) model coupling.

We define performance-relevant functions as those functions that have a non-negligible performance impact. Similarly, we define performance-relevant parameters as function parameters that have a non-negligible performance impact. The process of identifying the performance-relevant functions and parameters is based on expert application knowledge, code inspection, and investigation of any available fine-grained performance measurements. Table 1 shows the performance-relevant functions and parameters for HemoCell.

Table 1. Performance-relevant functions and parameters for HemoCell.
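One possible way to shortlist performance-relevant functions is to keep the smallest set of functions that together account for most of the measured runtime. The sketch below assumes a flat profile of per-function times is available; the profile values (and the 95% cut-off) are hypothetical.

```python
# Hedged sketch: shortlist performance-relevant functions from a flat
# profile. The (function, seconds) pairs below are hypothetical.
profile = {
    "collideAndStream": 42.0,
    "syncEnvelopes": 18.5,
    "advanceParticles": 12.0,
    "logTimestep": 0.1,
}

def relevant(profile: dict, threshold: float = 0.95) -> list:
    """Return the smallest set of functions covering `threshold` of total time."""
    total = sum(profile.values())
    picked, covered = [], 0.0
    for name, t in sorted(profile.items(), key=lambda kv: -kv[1]):
        if covered / total >= threshold:
            break
        picked.append(name)
        covered += t
    return picked

print(relevant(profile))  # ['collideAndStream', 'syncEnvelopes', 'advanceParticles']
```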

3.3 Model-Building

For building the model, we start with the highest-level description of the application: the components.

$$\begin{aligned} T = \textrm{Iters} \times \big[ &\textrm{FluidField}(xs, ys, zs) \\ &+ \textrm{ParticleField}(xs, ys, zs, RBCs) \\ &+ \textrm{ModelCoupling}(RBCs) \big] \end{aligned}$$
(1)

In Eq. (1), Iters is the number of iterations, (xs, ys, zs) are the dimensions of the domain, RBCs is the number of red blood cells within the domain, and FluidField, ParticleField, and ModelCoupling are the functions that describe the execution time per iteration of each component. We improve on this initial model by expanding the component terms. Each component term is the sum of the time spent in the respective relevant functions (see Table 1).

To simplify the calibration step we derive two new parameters: V and SA, representing the subdomain volume and surface area, respectively. For rectangular domains they are defined as \(V = xs \times ys \times zs\) and \(SA = 2 \times (xs \times ys + xs \times zs + ys \times zs)\). Expanding the initial model, and replacing xs, ys, zs with either V or SA, gives us the following analytical model:

$$\begin{aligned} T = \textrm{Iters} \times \big[ &\textrm{FluidField}(V, SA) \\ &+ \textrm{ParticleField}(SA, RBCs) \\ &+ \textrm{ModelCoupling}(RBCs) \big] \end{aligned}$$
(2)
$$\begin{aligned} \textrm{FluidField}(V, SA) = \ &\textrm{CollideAndStream}(V) \\ &+ \textrm{CollideAndStream\_comm}(SA) \\ &+ \textrm{setExternalVector}(V) \end{aligned}$$
(3)
$$\begin{aligned} \textrm{ParticleField}(SA, RBCs) = \ &\textrm{syncEnvelopes}(RBCs) \\ &+ \textrm{syncEnvelopes\_comm}(RBCs, SA) \\ &+ \textrm{AdvanceParticles}(RBCs) \\ &+ \textrm{applyConstitutiveModel}(RBCs) \\ &+ \textrm{deleteNonLocalParticle}(RBCs) \end{aligned}$$
(4)
$$\begin{aligned} \textrm{ModelCoupling}(RBCs) = \ &\textrm{spreadParticleForce}(RBCs) \\ &+ \textrm{interpolateFluidVelocity}(RBCs) \end{aligned}$$
(5)

3.4 Model Calibration

In the calibration step, the symbolic terms in the model are replaced with predictors. The predictors are fitted using empirical data collected from the two machines, Snellius and DAS6; the machine details are presented in Table 2.

Table 2. Machine descriptions.

To collect the data, we simulate a cuboid-shaped domain of blood for 500 iterations. The size of the domain ranges from \((12.5, 12.5, 12.5)\,\upmu \textrm{m}\) to \((75, 75, 50)\,\upmu \textrm{m}\). Every domain size is run with seven different volume fractions of RBCs (hematocrit): 0%, 9%, 10%, 12%, 14%, 16%, and 18%. The workload in these experiments is fully balanced, i.e., every process performs the same amount of work. Throughout the modeling and analysis process, we use the Scalasca toolchain for automatic code instrumentation and performance measurements of HemoCell [8, 11].

The predictors are all functions fitted over the respective performance data. The degree of the fit function depends on the relationship between the parameters and the output metric. For this model, we have chosen the parameters such that all relationships are linear. The calibrated predictors are presented in Table 3.
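As an illustration of this fitting step, the sketch below fits a linear predictor for a single code section using least squares; the measurement values are hypothetical, not the data behind Table 3.

```python
import numpy as np

# Hedged sketch of calibration: fit a linear predictor t = a*RBCs + b for
# one code section (e.g., AdvanceParticles) from empirical measurements.
# The data points below are hypothetical.
rbcs = np.array([0, 120, 240, 480, 960], dtype=float)  # RBC count
time = np.array([0.002, 0.054, 0.101, 0.208, 0.410])   # seconds/iteration

A = np.column_stack([rbcs, np.ones_like(rbcs)])  # design matrix [RBCs, 1]
(a, b), *_ = np.linalg.lstsq(A, time, rcond=None)
print(f"AdvanceParticles(RBCs) ~ {a:.3e} * RBCs + {b:.3e}")
```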

Table 3. Calibrated performance predictors for Snellius and DAS6.

4 Scenario Analysis

The fine granularity of the model allows for accurate performance predictions in scenarios that differ from the configuration used for calibration. In this section we evaluate the model, as presented in Sect. 3, and use it to analyze the performance of HemoCell in four different scenarios: (1) balanced workload, (2) imbalanced subdomains, (3) imbalanced hematocrit, and (4) imbalanced communication.

4.1 Scenario: Balanced Workload

In the balanced scenario, each process receives the same amount of work. The setup is identical to the simulation configurations used for model calibration; however, the domain sizes are different. Empirical and predicted results are shown for Snellius and DAS6 in Fig. 2a and 2b. The results show that the model accurately predicts the performance in this scenario, with a maximum error of 12.87% and an average error of 3.6%.

Fig. 2. Observed and predicted execution time and prediction error for the load-balanced scenario, on DAS6 and Snellius. The standard deviation of the observed results is within 1.5%.

4.2 Scenario: Imbalanced Subdomains

In an ideal scenario each process is assigned a subdomain of the same size. However, due to complex simulation domains this is not always achievable. An imbalanced distribution of the domain leads to a loss of performance.

The imbalance in this scenario is generated by assigning half of the processes to 75% of the full domain, and the other half to 25% of the full domain, see Fig. 3. This means that half of the processes are assigned three times more work than the other half.

Fig. 3. Imbalanced domain distribution. Both the red and blue parts are assigned to half of the processes. (Color figure online)

For each configuration, we measure and predict the results for both the imbalanced and the balanced setup. By comparing the balanced and imbalanced predictions, we estimate the overhead introduced by the load imbalance. The results are presented in Fig. 4.
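A minimal sketch of this overhead estimation, using the max-aggregation of Sect. 2 and hypothetical per-process predictions:

```python
# Hedged sketch: estimate load-imbalance overhead from model predictions.
# Per-process predictions (seconds) are hypothetical; with fully concurrent
# processes the slowest one dominates (max-aggregation, Sect. 2).
t_balanced   = [10.0, 10.0, 10.0, 10.0]  # every process: equal share
t_imbalanced = [15.0, 15.0, 5.0, 5.0]    # half: 75% of domain, half: 25%

overhead = (max(t_imbalanced) - max(t_balanced)) / max(t_balanced)
print(f"predicted imbalance overhead: {overhead:.0%}")  # 50%
```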

We observe good accuracy for the imbalanced scenario predictions, with a maximum error of 15.83% and an average of 10.26%. The prediction accuracy on Snellius is lower than on DAS6. This is most likely caused by the difference in memory layout. A Snellius node is dual-socket, meaning that the L3 cache is not shared between all processes. By assigning most of the domain to half of the processes, we move most of the data onto a single socket, significantly increasing the amount of data accessed by that socket. This is compounded by the larger overall domain on Snellius, which results in the subdomains of one process being further apart. As a result, the cost of memory operations increases more on Snellius, which is not captured by the model.

We also observe that the load-imbalance overhead is higher for the 18% hematocrit configurations, compared to the equivalent 0% runs. This increase is due to the RBC computation, which worsens the already-present workload imbalance. This increase is correctly captured by the model.

Fig. 4. Observed and predicted execution time and prediction error for the imbalanced domain scenario, on DAS6 and Snellius. The standard deviation of the observed results is within 1.5%.

4.3 Scenario: Imbalanced Hematocrit

The hematocrit value has a significant impact on performance, as can be seen in Fig. 2. A higher hematocrit means that more of the volume is occupied by RBCs, resulting in more computation and communication. In the configurations so far, we assumed a homogeneous hematocrit throughout the domain. However, in more realistic scenarios the hematocrit varies throughout the domain.

The imbalanced hematocrit scenario shows the performance overhead of a non-homogeneous hematocrit. To create an imbalanced hematocrit, each domain is initialized such that part of the domain has a hematocrit of 18%, while the other part is initialized at either 9% or 0%. On DAS6, both parts are evenly sized; on Snellius, they are either evenly sized (see Fig. 5a), or the first 16 processes are assigned the higher-hematocrit subdomains (see Fig. 5b).

Fig. 5. Snellius imbalanced hematocrit (18% and 0%) across 128 processes.

Because the fluid computation is not affected by the hematocrit, for this scenario we only show the time spent on the particle and coupling components. The results are presented in Fig. 6.

Fig. 6. Observed and predicted execution time and prediction error for the imbalanced hematocrit scenario, on DAS6 and Snellius. The standard deviation of the observed results is within 1.5%.

For the imbalanced configurations where the domain is split in half, the highest observed prediction error is acceptable, at 8.61%. However, for the (16/112) configurations, the predictions are less accurate, with errors above 16.8%. The large error is caused by a change in the number of communication neighbors, which is not captured in the current model. We address this limitation in Sect. 4.4.

4.4 Scenario: Imbalanced Communication

In the imbalanced hematocrit results, we observe a lower prediction accuracy for the (16/112) distribution configurations. This is partially caused by a change in communication costs. The processes that are assigned more work in the (16/112) configuration, as shown in Fig. 5b, are located at the edge of the non-periodic domain. This means they communicate with fewer neighbors than if their subdomains were located in the middle of the domain. However, during calibration it is assumed that the processes are fully surrounded by neighbors. To address this, we expand the original model with a term expressing how many direct neighbors a process needs to communicate with.

Fig. 7. Snellius imbalanced hematocrit (18% and 0%) across 128 processes, for the (8/120) and (32/92) distributions.

The model is expanded by adding \(CR_{x}\), which denotes the communication ratio of the fluid or particle component, \(CR_{x} = \frac{\#\textrm{Neighbors}}{\mathrm {Max\ Neighbors}}\). This ratio is defined separately for the fluid and particle communication because the maximal number of neighbors for fluid communication is 18, as opposed to 26 possible neighbors for particle communication. In Eqs. (3) and (4), the functions describing the fluid and particle communication are replaced with Eqs. (6) and (7), which multiply the original communication terms with the newly introduced \(CR_{component}\) ratios.

$$\begin{aligned} &\textrm{syncEnvelopes\_comm}(RBCs, SA, CR_{particle}) = \\ &\qquad CR_{particle} \times \textrm{syncEnvelopes\_comm}(RBCs, SA) \end{aligned}$$
(6)
$$\begin{aligned} &\textrm{collideAndStream\_comm}(SA, CR_{fluid}) = \\ &\qquad CR_{fluid} \times \textrm{collideAndStream\_comm}(SA) \end{aligned}$$
(7)
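The sketch below illustrates the communication-ratio computation of Eqs. (6)–(7); the neighbor counts for the example process are hypothetical.

```python
# Hedged sketch of the communication-ratio extension of Eqs. (6)-(7).
# Maximum neighbor counts are 18 (fluid) and 26 (particles), per Sect. 4.4.
MAX_FLUID_NEIGHBORS = 18
MAX_PARTICLE_NEIGHBORS = 26

def comm_ratios(n_fluid_neighbors: int, n_particle_neighbors: int):
    return (n_fluid_neighbors / MAX_FLUID_NEIGHBORS,
            n_particle_neighbors / MAX_PARTICLE_NEIGHBORS)

# Hypothetical example: a process in a corner of a non-periodic domain has
# 6 fluid neighbors (faces + edges) and 7 particle neighbors (incl. corner).
cr_fluid, cr_particle = comm_ratios(6, 7)

# The adjusted communication terms simply scale the calibrated predictors:
#   collideAndStream_comm(SA) * cr_fluid
#   syncEnvelopes_comm(RBCs, SA) * cr_particle
# An interior process has all neighbors (CR = 1), recovering Eqs. (3)-(4).
```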

To verify that the model expansion improves the accuracy, the updated model is applied to the (16/112) configuration, as well as to an (8/120) configuration, shown in Fig. 7a, and a (32/92) configuration, shown in Fig. 7b. These experiments are run on Snellius for 500 iterations with a subdomain size of \((50, 50, 50)\,\upmu\)m. The results are presented in Fig. 8.

Fig. 8. Observed and predicted results and prediction error for the imbalanced communication scenario, on DAS6 and Snellius. The standard deviation of the observed results is within 1.5%.

The results show a clear improvement in prediction accuracy compared to the previous version of the model, without the added imbalance term. The highest prediction error is reduced from 24.53% to 16.19% when using the updated model. The results of all experiments performed before this did not change, because in those configurations the processes that dominate performance communicate with the maximum number of neighbors. Not only is the accuracy better, but the results also provide more detailed information about the different configurations: with the old model, the prediction for each configuration would be identical, whereas the measurements show that the configurations do not perform the same. The updated model is capable of highlighting this difference in performance.

5 Related Work

This section provides a brief overview of alternative performance models, and how they differ from our own approach.

Analytical models of the performance of large-scale applications have been proposed for many years [14]. However, such models are application-specific, and not generalizable to a wider range of applications. To address this, Hoefler et al. [10] propose a multistep approach for building analytical performance models. Their metric of interest is application scaling behavior; therefore, these models do not capture per-process performance.

Other tools automate parts of the modeling process; for example, EXTRA-P [6, 7] targets finding scalability bugs in large-scale applications. EXTRA-P builds a statistical model based on empirical performance results of the application. However, the resulting model is non-trivial to understand and tweak, and cannot predict scalability bugs at process level.

Beyond analytical models, other types of performance models, such as simulators and machine-learning-based models, are typically more accurate, but suffer from interpretability issues. For example, machine-learning-based models require significant training data and resources, and the resulting black-box models provide far less insight into the application performance characteristics [13, 17]. Functional and cycle-accurate simulations provide accurate information on how an application behaves, and why, but they take a very long time to build and calibrate, which renders them difficult to use for large-scale applications [4, 18].

6 Conclusion

In this paper we proposed a methodology for building per-process performance models for large-scale, multi-processing simulations. The per-process modeling approach yields portable, fine-grained, and accurate analytical performance models.

We further used the proposed methodology to build an accurate predictive model for HemoCell, a coupled simulation of blood flow. We demonstrated that the resulting model is capable of accurate performance prediction for both balanced workloads, where we see a maximum error of 12.9% and an average error of 3.6%, and imbalanced scenarios, with a maximum error of 16.2% and an average error of 10.2%. These results indicate that, although the model-building and calibration steps are based on simple balanced workloads, the model is able to analyze and predict simulation performance in load-imbalanced scenarios. This is a significant advantage of our per-process approach. Finally, we have shown how to refine the model to address potential inaccuracies, thus showing the advantage of a white-box, analytical approach.

In future work, we aim to reduce the amount of manual work required to build the model, for example by incorporating statistical methods for determining the performance behavior of code sections. In the near future, we aim to extend the model to multi-node computations, and to apply it to more simulations and to increasingly dynamic scenarios. Specifically, the workload could shift dynamically between processes at runtime as a result of the simulated phenomena, leading to different load-imbalance patterns during execution. We therefore plan to use the model to predict the performance degradation due to such changes in load balance.