# E-AMOM: an energy-aware modeling and optimization methodology for scientific applications

## Authors

Lively, C., Taylor, V., Wu, X. et al. Comput Sci Res Dev (2014) 29: 197. doi:10.1007/s00450-013-0239-3

## Abstract

In this paper, we present the Energy-Aware Modeling and Optimization Methodology (E-AMOM) framework, which develops models of runtime and power consumption based upon performance counters and uses these models to identify energy-based optimizations for scientific applications. E-AMOM utilizes predictive models to employ run-time Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Concurrency Throttling (DCT) to reduce power consumption of the scientific applications, and uses cache optimizations to further reduce runtime and energy consumption of the applications. The models and optimization are done at the level of the kernels that comprise the application. Our models resulted in an average error rate of at most 6.79 % for Hybrid MPI/OpenMP and MPI implementations of six scientific applications. With respect to optimizations, we were able to reduce the energy consumption by up to 21 %, with a reduction in runtime by up to 14.15 %, and a reduction in power consumption by up to 12.50 %.

### Keywords

Performance modeling · Energy consumption · Power consumption · MPI · Hybrid MPI/OpenMP · Power prediction · Performance optimization

## 1 Introduction

Currently, an important research topic in high-performance computing is that of reducing the power consumption of scientific applications on high-end parallel systems (e.g., petaflop systems) without significant increases in runtime performance [2–6, 8–10, 12–15, 17–19, 22]. Performance models can be used to provide insight into an application’s performance characteristics that significantly impact runtime and power consumption. As HPC systems become more complex, it is important to understand the relationships between performance and power consumption and the characteristics of scientific applications. In this paper, we present E-AMOM, an Energy-Aware Modeling and Optimization Methodology for developing performance and power models and reducing energy. In particular, E-AMOM develops application-centric models of the runtime, CPU power, system power, and memory power based on performance counters. The models are used to identify ways to reduce energy.

E-AMOM has three goals:

- To obtain the necessary application performance characteristics at the kernel level, in order to determine application bottlenecks on a given system with regard to execution time and the power consumption of the system, CPU, and memory components.
- To improve application performance at the kernel level by applying DVFS [12] and DCT [5] to reduce power consumption and by optimizing algorithms to reduce energy consumption.
- To provide performance predictions (of time and power requirements) for scheduling methods used on systems with a fixed power budget.

The main contributions of this paper are as follows:

- (1)
Using the performance-tuned principal component analysis (PCA) method [1], we develop accurate performance models of hybrid MPI/OpenMP and MPI implementations of six scientific applications at the kernel level. Our models accurately predict runtime and the power consumption of the system, CPU, and memory components across different numbers of cores, frequency settings, concurrency settings, and application inputs, with an average error rate of at most 6.79 % across the six scientific applications.

- (2)
The models are used to determine appropriate frequency and concurrency settings for application kernels to reduce power consumption. The kernel models are also used to improve runtime through loop blocking and loop unrolling.

- (3)
Our combined optimization strategy, developed in E-AMOM, is able to reduce energy consumption of hybrid and MPI scientific applications by up to 21 %, with a reduction in runtime by up to 14.15 %, and a reduction in power consumption by up to 12.50 % on multicore systems.

The remainder of this paper is organized as follows. Section 2 presents the E-AMOM framework. Section 3 discusses the details of the modeling component of E-AMOM. Section 4 provides the methodology for optimization of scientific applications using E-AMOM. Section 5 presents detailed experimental results obtained for the modeling and optimization experiments. Section 6 discusses some related work, and Sect. 7 concludes the work and discusses some future work.

## 2 E-AMOM framework

E-AMOM is one of the modeling and optimization components of MuMMI (Multiple Metrics Modeling Infrastructure) [27]. MuMMI facilitates systematic measurement, modeling, and prediction of performance, power consumption and performance-power trade-offs for multicore systems by building upon three existing frameworks: Prophesy [23], PowerPack [10] and PAPI [20]. E-AMOM utilizes MuMMI’s instrumentation framework [27], PowerPack for collecting power profiles, and PAPI for collecting performance counter data.

Dynamic Voltage and Frequency Scaling (DVFS) [12] is a technique that reduces the voltage and frequency of a CPU in order to reduce power consumption. Applying DVFS is especially beneficial during periods of communication slack that appear during parallel execution due to load imbalance among communicating tasks. Dynamic Concurrency Throttling (DCT) [5] is a technique that reduces the number of threads used to execute an application. Applying DCT is especially beneficial during OpenMP performance phases that do not benefit from using the maximum number of OpenMP threads per node.

Much of the computation involved in the kernels of HPC applications occurs within nested loops. Therefore, loop optimization is fundamentally important for such applications. It is for this reason that E-AMOM utilizes loop blocking and loop unrolling within its optimization component.

## 3 Modeling methodology

In this section we present the performance-tuned principal component analysis method, which is used for the modeling component of E-AMOM. During an application execution we capture 40 performance counters utilizing PAPI and the perfmon performance library. For the given execution, all of the performance counters are normalized by the total cycles of execution to create performance event rates for each counter. In addition, we restrict the models to have non-negative regression coefficients to ensure that the models represent actual performance scenarios. A multivariate linear regression model is constructed for each performance component (execution time, system power, CPU power, and memory power), for each kernel in an application.
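As an illustration of a regression fit restricted to non-negative coefficients (a minimal projected-gradient sketch in Python; `nnls_pg` and the toy data are ours, not E-AMOM code):

```python
import numpy as np

def nnls_pg(A, b, iters=5000):
    """Least-squares fit with non-negative coefficients via projected
    gradient descent (coefficients clipped to zero after each step)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    lr = 1.0 / np.linalg.norm(A, 2) ** 2    # step size from the spectral norm
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)            # gradient of 0.5 * ||Ax - b||^2
        x = np.maximum(0.0, x - lr * grad)  # project onto x >= 0
    return x

# Toy fit: event-rate columns -> runtime, with a known non-negative solution.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([2.0, 3.0, 5.0])
beta = nnls_pg(A, b)   # converges close to [2.0, 3.0]
```

In practice each column of `A` would hold one counter's event rate (counter value divided by total cycles), one row per training run.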

The most difficult part of predicting application performance via performance counters is determining the appropriate counters for modeling each performance component. The following six-step algorithm is used to identify the appropriate performance counters and to develop the models. We use the hybrid NPB BT-MZ with Class C to illustrate the method for modeling the runtime of the application, focusing on the full BT-MZ application for the results of most steps. The training set used is consistent with the training set description given in Sect. 5 on experimental results.

### Step 1

*Compute the Spearman’s rank correlation for each performance counter event rate for each performance component* (*runtime*, *system power*, *CPU power*, *and memory power*).

The Spearman's rank correlation coefficient is computed as

$$\rho = \frac{\sum_{i}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i}(x_{i}-\overline{x})^{2}\sum_{i}(y_{i}-\overline{y})^{2}}} \quad (1)$$

where *x*_{i} and *y*_{i} represent the ranks of the performance counter and the performance component (time, system power, CPU power, memory power) being correlated, and \(\overline{x}\) and \(\overline{y}\) represent the averages of the samples of each variable. The Spearman rank correlation provides a value for *ρ* between −1 and 1.

The Spearman’s rank correlation coefficient is used because it provides a correlation value that is not easily influenced by outliers in the dataset.
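A minimal sketch of the computation (our illustrative Python; ties are not handled with average ranks here, which a full implementation would require):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation applied to ranks."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)   # rank 1 = smallest value
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# The outlier 100 barely matters, because only its rank enters the formula.
rho = spearman_rho([1, 2, 3, 4, 100], [2, 4, 6, 8, 10])   # -> 1.0
```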

For the BT-MZ benchmark, we do not give the coefficients for all 40 counters, since many of the counters were eliminated based upon Step 2.

### Step 2

*Establish a threshold*, *α*_{ai}, *to be used to eliminate counters with Spearman rank correlation values below the threshold*.

*α*_{ai} is used to determine an appropriate threshold for eliminating performance counters with low correlation to the performance component to be modeled. The value for *α*_{ai} is established based on a cluster analysis using a Gaussian mixture distribution, which determines clusters based on the multivariate normal components of the representative points. Table 1 provides the results of this step for BT-MZ. The value of *α*_{ai} was determined to be 0.55.

**Table 1** Correlation coefficients for BT-MZ in Step 2

| Counter | Correlation Value |
|---|---|
| PAPI_TOT_INS | 0.9187018 |
| PAPI_FP_OPS | 0.9105984 |
| PAPI_L1_TCA | 0.9017512 |
| PAPI_L1_DCM | 0.8718455 |
| PAPI_L2_TCM | 0.8123510 |
| PAPI_L2_TCA | 0.8021892 |
| Cache_FLD | 0.7511682 |
| PAPI_TLB_DM | 0.6218268 |
| PAPI_L1_ICA | 0.6487321 |
| Bytes_out | 0.6187535 |
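The thresholding idea can be sketched with a simple two-component 1-D EM fit (illustrative Python; the paper's exact Gaussian-mixture clustering procedure is not specified at this level of detail, and `gmm_threshold` is our name):

```python
import numpy as np

def gmm_threshold(values, iters=200):
    """Fit a two-component 1-D Gaussian mixture by EM and return the midpoint
    of the component means as an elimination threshold."""
    v = np.asarray(values, dtype=float)
    mu = np.array([v.min(), v.max()])        # init one component at each end
    sigma = np.array([v.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each value
        # (densities up to the constant 1/sqrt(2*pi), which cancels)
        dens = pi * np.exp(-0.5 * ((v[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(v)
        mu = (resp * v[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (v[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return float(mu.mean())

# Correlations split into a clear low and high cluster; the threshold
# falls between the two clusters.
thr = gmm_threshold([0.10, 0.12, 0.15, 0.80, 0.82, 0.85])
```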

### Step 3

*Compute a multivariate linear regression model based upon the remaining performance counter event rates*. *Recall that we restrict the coefficients to be non*-*negative*.

### Step 4

*Establish a new threshold*, *α*_{bi}, *and eliminate performance counters with regression coefficients smaller than the selected threshold*.

*α*_{bi} serves as the second elimination threshold; like *α*_{ai}, its purpose is to eliminate performance counters that do not contribute substantially to the initial multivariate linear regression model. The determination of *α*_{bi} is accomplished via the same method as in Step 2, but applied to the regression coefficients. Table 2 provides the results of Step 3. An appropriate value for *α*_{bi} is important: if the value is not chosen correctly, counters needed in the modeling will be eliminated. The value determined for BT-MZ was 0.02. Table 3 provides the results of Step 4.

**Table 2** Regression coefficients for BT-MZ in Step 3

| Counter | Regression Coefficient |
|---|---|
| PAPI_TOT_INS | 1.984986 |
| PAPI_FP_OPS | 1.498156 |
| PAPI_L1_DCM | 0.9017512 |
| PAPI_L1_TCA | 0.465165 |
| PAPI_L2_TCA | 0.0989485 |
| PAPI_L2_TCM | 0.0324981 |
| Cache_FLD | 0.026154 |
| PAPI_TLB_DM | 0.0000268 |
| PAPI_L1_ICA | 0.0000021 |
| Bytes_out | 0.000009 |

**Table 3** Regression coefficients for BT-MZ in Step 4

| Counter | Regression Coefficient |
|---|---|
| PAPI_TOT_INS | 1.984986 |
| PAPI_FP_OPS | 1.498156 |
| PAPI_L1_DCM | 0.9017512 |
| PAPI_L1_TCA | 0.465165 |
| PAPI_L2_TCA | 0.0989485 |
| PAPI_L2_TCM | 0.0324981 |
| Cache_FLD | 0.026154 |

### Step 5

*Compute the principal components of the reduced performance counter event rates*.

Each principal component, *Y*_{i}, is given by a linear combination of the variables *X*_{1}, *X*_{2},…, *X*_{p}. For example, the first principal component is represented by

$$Y_{1} = a_{11}X_{1} + a_{12}X_{2} + \cdots + a_{1p}X_{p} \quad (2)$$

where the coefficients *a*_{11}, *a*_{12},…, *a*_{1p} are calculated with the constraint that the sum of their squares equals 1:

$$a_{11}^{2} + a_{12}^{2} + \cdots + a_{1p}^{2} = 1 \quad (3)$$

The number of principal components calculated depends on the number of variables in the original data; in our work, it equals the number of performance counters remaining after Step 4. The first two principal components represent the largest amount of variability, or information, in our data. Therefore, the first two principal components are used to further reduce the number of variables for the multivariate linear regression model. Using the vectors resulting from the first two principal components, the variables with the highest coefficients serve as the most accurate predictors for modeling. For BT-MZ, PCA identifies the final performance counters for modeling as PAPI_TOT_INS, PAPI_L2_TCA, and PAPI_L2_TCM.
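This selection step can be sketched with an SVD-based PCA (illustrative Python; `top_counters_by_pca` and the toy data are ours, not from the paper):

```python
import numpy as np

def top_counters_by_pca(X, names, n_keep=2):
    """Rank counters by the magnitude of their loadings on the first two
    principal components (computed via SVD) and keep the top n_keep."""
    Xc = X - X.mean(axis=0)                  # center each event-rate column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = np.abs(Vt[:2]).sum(axis=0)    # |PC1| + |PC2| loading per counter
    order = np.argsort(loadings)[::-1]
    return [names[i] for i in order[:n_keep]]

# Column "a" dominates the variance, so it dominates the first component.
X = np.array([[100.0, 4.0, 2.0],
              [200.0, 1.0, 0.5],
              [300.0, 3.0, 1.5],
              [400.0, 2.0, 1.0]])
selected = top_counters_by_pca(X, ["a", "b", "c"])
```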

### Step 6

*Use the performance counter event rates with the highest principal component coefficient vectors to build a multivariate linear regression model to predict the respective performance metric*.

The final model has the form

$$Y = \beta_{0} + \beta_{1}r_{1} + \beta_{2}r_{2} + \cdots + \beta_{n}r_{n} \quad (4)$$

where *Y* is the performance metric being predicted, *r*_{i} (*i*=1,…,*n*) represents the event rate for performance counter *i*, and *β*_{i} (*i*=1,…,*n*) is the regression coefficient for counter *i*.

**Table 4** Final regression coefficients for BT-MZ

| Counter | Regression Coefficient |
|---|---|
| Frequency | 0.00476 |
| PAPI_TOT_INS | 0.105050 |
| PAPI_L2_TCA | 0.097108 |
| PAPI_L2_TCM | 0.178700 |
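As a worked example using the final BT-MZ coefficients above (the event-rate values below are made-up placeholders, and a zero intercept is assumed; this is our illustration, not a result from the paper):

```python
# Final BT-MZ regression coefficients from the paper; the event-rate inputs
# are hypothetical, chosen only to illustrate the computation.
coef = {
    "Frequency": 0.00476,
    "PAPI_TOT_INS": 0.105050,
    "PAPI_L2_TCA": 0.097108,
    "PAPI_L2_TCM": 0.178700,
}
rates = {
    "Frequency": 2.8,         # GHz
    "PAPI_TOT_INS": 1.2,      # instructions per cycle
    "PAPI_L2_TCA": 0.4,       # L2 accesses per cycle
    "PAPI_L2_TCM": 0.05,      # L2 misses per cycle
}

# Predicted metric = weighted sum of the selected event rates.
prediction = sum(coef[k] * rates[k] for k in coef)
```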

## 4 Optimization methodology

In this section we discuss the methods that are used to improve the performance of scientific applications with respect to system power consumption and runtime since this is the focus of this paper. Details about the optimization methods for CPU power and memory power can be found in [16]. Our optimization methods include DVFS, DCT, loop unrolling and loop blocking.

The performance of an application is modeled as the sum over its kernels, \(P = \sum_{i} K_{i}\), where *P* represents the performance of the application and *K*_{i} represents the performance of kernel *i*.

### 4.1 Applying DVFS and DCT

- (1) Determine the appropriate configuration settings:
  - (a) DVFS setting:
    - (i) Compute the expected power consumption and execution time at the lower CPU frequency.
    - (ii) If the frequency setting results in a 10 % saving in power consumption without increasing runtime by more than 3 %, then use the reduced frequency.
  - (b) DCT setting:
    - (i) Compute the expected power consumption and execution time at concurrency settings using 1, 2, 4, and 6 threads.
    - (ii) Identify the concurrency setting that enables the best improvement in power consumption and runtime.
- (2) Determine the total application runtime and system power consumption, including the synchronization overhead cost *μ*_{i} incurred by changing the configuration settings for kernel *i* in the HPC application.
- (3) Use the new configuration settings for running the application.

We utilize the multivariate linear regression equation presented in (5) to determine the appropriate configuration based on frequency and number of threads. The frequency, number of nodes, and threads per node, are incorporated into the regression equation with the performance counters to predict the performance of the application kernel at two CPU frequency settings (2.4 GHz and 2.8 GHz) and at concurrency settings of 1, 2, 4, 6, and 8 threads.

The predicted system power consumption of the application is the sum over its kernels, \(\sum_{i=1}^{n} K_{sysp\_i}\), where *K*_{sysp_i} represents the predicted system power consumption of kernel *i* and *n* is the number of kernels in the application. We have determined the following scenarios in which it is appropriate to apply changes to configuration settings in our applications:

- (1)
DVFS and DCT changes to specific kernels in the application

- (2)
DVFS-only applied to specific kernels in the application

- (3)
DCT-only applied to specific kernels in the application

- (4)
DVFS applied to a small number of time-steps within an application.
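The DVFS acceptance rule in step (1)(a)(ii) above can be sketched as a simple predicate (illustrative Python; `choose_frequency` and the sample values are ours, not part of E-AMOM):

```python
def choose_frequency(p_high, t_high, p_low, t_low):
    """Return which frequency to use, per the rule: accept the lower CPU
    frequency only if it saves at least 10% power while increasing
    runtime by no more than 3%."""
    power_saving = (p_high - p_low) / p_high
    runtime_penalty = (t_low - t_high) / t_high
    return "low" if power_saving >= 0.10 and runtime_penalty <= 0.03 else "high"

# ~11.8% power saving at a 2% runtime cost -> take the lower frequency.
decision = choose_frequency(p_high=340.0, t_high=100.0, p_low=300.0, t_low=102.0)
```

The predicted power and time inputs would come from evaluating the per-kernel regression models at each frequency setting.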

### 4.2 Loop optimizations

The optimal loop block size varies across applications and systems. In this work, we apply loop block sizes of 2×2, 4×4, 8×8 and 16×16 to our HPC applications and, using a reduced number of iterations, measure each application's performance for each block size to approximate the best one. Previous work [25] has identified these block sizes as effective for achieving performance improvements in scientific applications.

Outer loop unrolling can increase computational intensity and minimize loads/stores, while inner loop unrolling can reduce data dependencies and eliminate intermediate loads and stores. For most applications, loops were unrolled four times, depending on the number of iterations of each loop.

In considering the different configurations for the loop optimization, the focus is on the runtime. The selected loop optimization configuration is the one resulting in the best runtime.
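Loop blocking can be illustrated with a tiled matrix multiply (a generic Python sketch of the technique, not code from the studied applications; the 2×2-to-16×16 block sizes above correspond to the `bs` parameter):

```python
def blocked_matmul(A, B, bs):
    """Tiled (blocked) matrix multiply: iterate over bs x bs blocks so that
    each tile of A, B, and C is reused while it is still cache-resident."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                # multiply one bs x bs tile pair into the C tile
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = blocked_matmul(A, B, bs=2)
```

The result is identical to the naive triple loop; only the traversal order (and hence cache behavior) changes.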

## 5 Experimental results

In this work, we conduct our experiments on SystemG, a power-aware cluster at Virginia Tech (http://www.cs.vt.edu/facilities/systemg). SystemG has 325 Mac Pro compute nodes; each node contains 30 thermal sensors, more than 30 power sensors, and two quad-core 2.8 GHz Intel Xeon 5400 series processors with 8 GB of memory. Note that, in this paper, *M*×*N* stands for *M* nodes with *N* cores used per node.

The training of our models is based on performance data obtained for each application at different processor sizes for predicting intra-node and inter-node performance. The training set consists of 12 different system configurations for each application; each system configuration involves the use of one data set for each of the applications. The 12 training set configurations focus on intra-node performance (1×1, 1×2, 1×3 executed at 2.8 GHz and 1×3, 1×4, and 1×6 executed at 2.4 GHz) and inter-node performance (1×8, 3×8, 5×8 at 2.8 GHz and 7×8, 9×8, 10×8 at 2.4 GHz).

Using the training set we construct a multivariate linear regression model for each application. This model is then used to predict 36 different system configurations (not included in the training set) for each application. The 36 different configurations included points that were outside of the training set including 2×8, 4×8, 6×8, 8×8, 9×8, 10×8, 11×8, 12×8, 13×8, 14×8, 15×8, and 16×8, which were executed at frequencies using 2.8 GHz and 2.4 GHz for different datasets for the applications.

The applications used throughout this paper include three NAS Multi-Zone Parallel Benchmarks 3.3 (with Class B, C and D) [11] and three large-scale scientific applications: GTC (a particle-in-cell magnetic fusion application with 50, 75 and 100 particles per cell) [7], PMLB (a parallel Lattice Boltzmann application with problem sizes of 64×64×64, 128×128×128 and 256×256×256) [24] and Parallel EQdyna (a parallel finite element earthquake simulation with the resolution size of 200 m) [26]. We consider the MPI and hybrid (MPI/OpenMP) versions of each application.

### 5.1 Modeling results

Figure 3 presents the modeling accuracy across the hybrid implementations of our six large-scale scientific applications. In terms of runtime, the BT-MZ and EQdyna applications had the lowest prediction error, approximately 1.9 %. For system power, the lowest prediction error occurred for the PMLB application, with an error of 0.84 %. The SP-MZ application had the lowest prediction error for CPU power, and EQdyna had the lowest prediction error for memory power (1.73 %).

Figure 4 presents the modeling accuracy across the MPI implementations of our six large-scale scientific applications. For runtime, the BT-MZ application has the lowest prediction error, 1.06 %. For system power, the GTC application had an error rate of 0.94 %, the lowest across all applications and performance components for the MPI implementations. The LU-MZ application provides the lowest error rate for both CPU power prediction (2.01 %) and memory power prediction (1.62 %).

Overall, the prediction results indicate that the predictive models have an average error rate of at most 6.79 % across six hybrid and MPI scientific applications on up to 128 processor cores, and can be used to gain insight into improving applications for better performance on multicore systems.

### 5.2 Optimizations of scientific applications

In this section we present the optimization results obtained for four large-scale scientific applications. Optimizations were applied to all six MPI and hybrid scientific applications, but only four are presented due to space constraints; these four are representative of the results for all six.

#### 5.2.1 BT-MZ

Applying DVFS and DCT to select application kernels reduces power consumption and execution time, improving the performance of the hybrid NPB BT-MZ. DVFS is applied to the initialize-solutions kernel, which sets up the appropriate zones for execution, and to the exchange of boundary conditions, which involves significant MPI communication. DCT is applied during the BT solver kernel, reducing power consumption during this phase; the optimal configuration uses 4 threads.

**Table 5** Performance comparison of hybrid BT-MZ

| #Cores (M×N) | BT-MZ Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 6×8 | Hybrid | 655 | 348.70 | 228.401 |
| | Optimized-Hybrid | 632 (−3.64 %) | 324.41 (−7.49 %) | 205.027 (−11.4 %) |
| 8×8 | Hybrid | 493 | 348.73 | 171.573 |
| | Optimized-Hybrid | 440 (−12 %) | 322.17 (−8.24 %) | 141.754 (−21.0 %) |
| 16×8 | Hybrid | 339 | 347.82 | 117.911 |
| | Optimized-Hybrid | 319 (−6.27 %) | 323.11 (−7.65 %) | 103.072 (−14.39 %) |
| 32×8 | Hybrid | 201 | 346.12 | 69.570 |
| | Optimized-Hybrid | 193 (−4.14 %) | 325.92 (−6.2 %) | 62.902 (−10.60 %) |
| 64×8 | Hybrid | 119 | 347.27 | 41.325 |
| | Optimized-Hybrid | 112 (−6.25 %) | 324.45 (−7.03 %) | 36.338 (−13.7 %) |
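The energy-per-node column above is consistent with energy = average power × runtime. A quick sanity check (the relation E = P·t is our inference from the reported numbers, not a formula stated in the paper):

```python
# (runtime s, average power W, reported energy per node kJ) rows from the
# hybrid BT-MZ table; reported energy should equal power * time / 1000.
rows = [
    (655, 348.70, 228.401),   # 6x8 Hybrid
    (632, 324.41, 205.027),   # 6x8 Optimized-Hybrid
    (440, 322.17, 141.754),   # 8x8 Optimized-Hybrid
]
for runtime_s, power_w, energy_kj in rows:
    assert abs(runtime_s * power_w / 1000 - energy_kj) < 0.05
```

This also explains why a configuration can win on energy even when its power reduction is modest: the runtime reduction multiplies in.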

**Table 6** Performance comparison of MPI BT-MZ

| #Cores (M×N) | BT-MZ Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 6×8 | MPI | 729 | 330.732 | 241.104 |
| | Optimized-MPI | 700 (−4.14 %) | 323.81 (−2.14 %) | 226.667 (−6.36 %) |
| 8×8 | MPI | 545 | 327.014 | 178.223 |
| | Optimized-MPI | 489 (−11.45 %) | 320.17 (−2.14 %) | 156.563 (−13.83 %) |
| 16×8 | MPI | 387 | 329.12 | 117.911 |
| | Optimized-MPI | 329 (−14.15 %) | 318.19 (−3.44 %) | 104.685 (−12.63 %) |
| 32×8 | MPI | 233.14 | 328.33 | 76.547 |
| | Optimized-MPI | 220.78 (−5.59 %) | 310.38 (−5.72 %) | 68.526 (−11.70 %) |
| 64×8 | MPI | 138.58 | 327.45 | 44.396 |
| | Optimized-MPI | 125.73 (−10.22 %) | 307.83 (−6.37 %) | 38.703 (−14.71 %) |

#### 5.2.2 SP-MZ

For SP-MZ, loop blocking was applied with an optimal block size of 4×4. Table 7 presents the performance results for the hybrid SP-MZ application for Class D. The combination of DCT+DVFS with loop optimizations reduces average power consumption by up to 7.96 % and saves energy by up to 19.59 %.

**Table 7** Performance comparison of hybrid SP-MZ

| #Cores (M×N) | SP-MZ Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 6×8 | Hybrid | 862 | 340.453 | 293.404 |
| | Optimized-Hybrid | 819 (−5.25 %) | 320.19 (−6.32 %) | 262.236 (−11.89 %) |
| 8×8 | Hybrid | 653 | 340.59 | 222.408 |
| | Optimized-Hybrid | 607 (−7.57 %) | 323.12 (−5.4 %) | 196.134 (−13.39 %) |
| 16×8 | Hybrid | 389 | 341.14 | 132.703 |
| | Optimized-Hybrid | 344 (−13 %) | 322.57 (−5.71 %) | 110.964 (−19.59 %) |
| 32×8 | Hybrid | 225 | 340.11 | 76.525 |
| | Optimized-Hybrid | 205 (−9.76 %) | 315.03 (−7.96 %) | 64.581 (−18.49 %) |
| 64×8 | Hybrid | 154 | 341.14 | 52.536 |
| | Optimized-Hybrid | 142 (−8.45 %) | 316.87 (−7.66 %) | 44.996 (−16.75 %) |

**Table 8** Performance comparison of MPI SP-MZ

| #Cores (M×N) | SP-MZ Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 6×8 | MPI | 881 | 339.14 | 298.782 |
| | Optimized-MPI | 807 (−9.18 %) | 315.02 (−6.59 %) | 254.221 (−17.53 %) |
| 8×8 | MPI | 689 | 338.46 | 233.199 |
| | Optimized-MPI | 619 (−11.3 %) | 321.07 (−5.41 %) | 198.742 (−17.33 %) |
| 16×8 | MPI | 413 | 337.45 | 139.367 |
| | Optimized-MPI | 369 (−11.9 %) | 322.45 (−4.65 %) | 118.984 (−17.13 %) |
| 32×8 | MPI | 241 | 338.14 | 81.492 |
| | Optimized-MPI | 229.53 (−5.0 %) | 315.15 (−7.30 %) | 72.336 (−12.66 %) |
| 64×8 | MPI | 173.87 | 337.63 | 58.704 |
| | Optimized-MPI | 165.81 (−4.84 %) | 312.29 (−8.11 %) | 51.781 (−13.37 %) |

#### 5.2.3 GTC

**Table 9** Performance comparison of hybrid GTC (100PPC)

| #Cores (M×N) | GTC Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 6×8 | Hybrid | 928 | 330.54 | 306.74 |
| | Optimized-Hybrid | 904 (−2.65 %) | 301.45 (−9.65 %) | 272.51 (−12.29 %) |
| 8×8 | Hybrid | 934 | 333 | 311.02 |
| | Optimized-Hybrid | 902 (−3.55 %) | 297 (−12.12 %) | 274.21 (−13.42 %) |
| 16×8 | Hybrid | 947 | 334 | 316.30 |
| | Optimized-Hybrid | 906 (−4.53 %) | 298 (−12.1 %) | 269.99 (−17.15 %) |
| 32×8 | Hybrid | 954 | 328.89 | 313.76 |
| | Optimized-Hybrid | 918 (−3.92 %) | 296.80 (−10.81 %) | 272.46 (−15.16 %) |
| 64×8 | Hybrid | 958 | 328.79 | 314.98 |
| | Optimized-Hybrid | 923 (−3.79 %) | 294.15 (−11.77 %) | 271.5 (−16.01 %) |

**Table 10** Performance comparison of MPI GTC (100PPC)

| #Cores (M×N) | GTC Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 6×8 | MPI | 1413.19 | 338.19 | 477.93 |
| | Optimized-MPI | 1389.38 (−1.71 %) | 302.12 (−11.94 %) | 419.76 (−13.86 %) |
| 8×8 | MPI | 1440.02 | 339.32 | 488.63 |
| | Optimized-MPI | 1401.9 (−2.71 %) | 306.57 (−10.68 %) | 429.78 (−13.69 %) |
| 16×8 | MPI | 1456 | 339.45 | 494.24 |
| | Optimized-MPI | 1413.34 (−3.02 %) | 306.19 (−10.86 %) | 432.75 (−14.21 %) |
| 32×8 | MPI | 1483.13 | 339.12 | 502.96 |
| | Optimized-MPI | 1451.39 (−2.19 %) | 304.19 (−11.48 %) | 441.50 (−13.92 %) |
| 64×8 | MPI | 1513.39 | 339.05 | 513.14 |
| | Optimized-MPI | 1459.10 (−3.72 %) | 301.38 (−12.50 %) | 439.74 (−16.69 %) |

Loop blocking is applied to the MPI implementation with an optimal block size of 4×4 for both problem sizes of 50 ppc and 100 ppc. Similar to the hybrid implementation, the inner-most loops of the pushi and chargei subroutines are unrolled four times. The manual loop optimizations are able to achieve strong reductions in execution time for 50 ppc, but smaller optimization benefits are obtained in terms of execution time for 100 ppc. It is important to note that the GTC application benefits greatly from the use of OpenMP threads as shown in Tables 9 and 10.

#### 5.2.4 PMLB

**Table 11** Performance comparison of hybrid PMLB

| #Cores (M×N) | PMLB Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 1×8 | Hybrid | 1878.15 | 281.19 | 528.12 |
| | Optimized-Hybrid | 1761.03 (−6.65 %) | 270.49 (−3.96 %) | 476.34 (−10.87 %) |
| 2×8 | Hybrid | 935.22 | 279.45 | 261.35 |
| | Optimized-Hybrid | 901.71 (−3.72 %) | 268.19 (−4.2 %) | 241.83 (−8.07 %) |
| 4×8 | Hybrid | 416.83 | 280.37 | 116.87 |
| | Optimized-Hybrid | 398.17 (−4.69 %) | 260.53 (−7.61 %) | 103.74 (−12.65 %) |
| 8×8 | Hybrid | 195.31 | 281.67 | 55.01 |
| | Optimized-Hybrid | 184.39 (−5.92 %) | 255.19 (−10.37 %) | 47.05 (−16.9 %) |
| 16×8 | Hybrid | 104.18 | 280.53 | 29.23 |
| | Optimized-Hybrid | 97.13 (−7.26 %) | 265.14 (−5.80 %) | 25.75 (−13.51 %) |
| 32×8 | Hybrid | 57.72 | 276.71 | 15.97 |
| | Optimized-Hybrid | 56.81 (−1.6 %) | 270.19 (−2.41 %) | 15.34 (−4.1 %) |

**Table 12** Performance comparison of MPI PMLB

| #Cores (M×N) | PMLB Type | Runtime (s) | Average Power (W) | Energy Per Node (KJ) |
|---|---|---|---|---|
| 1×8 | MPI | 1259.87 | 302.76 | 381.44 |
| | Optimized-MPI | 1247.13 (−1.02 %) | 284.90 (−6.27 %) | 355.27 (−7.37 %) |
| 2×8 | MPI | 689.31 | 302.98 | 208.85 |
| | Optimized-MPI | 664.19 (−3.78 %) | 282.14 (−7.39 %) | 187.39 (−11.45 %) |
| 4×8 | MPI | 379.12 | 301.18 | 114.18 |
| | Optimized-MPI | 362.29 (−4.65 %) | 282.33 (−6.68 %) | 102.29 (−11.62 %) |
| 8×8 | MPI | 185.35 | 300.79 | 55.75 |
| | Optimized-MPI | 180.13 (−2.90 %) | 281.13 (−6.99 %) | 50.64 (−10.1 %) |
| 16×8 | MPI | 88.93 | 300.84 | 26.75 |
| | Optimized-MPI | 89.46 (+0.59 %) | 285.14 (−5.51 %) | 25.51 (−4.86 %) |
| 32×8 | MPI | 43.12 | 301.29 | 12.99 |
| | Optimized-MPI | 46.79 (+7.84 %) | 286.91 (−5.01 %) | 13.42 (+3.2 %) |

## 6 Related work

The use of performance counters to predict power consumption has been explored in previous work [3–6, 15, 21, 22]. In general, previous work identifies a set of common performance counters used across all of the applications considered. The same counters and correlation coefficients are used for a class or group of applications, but this approach does not capture characteristics unique to each application. In contrast, E-AMOM develops models for each individual application, thereby capturing the unique characteristics of each application that impact runtime and power consumption. Further, E-AMOM uses the performance counters to identify methods for reducing power consumption.

In [22], power estimations using performance counters are presented with median errors of 5.63 % for developing a power-aware thread scheduler. In [13], two energy-saving techniques, DVFS and DCT, are applied to hybrid HPC application codes, yielding energy savings in the range of 4.1 % to 13.8 % with negligible performance loss. Our work combines software-based power reduction strategies with algorithmic changes to improve application performance and save power. Our scheme uses performance models to predict the effects that DVFS and DCT strategies have on application performance, refining the regression model for each application's characteristics. Our work differs from previous approaches in that we identify alternative frequency and concurrency settings for an application's kernels to reduce power consumption.

## 7 Summary

The E-AMOM framework provides an accurate methodology for predicting and improving the performance and power consumption of HPC applications. It utilizes two software-based approaches, DVFS and DCT, for reducing power consumption in HPC applications. Specifically, E-AMOM determines efficient execution configurations of HPC application kernels, including the number of OpenMP threads to use to execute each kernel in the hybrid applications. Overall, our E-AMOM predictive models have an average error rate of at most 6.79 % across six hybrid and MPI scientific applications. Our optimization approach reduces energy consumption by up to 21 %, with a reduction in runtime by up to 14.15 % and a reduction in power consumption by up to 12.50 %, across six hybrid and MPI HPC applications. Future research will focus on identifying appropriate optimization strategies for additional classes of applications and incorporating them into E-AMOM. E-AMOM will be further integrated into the MuMMI framework to automate the modeling and optimization processes and identify optimal configuration points.

## Acknowledgements

This work is supported by NSF grants CNS-0911023, CNS-0910899, CNS-0910784, CNS-0905187. The authors would like to acknowledge Stephane Ethier from Princeton Plasma Physics Laboratory for providing the GTC code, and Chee Wai Lee for his review comments.