1 Introduction

Rising energy costs and the increase in data center power consumption, driven by an ever increasing demand for data services, are becoming a dominating factor in the Total Cost of Ownership (TCO) over the lifetime of a computing system. Additionally, current semiconductor technology is approaching a point where downsizing, and thus the inherent reduction of power, will no longer be possible, mainly due to the increase in leakage currents [23].

Many high profile studies document the increase in data center power demands as well as the power challenges that high performance Exascale computing will pose. For example, a study sponsored by the German Federal Ministry of Economics and Technology (Bundesministerium für Wirtschaft und Technologie, BMWi) in 2009 [20] showed that data centers in Germany consumed 9 TWh of electrical energy in 2007, with predictions of 9.5 TWh in 2010 (corresponding to roughly 1.3 billion Euro at that time), 10.3 TWh by 2015, and 12.3 TWh by 2020. The same trend is visible for worldwide data center power consumption from 2000 to 2005 [25] and from 2005 to 2010 [26].

With one eye towards future developments, the High Performance Computing (HPC) community has realized that energy efficiency improvements are needed for sustainable multi-petascale and Exascale computing. The “Exascale Computing Study: Technology Challenges in Achieving Exascale Systems” [24] sets a definite power challenge of 20 MW, which became a main driver for energy efficiency research in HPC. Reducing the energy consumption and the overall environmental impact of data centers has therefore become an important research area.

The last couple of years have seen a rapid increase in projects and publications covering a multitude of topics in this field. Unfortunately, without some underlying basic structure, it is difficult for data center operators to identify center specific optimization approaches, to find relevant research areas and publications, and to identify and apply possible solutions. Current focus areas for data center energy efficiency improvements are: energy efficient application design (which in most cases is directly related to performance improvement efforts such as AutoTune [3]); application co-design (developing an HPC system that supports the application behavior in terms of communication, memory access patterns, or I/O requirements, which is the focus of Exascale projects such as DEEP [10] and MontBlanc [30]); batch scheduler controlled Dynamic Voltage and Frequency Scaling (DVFS), as in [15, 19]; and scheduling for a federation of data centers, as investigated, for example, by the Fit4Green [12] and All4Green [1] projects. Data center instrumentation is another important research area, concerned with defining how and where to measure certain key performance indicators (KPIs) related to energy efficient HPC. However, current systems often lack proper instrumentation and, therefore, do not allow for the easy collection of the required data. The Energy Efficient HPC working group [17] is one of the major driving forces helping the HPC community define the needed HPC system and data center instrumentation and providing guidelines on how to measure the KPIs defined by other standardization bodies such as ASHRAE [2] and The Green Grid [16].

The 4 Pillar Framework for energy efficient HPC data centers was specifically developed to address the lack of a foundation for energy efficiency efforts in high performance computing and to help HPC data centers identify areas of improvement and develop applicable solutions. It is completely data center independent, providing a generic way to look at the energy efficiency improvement domain. Other related works share some ideas with the 4 Pillar Framework but leave out significant parts. For example, “Energy Efficiency in Data Centers: A new Policy Frontier” [28] refers only to parts of Pillars 1 and 2 and thus focuses solely on data center energy efficiency from an operational point of view. Another example, the “DPPE: Holistic Framework for Data Center Energy Efficiency” [8], references Pillars 1, 2, and 3 to define the data center energy flow. This flow is then used to find areas of improvement and the corresponding division (operating unit) in the data center. It considers each ‘pillar’ as a separate improvement area with its own measurable KPI (GEC: Green Energy Coefficient, PUE: Power Usage Effectiveness, ITEE: IT Equipment Energy Efficiency, ITEU: IT Equipment Utilization). This approach also has a strong operational focus.

Both examples can be seen as specific implementations of the 4 Pillar Framework. They show that the framework can be the foundation for creating energy efficiency models for specific data centers and for modeling data center energy flow chains. This is possible because the 4 Pillar Framework acknowledges that different data centers have different goals and requirements, that applications play an important part in the energy efficiency of HPC data centers, and that the addition of cross pillar interactions allows for more fine-tuned energy efficiency related decisions.

This paper will present the basis for the 4 Pillar Framework, cover the definition and goal of each pillar, and provide a representative model using activities performed at Leibniz Supercomputing Center of the Bavarian Academy of Sciences and Humanities (BAdW-LRZ).

2 The 4 Pillar Framework

The 4 Pillar Framework was developed at BAdW-LRZ mainly to solve two problems. The first was to understand and categorize all internal and external efforts in the area of improving data center energy efficiency. The second was to provide a foundation for planning future work and for presenting current efforts to outside stakeholders. Addressing these two areas makes it possible to work on the BAdW-LRZ data center goal of reducing the TCO over the lifetime of its HPC systems.

The 4 pillars originated from the need to answer the question: Which HPC data center aspects play an important part in improving energy efficiency? To answer this question, BAdW-LRZ looked at current research efforts, industry standardization bodies, its own goals and efforts, and the building blocks of its data center. Using this methodology, four clusters of activities were identified that make up the 4 Pillar Framework. The identified pillars are:

  1. Building Infrastructure

  2. System Hardware

  3. System Software

  4. Applications

As can be seen in Fig. 1, the four pillars have individual goals and propose different activities to achieve these goals. Yet, one topic that spans all pillars is monitoring. For now, the monitoring demands of each pillar can be summarized as ‘the more the better’: without extensive monitoring, no system analysis can be done, and without analysis, no optimization and, ultimately, no energy efficiency improvement is possible.

Fig. 1 The 4 Pillar Framework for energy efficient HPC data centers

2.1 Pillar 1: Building Infrastructure

Pillar 1 “Building Infrastructure” represents the complete non-IT infrastructure needed to operate a data center. The goal of this pillar is the improvement of data center energy efficiency key performance indicators such as Power Usage Effectiveness (PUE), Water Usage Effectiveness (WUE), Carbon Usage Effectiveness (CUE), and Energy Reuse Effectiveness (ERE). These have been addressed in research work as well as by many standardization and industrial bodies. Some of the most noteworthy standardization organizations related to Pillar 1 are ASHRAE [2] and “The Green Grid” [16], which have developed best practice guides and evaluation metrics for efficient data center operation. Important aspects of this pillar are the reduction of losses in the energy supply chain, the use of more efficient cooling technologies and cooling practices, and the re-use of the generated waste heat (which is possible, for example, when using hot water cooling [6]). It is expected that the last item in particular, energy reuse, will become an important part of any sustainable Exascale computing strategy.
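To make these KPIs concrete, the following minimal sketch computes PUE and ERE following The Green Grid definitions; all annual energy figures are illustrative assumptions, not measurements from a real data center.

```python
# Minimal sketch of two Pillar 1 KPIs (The Green Grid definitions).
# The annual energy figures below are illustrative assumptions.

it_energy_mwh = 10_000.0        # energy consumed by the IT equipment
facility_energy_mwh = 13_500.0  # total facility energy (IT + cooling + power distribution + lighting)
reused_energy_mwh = 2_000.0     # waste heat exported for reuse, e.g. office heating

pue = facility_energy_mwh / it_energy_mwh                        # Power Usage Effectiveness
ere = (facility_energy_mwh - reused_energy_mwh) / it_energy_mwh  # Energy Reuse Effectiveness

print(f"PUE = {pue:.2f}")  # 1.35
print(f"ERE = {ere:.2f}")  # 1.15: heat reuse lowers the effective overhead
```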

2.2 Pillar 2: System Hardware

Pillar 2 “System Hardware” represents all IT systems, networks, and storage systems in the data center. The goal of this pillar is the reduction of hardware power consumption. It is a pillar over which data center managers have only minor direct influence; the main instrument for change is the contract phase for new system acquisitions, and even then one is bound by the available technologies. Fortunately, this is also an area where the data center does not have to be actively involved, because vendors see efficiency improvements as a big competitive edge. For example, Intel’s Sandy Bridge technology introduced many new energy saving features [29], IBM is actively researching the re-use possibilities offered by hot water cooled supercomputers [35], and Intel is investigating dynamic voltage and frequency scaling for memory [9]. Newer generations of data center hardware usually come with higher efficiency, better monitoring capabilities, and more control functions. While hardware is constrained by vendor product availability, there are ways to stimulate innovation. For example, by combining system costs with operational (energy) costs into one acquisition budget for a contract, the vendor gains a strong incentive to develop products that are more energy efficient.
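As an illustration of this procurement idea, the sketch below folds the expected energy bill into a single budget figure so that a more efficient but more expensive offer can win; the offers, power draws, and the electricity price are hypothetical.

```python
# Minimal sketch: acquisition cost and expected energy cost combined into one
# budget figure. Offers, prices, and power draws are hypothetical.

def total_cost_of_ownership_eur(capex_eur, avg_power_kw, years, eur_per_kwh=0.15):
    """Purchase price plus the energy bill over the planned operating lifetime."""
    energy_kwh = avg_power_kw * 24 * 365 * years
    return capex_eur + energy_kwh * eur_per_kwh

offer_a = total_cost_of_ownership_eur(capex_eur=40_000_000, avg_power_kw=3_000, years=5)
offer_b = total_cost_of_ownership_eur(capex_eur=43_000_000, avg_power_kw=2_400, years=5)

print(f"offer A: {offer_a / 1e6:.1f} M EUR")  # 59.7 M EUR
print(f"offer B: {offer_b / 1e6:.1f} M EUR")  # 58.8 M EUR, cheaper despite the higher purchase price
```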

2.3 Pillar 3: System Software

Pillar 3 “System Software” represents all system level software that is required to use the system hardware. The goal of this pillar is to make the best use of the available resources and to fine tune the software tools for the system(s). This can be done by using a workload management system according to the data center policies and goals, by taking advantage of the energy saving features provided by the underlying hardware with respect to application needs [4, 22], and by reducing the number of idle resources whenever possible [31]. The capabilities of the system software are determined by the underlying hardware but, at the same time, some hardware functions are only usable with higher level software support. For example, without operating system support for P-states, they cannot be used even though the CPU hardware provides them.
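As a minimal illustration of this hardware/software dependency, the following sketch (assuming a Linux system that exposes the cpufreq sysfs interface) checks whether the operating system makes P-state control visible to system software at all.

```python
# Minimal sketch (assumption: a Linux system with the cpufreq sysfs interface).
# If the interface is absent, hardware P-states cannot be exploited by the
# system software layer at all.

from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read(name):
    """Return the content of a cpufreq attribute for CPU 0, or None if missing."""
    path = CPUFREQ / name
    return path.read_text().strip() if path.exists() else None

if not CPUFREQ.exists():
    print("No OS-level cpufreq support: P-states are not usable by system software.")
else:
    print("scaling driver:  ", read("scaling_driver"))    # e.g. acpi-cpufreq or intel_pstate
    print("active governor: ", read("scaling_governor"))  # e.g. ondemand, performance
    print("freq range (kHz):", read("scaling_min_freq"), "-", read("scaling_max_freq"))
```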

2.4 Pillar 4: Applications

Pillar 4 “Applications” represents all user applications. The goal of this pillar is the optimization of an application’s performance on specific hardware in terms of raw performance and scalability [5]. This is usually done by selecting the most efficient algorithms, the best libraries for a particular architecture, and the best suited programming paradigms and languages. Any improvement of an application’s performance will, in most cases, also reduce its overall energy consumption. This behavior can be seen with the PRACE 1st Implementation Phase prototypes [21]. Nevertheless, the capability to correlate power consumption with application behavior is an important step toward understanding how implementation choices affect the HPC system power consumption [27, 34].
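A minimal sketch of why performance improvements usually translate into energy savings: with a roughly constant node power draw, energy-to-solution is essentially power integrated over runtime. The power samples and runtimes below are purely illustrative.

```python
# Minimal sketch: energy-to-solution as sampled power integrated over runtime.
# The two runs are hypothetical; the optimized run draws slightly more power
# but finishes earlier, so it still uses less energy overall.

def energy_to_solution_kwh(power_samples_w, sample_interval_s):
    """Sum of power samples (W) times the sampling interval (s), converted to kWh."""
    return sum(power_samples_w) * sample_interval_s / 3.6e6

baseline  = energy_to_solution_kwh([310.0] * 3600, sample_interval_s=1.0)  # 60 min run
optimized = energy_to_solution_kwh([325.0] * 2700, sample_interval_s=1.0)  # 45 min run

print(f"baseline run:  {baseline:.2f} kWh")   # 0.31 kWh
print(f"optimized run: {optimized:.2f} kWh")  # 0.24 kWh
```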

3 Pillar interactions and existing research gaps

Pillars 1 and 2 define the foundation for the data center which, by its nature, is not easily changed. Pillars 3 and 4 build upon this foundation and constitute the area in which energy efficiency gains can be developed most easily.

The first step of using the 4 Pillar model for efficiency improvements in the data center is to look at each pillar as being independent from the others. In fact, this is how most data centers are optimized today.

In Pillar 1 the building infrastructure is designed for a specific cooling capacity and in most cases it is optimized for the maximum expected load.

In Pillar 2, newer hardware technology will usually provide the ‘current best’ in energy efficiency within the current data center restrictions and requirements; for example, the cooling technologies selected for new systems will depend on the data center capabilities in Pillar 1. But there is also room for further investigation, for example in the area of network power consumption. Currently, the prevailing HPC networks provide no advanced power saving modes, and the influence of the application communication pattern on the system power consumption is not well understood and requires further investigation [18].

In Pillar 3, the operating system (OS) is an important part of the energy saving possibilities. As most HPC systems use a Linux kernel, which has an active community providing updates, HPC managers do not have to focus on developing new energy saving features (i.e., the OS updates become a pipeline for these features). The batch scheduling system, which controls the system resource allocation and workload management, has the highest customization potential in Pillar 3. For example, many available scheduling systems can be optimized via pre-defined policies. Additionally, one could implement software based support for specific energy saving policies in the scheduling system itself or as a higher level tool on top of the scheduler. This higher level tool could, for example, validate the workload’s energy requirements in order to ensure that certain energy-driven policies are not violated before the workload is actually scheduled.

In Pillar 4 the user applications are performance optimized for a specific system; this could include increasing the scalability of the code or just the use of specific compiler flags for the new architecture. In some cases, hand-tuned libraries and compiler intrinsics are used to further optimize the applications for the given platform.

Optimization that focuses on a single pillar is a valid approach: effort can be concentrated on the pillars that are easiest to optimize for an individual data center. However, current results and experience at BAdW-LRZ show that this approach is hitting an upper boundary in terms of achievable energy efficiency. In most cases, further improvements are only possible if optimization tools can use insight gained from other pillars, which results in more prudent decisions.

A good example showing the necessity of cross-pillar interactions is SuperMUC, the latest supercomputer deployed at BAdW-LRZ. SuperMUC consists of over 150,000 CPU cores in 9400 compute nodes, more than 320 TB RAM, and serves as a Tier-0 supercomputer in PRACE [32]. The power profile of SuperMUC is radically different from previous generations of supercomputers: its power consumption varies greatly over time and, consequently, SuperMUC puts a strongly variable heat load on the cooling infrastructure. Figure 2 shows this behavior. The top curve shows the complete data center power draw and, as can be seen, the SuperMUC power draw (middle curve) dictates the overall power profile. In comparison, one of the older cluster systems (bottom curve) shows a nearly constant power draw. This change in behavior is a result of the SuperMUC hardware energy efficiency. Unfortunately, the infrastructure is still optimized for a relatively constant power draw; this mismatch is one area requiring further research.

Fig. 2 Power consumption profiles of BAdW-LRZ

Another Pillar-spanning concept is heat re-use. Assuming, for example, that heating an office building requires a water temperature of 40 °C, the outlet temperature of the cooling system (Pillar 1) needs to be kept at or above 40 °C. However, the heat generated by the system correlates closely with the workload profile. If the workload power draw goes down, one means of retaining the desired outlet temperature could be to deactivate hardware energy saving features (Pillar 2). On the other hand, if there is less demand for the heat because of warmer weather, the power saving features of the system should be re-enabled.
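A minimal sketch of such a cross-pillar rule is given below; the threshold, the inputs, and the decision function are illustrative assumptions, not the actual BAdW-LRZ control logic.

```python
# Minimal sketch of the heat-reuse rule described above. A real controller
# would read the outlet temperature from the cooling loop (Pillar 1) and
# toggle node power management features (Pillar 2) through the vendor's
# management interface; here everything is illustrative.

TARGET_OUTLET_C = 40.0  # minimum water temperature required by the heating loop

def power_saving_should_be_enabled(outlet_temp_c, heat_demand_kw):
    """Decide whether hardware energy saving features should stay enabled."""
    if heat_demand_kw <= 0.0:
        return True    # nobody needs the heat (e.g. warm weather): always save power
    if outlet_temp_c < TARGET_OUTLET_C:
        return False   # keep the nodes running hot to sustain the heating loop
    return True        # demand is already satisfied: save power again

print(power_saving_should_be_enabled(outlet_temp_c=38.5, heat_demand_kw=120.0))  # False
print(power_saving_should_be_enabled(outlet_temp_c=42.0, heat_demand_kw=0.0))    # True
```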

A last example of a Pillar-spanning scenario is power capping. Here the goal is to control the workload and system resource usage according to restrictive data center policies (an example can be found in [11]). These restrictions could be infrastructure limitations (e.g. during cooling tower maintenance) or business related constraints (e.g. a predefined monthly allowance for power costs). The resource scheduling system (Pillar 3) would require information from Pillar 1 (e.g. available cooling capacity), Pillar 2 (e.g. power consumption of the HPC system hardware with different energy saving techniques), and Pillar 4 (e.g. estimated power consumption of the applications). Only if this information is available can the scheduler provide support for a specific power capping goal.
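The following sketch illustrates how such a cross-pillar admission check might look; the function, its interface, and all numbers are assumptions for illustration, not an existing scheduler feature.

```python
# Minimal sketch of a power-capping admission check that combines information
# from the pillars: infrastructure limits (Pillar 1), measured hardware draw
# (Pillar 2), and an application power estimate (Pillar 4). All values and the
# function itself are hypothetical.

def may_start_job(job_power_est_w, current_system_power_w,
                  cooling_capacity_w, policy_cap_w):
    """Admit the job only if the projected draw stays below every active cap."""
    effective_cap = min(cooling_capacity_w, policy_cap_w)
    projected = current_system_power_w + job_power_est_w
    return projected <= effective_cap

admit = may_start_job(job_power_est_w=150_000,           # Pillar 4: estimated job draw
                      current_system_power_w=2_200_000,  # Pillar 2: current HPC system draw
                      cooling_capacity_w=2_600_000,      # Pillar 1: available cooling capacity
                      policy_cap_w=2_300_000)            # business policy limit
print(admit)  # False: the 2.3 MW policy cap would be exceeded
```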

4 Application of the 4 Pillar Framework at BAdW-LRZ

Figure 3 shows the current high level activities at BAdW-LRZ. These comprise heat reuse solutions, the collection of power monitoring data for applications, the tuning of HPC systems for energy efficiency, and active involvement in the development of application tuning tools.

Fig. 3 High level energy efficiency activities at BAdW-LRZ

Figure 4 shows the current activities for each pillar at BAdW-LRZ and the groups responsible for work in each pillar. The CS (Central Services) group is in charge of the BAdW-LRZ data center infrastructure, IBM is responsible for everything related to the SuperMUC hardware, the HPC (High Performance Computing) group is responsible for HPC system administration and low level tools, and the APP (Applications) group is responsible for scaling and performance improvements of specific HPC applications. This view provides an easy way to identify the data center groups that need to coordinate with each other concerning a specific cross pillar activity.

Fig. 4 Current energy efficiency activities at BAdW-LRZ

For heat reuse, BAdW-LRZ currently employs two different technologies. The waste heat of the SuperMUC supercomputer is used to heat a new office building, whereas a smaller HPC system, the so-called CooLMUC prototype, uses an adsorption chiller driven by the system’s waste heat to indirectly cool an additional rack.

In order to measure the energy consumed by a specific scientific application, a tool was needed that could collect power data from the HPC system hardware and correlate it with job information from the batch scheduling system. To address this, the PowerDAM V1.0 software was developed, which collects and correlates data from Pillars 2 and 3 [33].
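The sketch below illustrates only the underlying correlation step, not PowerDAM itself: node-level power samples are attributed to a job using its node list and start/end times from the batch system. The data layout and all values are assumed for illustration.

```python
# Minimal sketch of the correlation idea only (this is not PowerDAM): attribute
# node power samples (Pillar 2) to a batch job using its node list and
# start/end times from the scheduler (Pillar 3). Data layout is assumed.

def job_energy_kwh(power_samples, job, sample_interval_s=60.0):
    """power_samples: iterable of (timestamp_s, node_name, power_w) tuples."""
    watts = sum(p for t, node, p in power_samples
                if node in job["nodes"] and job["start"] <= t < job["end"])
    return watts * sample_interval_s / 3.6e6

job = {"id": "1234", "nodes": {"n01", "n02"}, "start": 0, "end": 7200}  # 2 h job
samples = [(t, n, 280.0) for t in range(0, 7200, 60) for n in ("n01", "n02")]

print(f"job {job['id']} consumed {job_energy_kwh(samples, job):.2f} kWh")  # 1.12 kWh
```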

For improved system tuning, DVFS support was enabled for the Slurm resource management system. Also, BAdW-LRZ is currently working with IBM to enable advanced energy aware scheduling techniques in the LoadLeveler resource manager. BAdW-LRZ is also actively involved in application performance improvements through participation in research projects such as AutoTune [3].

5 Mapping of future efforts at BAdW-LRZ to the 4 Pillar Framework

The 4 Pillar Framework for energy efficient HPC data centers can also be used to define and plan future activities. Figure 5 shows the developments and connections needed within and between the pillars in order to achieve optimal energy efficiency for the BAdW-LRZ data center. Only a tight information exchange between all four pillars will enable tools that can provide substantial data center optimization and reduce the overall energy consumption. The aim is a data acquisition infrastructure that can collect important data from all four pillars as well as from outside sources such as the utility providers. This data can then be used by models and simulations to find optimized cross-pillar solutions, which an infrastructure aware global resource manager can apply to optimize overall data center operation and to adjust data center performance according to the data center goals and policies.

Fig. 5 Future energy efficiency plan at BAdW-LRZ

6 Practical application

The 4 Pillar Framework can be used to classify current research projects. The following example shows a possible classification of the Fit4Green project. The items below are taken from the ‘Technical Approach’ section of the project’s overview page [13].

  1. “Fit4Green will introduce an optimization layer on top of existing data centre automation frameworks—integrating them as a modular “plug-in”—to guide the allocation decisions based on the optimized energy model based policies.”—Usually the scheduling system makes the allocation decisions, which is part of Pillar 3.

  2. “Energy consumption models will be developed, and validated with real cases, for all ICT components in an IT solution chain, including the effects due to hosting data centres in sites with particular energy related characteristics, like alternative power availability and energy waste/recycle options, etc.”—‘All ICT components’ could be interpreted as data center infrastructure and IT equipment, which are part of Pillars 1 and 2; however, because the emphasis is on the development of models rather than tools, there is no impact on Pillars 1 and 2 from a practical (solution) point of view.

  3. “Optimizations, based on policy modeling descriptions able to capture the variety of deployment that are possible for a given application or a set of applications, integrated with semantic attributes supporting the evaluation of energy consumption models, will guide the deployments decisions on which / when equipments needs to stay on, where/when applications should be deployed/relocated, also capitalizing on the intrinsic non linear behaviors of energy consumption grow with respect to the load of ICT components.”—‘Applications’ relate to Pillar 4, but the main goal of Fit4Green is a model for describing all possible deployments of an application. Therefore, this item will not impact Pillar 4 from a practical point of view. The deployment decisions relate to the scheduling system, which is part of Pillar 3.

  4. “Scenarios will be developed for several IT solutions deployed through three major categories of computing styles: Service/Enterprise Portal (a set of public services and/or private enterprise applications, typically accessible through Web interface, for a large variety of user agents); Grid (a supercomputer system accessible through the Grid middleware); and Cloud (a lab with cloud computing infrastructure based on open systems frameworks)”—This statement means that any solution developed by Fit4Green is only applicable to data centers that operate in a Service/Enterprise Portal, Grid, or Cloud computing style.

  5. “There will be one pilot site for each computing style. Service/Enterprise Portal, Grid and Cloud pilots will support both single site and federated sites scenarios: multiple collaborating data centres inside the same organization for the Portal pilot; federation of supercomputer systems for Grid and open cloud federation of multiple labs for the Cloud.”—The main target of the project appears to be a federation of data centers.

Using the information found in the ‘Technical Approach’ page, the Fit4Green project activities can be classified under Pillar 3 and are only applicable to a cloud, grid, or service portal based federation of data centers. This classification approach can be used on any research activity related to energy efficiency improvements for data centers.

Table 1 shows a possible classification for several current research projects (CoolEMAll [7], All4Green [1], Fit4Green [12], and GAMES [14]). The pillar columns (Pillar 1 to Pillar 4) indicate which pillars are the main focus of each project (marked as “covered”) and list any project specific assumptions or restrictions for each pillar. The “Comments” column lists assumptions or restrictions related to the project in general.

Table 1 Example project classification using the 4 pillar framework

This classification can help to find projects that are of interest for a specific data center. If, for example, one is looking for improvements in Pillar 1, one can remove Fit4Green from the list of projects to look into. Similarly, if one is not part of a federation of data centers, one can focus on CoolEMAll and GAMES.

7 Conclusion

As demonstrated in this paper, the “4 Pillar Framework for energy efficient HPC data centers” can be used by individual data centers, as well as by research communities, as a foundation for all efforts related to improving data center energy efficiency. It can be used to:

  1. Identify data center specific areas of improvement

  2. Identify required company resources and stakeholders

  3. Guide current energy efficiency efforts

  4. Plan future work

  5. Present results and plans to stakeholders

  6. Classify current research efforts

It is flexible enough to be a foundation for efforts related to:

  1. Federations of data centers

  2. Individual data centers

  3. Individual data center parts and systems

  4. Individual data center goals and policies

The 4 Pillar Framework has already shown its practical applicability and continues to play a significant role in the strategic planning of research activities and modernization efforts at BAdW-LRZ.