The 4 Pillar Framework for energy efficient HPC data centers
- Cite this article as:
- Wilde, T., Auweter, A. & Shoukourian, H. Comput Sci Res Dev (2014) 29: 241. doi:10.1007/s00450-013-0244-6
Improving energy efficiency has become a major research area not just for commercial data centers but also for high performance computing (HPC) data centers. While many approaches for reducing the energy consumption in data centers and HPC sites have been proposed and implemented, many research teams focused on improving the energy efficiency of data centers still work independently from one another. The main reason is that there is no underlying framework that would allow them to relate their work to achievements made elsewhere. Also, without some frame of reference, the produced results are either not easily applicable beyond their origin or it is not clear if, when, where, and for whom else they are actually useful. This paper introduces the “4 Pillar Framework for Energy Efficient HPC Data Centers” which can be used by HPC center managers to holistically evaluate their site, find specific focus areas, classify current research activities, and identify areas for further improvement and research. The 4 pillars are: 1. Building Infrastructure; 2. HPC Hardware; 3. HPC System Software; and 4. HPC Applications. While most HPC centers already implement optimizations within each of the pillars, optimization efforts crossing the pillar boundaries are still rare. The 4 Pillar Framework, however, specifically encourages such cross-pillar optimization efforts. Besides introducing the framework itself, this paper shows its applicability by mapping current research activities in the field of energy efficient HPC conducted at Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities to the framework as reference.
Keywords: Energy efficiency, HPC, Data centers, Framework, 4 Pillars
Rising energy costs, and the increase in data center power consumption driven by an ever increasing demand for data services, are becoming a dominating factor for the Total Cost of Ownership (TCO) over the lifetime of a computing system. Additionally, current semiconductor technology will be hitting a point where downsizing and, thus, inherent reduction of power, will no longer be possible mainly due to the increase in leakage currents .
Many high profile studies show the increase of data center power demands as well as the power challenges that high performance Exascale computing will pose. For example, a study sponsored by the German Federal Ministry of Economics and Technology (Bundesministerium für Wirtschaft und Technologie, BMWi) in 2009  showed that data centers in Germany used 9 TWh of electrical energy per year in 2007, with predictions of 9.5 TWh in 2010 (corresponding to roughly 1.3 billion euros at that time), 10.3 TWh by 2015, and 12.3 TWh by 2020. The same trend is visible for the worldwide data center power consumption from the years 2000 to 2005  and from 2005 to 2010 .
With one eye towards future developments, the High Performance Computing (HPC) community realized that energy efficiency improvements are needed for sustainable multi-peta and Exascale computing. The “Exascale Computing Study: Technology Challenges in Achieving Exascale Systems”  sets a definite power challenge of 20 MW, which became a main driver for energy efficiency research in HPC. Therefore, reducing the energy consumption and overall environmental impact of data centers has become an important research area.
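The 20 MW challenge can be restated as a target efficiency. The following is a minimal sketch of that arithmetic, assuming that "Exascale" means a sustained 10^18 floating point operations per second; the function name is illustrative and not taken from the cited study.

```python
# Assumption: Exascale is taken as 1e18 floating point operations per second.
EXAFLOP = 1e18
# The 20 MW power challenge named in the text, in watts.
POWER_BUDGET_W = 20e6


def required_efficiency_gflops_per_watt(flops: float, power_watts: float) -> float:
    """Return the efficiency needed to reach `flops` within `power_watts`."""
    return flops / power_watts / 1e9


if __name__ == "__main__":
    target = required_efficiency_gflops_per_watt(EXAFLOP, POWER_BUDGET_W)
    print(f"Required efficiency: {target:.0f} GFLOPS/W")
```

Under these assumptions an Exascale system has to deliver about 50 GFLOPS per watt for the whole machine, which illustrates why the target drives work in all four pillars rather than in hardware alone.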
The last couple of years have seen a rapid increase in projects and publications covering a multitude of topics in this field. Unfortunately, without some underlying basic structure it is difficult for data center operators to identify center-specific optimization approaches, to find relevant research areas and publications, and to identify and apply possible solutions. Current focus areas for data center energy efficiency improvements are: energy efficient application design (which in most cases is directly related to performance improvement efforts like AutoTune ); application co-design (developing an HPC system that supports the application behavior in terms of either communication, memory pattern, or I/O requirements, which is the focus of Exascale projects such as DEEP  and MontBlanc ); batch scheduler controlled Dynamic Voltage and Frequency Scaling (DVFS, like in [15, 19]); and scheduling for a federation of data centers, which has been investigated, for example, by the Fit4Green  and All4Green  projects. Data center instrumentation is another important research area, concerned with defining how and where to measure certain key performance indicators (KPIs) related to energy efficient HPC. However, current systems often lack proper instrumentation and, therefore, do not allow for the easy collection of required data. The Energy Efficient HPC working group  is one of the major driving forces trying to help the HPC community to define the needed HPC system and data center instrumentation and to provide guidelines on how to measure the KPIs defined by other standardization bodies such as ASHRAE  and The Green Grid .
The 4 Pillar Framework for energy efficient HPC data centers was specifically developed to address the lack of a foundation for energy efficiency efforts in high performance computing and to help HPC data centers identify areas of improvement and develop applicable solutions. It is completely data center independent because it provides a generic way to look at the energy efficiency improvement domain. Other related works share some ideas with the 4 Pillar Framework but leave out significant parts. For example, “Energy Efficiency in Data Centers: A new Policy Frontier”  refers to parts of Pillars 1 and 2 only and, thus, focuses solely on data center energy efficiency from an operational point of view. Another example, the “DPPE: Holistic Framework for Data Center Energy Efficiency” , references Pillars 1, 2, and 3 to define the data center energy flow. This flow is then used to find areas of improvement and the corresponding division (operating unit) in the data center. It considers each ‘pillar’ as a separate improvement area with its own measurable KPI (GEC: Green Energy Coefficient; PUE: Power Usage Effectiveness; ITEE: IT Equipment Energy Efficiency; ITEU: IT Equipment Utilization). This approach also has a strong operational focus.
Both examples can be seen as one specific implementation of the 4 Pillar Framework. They show that the framework can be the foundation for creating energy efficiency models for specific data centers and for modeling data center energy flow chains. This is possible because the 4 Pillar Framework acknowledges that different data centers have different goals and requirements, that applications play an important part in the energy efficiency of HPC data centers, and that the addition of cross-pillar interactions allows for more fine-tuned energy efficiency related decisions.
This paper presents the basis for the 4 Pillar Framework, covers the definition and goal of each pillar, and provides a representative model using activities performed at Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities (BAdW-LRZ).
2 The 4 Pillar Framework
The 4 Pillar Framework was developed at BAdW-LRZ mainly to solve two problems. The first was to understand and categorize all internal and external efforts in the area of improving the data center energy efficiency. The second was to provide a foundation for planning future work and for presenting current efforts to outside stakeholders. By addressing these two areas it is possible to work on the BAdW-LRZ data center goal of reducing the TCO for the lifetime of its HPC systems.
2.1 Pillar 1: Building Infrastructure
Pillar 1 “Building Infrastructure” represents the complete non-IT infrastructure needed to operate a data center. The goal of this pillar is the improvement of data center energy efficiency key performance indicators such as Power Usage Effectiveness (PUE), Water Usage Effectiveness (WUE), Carbon Usage Effectiveness (CUE), Energy Reuse Effectiveness (ERE), etc. These have been addressed in research work as well as by many standardization and industrial bodies. Some of the most noteworthy standardization organizations related to Pillar 1 are ASHRAE  and “The Green Grid” , which have developed best practices guides and evaluation matrices for efficient data center operation. Important pillar aspects are the reduction of losses in the energy supply chain, the use of more efficient cooling technologies and cooling practices, and the re-use of generated waste heat (which is possible, for example, when using hot water cooling ). It is expected that in particular the last item, energy reuse, will become an important part of any sustainable Exascale computing strategy.
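The Pillar 1 indicators named above are simple ratios against the IT equipment energy. The sketch below follows the commonly published Green Grid definitions; all input values are hypothetical and do not describe any real site.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy per unit of IT energy."""
    return total_facility_kwh / it_kwh


def ere(total_facility_kwh: float, reused_kwh: float, it_kwh: float) -> float:
    """Energy Reuse Effectiveness: like PUE, but credits energy reused elsewhere."""
    return (total_facility_kwh - reused_kwh) / it_kwh


def wue(water_liters: float, it_kwh: float) -> float:
    """Water Usage Effectiveness in liters per kWh of IT energy."""
    return water_liters / it_kwh


def cue(co2_kg: float, it_kwh: float) -> float:
    """Carbon Usage Effectiveness in kg CO2 per kWh of IT energy."""
    return co2_kg / it_kwh


if __name__ == "__main__":
    # Hypothetical annual figures, chosen only to illustrate the ratios.
    total, it, reused = 1200.0, 1000.0, 300.0
    print(f"PUE = {pue(total, it):.2f}")
    print(f"ERE = {ere(total, reused, it):.2f}")
```

With these illustrative numbers the ERE falls below 1.0 although the PUE cannot, which is one way to make the value of waste heat re-use visible in a single indicator.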
2.2 Pillar 2: System Hardware
Pillar 2 “System Hardware” represents all IT systems, networks, and storage systems in the data center. The goal of this pillar is the reduction of hardware power consumption. It is a pillar where data center managers have only minor direct influence. The main instrument for change is the contract phase for new system acquisitions. Yet, even then, one is bound by the available technologies. Fortunately this is also an area where the data center is not required to actively be involved because vendors see efficiency improvements as a big competitive edge. For example, Intel’s Sandy Bridge technology introduces many new energy saving features , IBM is actively researching the re-use possibilities offered by hot water cooled supercomputers , and Intel is investigating dynamic voltage and frequency scaling for memory . Newer generations of data center hardware usually come with higher efficiency and better monitoring capabilities and control functions. While hardware is constrained by vendor product availability, there are ways to stimulate innovation. For example, by combining system costs with operational (energy) costs into one acquisition budget for a contract, the vendor will be actively looking at how to develop products that are more energy efficient.
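The effect of a combined acquisition budget can be illustrated with a simple cost model. This is a minimal sketch, assuming continuous operation, a constant average power draw, a fixed PUE, and a fixed electricity price; the two offers and all figures are hypothetical.

```python
HOURS_PER_YEAR = 8760  # assumption: the system runs all year


def total_cost(purchase_eur: float, avg_power_kw: float, pue: float,
               years: float, eur_per_kwh: float) -> float:
    """Purchase price plus the energy cost over the system lifetime."""
    energy_kwh = avg_power_kw * pue * HOURS_PER_YEAR * years
    return purchase_eur + energy_kwh * eur_per_kwh


if __name__ == "__main__":
    # Hypothetical offers: B costs more up front but draws less power.
    offer_a = total_cost(10_000_000, 1000, 1.2, 5, 0.15)
    offer_b = total_cost(11_000_000, 800, 1.2, 5, 0.15)
    print(f"Offer A: {offer_a:,.0f} EUR")
    print(f"Offer B: {offer_b:,.0f} EUR")
```

Under these assumptions the more expensive but more efficient system is cheaper over five years, so a vendor bidding against a combined budget has a direct incentive to lower power draw.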
2.3 Pillar 3: System Software
Pillar 3 “System Software” represents all system level software that is required to use the system hardware. The goal for this pillar is to make the best use of the available resources and to fine tune the software tools for the system(s). This can be done by using a workload management system that is configured according to the data center policies and goals, by taking advantage of energy saving features provided by the underlying hardware with respect to the application needs [4, 22], and by reducing the number of idle resources whenever possible . The capabilities of the system software are determined by the underlying hardware but, at the same time, some hardware functions are only usable with higher level software support. For example, P-states cannot be used without support in the operating system, even though the CPU hardware provides them.
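On Linux, operating system support for CPU frequency scaling is visible through the cpufreq files in sysfs. The sketch below reads a few of them to check whether scaling is exposed for a given core. Which files are present depends on the kernel and the scaling driver, so missing entries are reported as None; the helper names are illustrative.

```python
from pathlib import Path

# Standard cpufreq attribute names; availability depends on the scaling driver.
CPUFREQ_FILES = (
    "scaling_driver",
    "scaling_governor",
    "scaling_available_governors",
    "scaling_min_freq",
    "scaling_max_freq",
)


def read_cpufreq_info(cpu: int = 0,
                      sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Return the cpufreq attributes of one core, or None where unavailable."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in CPUFREQ_FILES:
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            info[name] = None
    return info


def os_exposes_frequency_scaling(info: dict) -> bool:
    """True if the OS reports an active scaling governor for the core."""
    return info.get("scaling_governor") is not None


if __name__ == "__main__":
    info = read_cpufreq_info(0)
    for key, value in info.items():
        print(f"{key}: {value}")
    print("frequency scaling exposed:", os_exposes_frequency_scaling(info))
```

On a system without a cpufreq driver all values are None, which corresponds to the case described above: the hardware feature may exist, but the software layer does not make it usable.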
2.4 Pillar 4: Applications
Pillar 4 “Applications” represents all user applications. The goal for this pillar is the optimization of an application’s performance related to specific hardware in terms of raw performance and scalability . This is usually done by selecting the most efficient algorithms, the best libraries for a particular architecture, and the best suited programming paradigms and languages. Any improvement of an application’s performance will, in most cases, reduce its overall energy consumption. This behavior can be seen with the PRACE 1st Implementation Phase prototypes . Nevertheless, the capability to correlate power consumption with application behavior is an important step toward an understanding of how implementation choices affect the HPC system power consumption [27, 34].
3 Pillar interactions and existing research gaps
Pillars 1 and 2 define the foundation for the data center which, by its nature, is not easily changed. Pillars 3 and 4 build upon this foundation and constitute the area in which energy efficiency gains can be developed most easily.
The first step of using the 4 Pillar model for efficiency improvements in the data center is to look at each pillar as being independent from the others. In fact, this is how most data centers are optimized today.
In Pillar 1 the building infrastructure is designed for a specific cooling capacity and in most cases it is optimized for the maximum expected load.
In Pillar 2 newer hardware technology will usually provide the ‘current best’ in energy efficiency for the current data center restrictions and requirements. For example, the cooling technologies selected for new systems will depend on the data center capabilities in Pillar 1. But there is also an opportunity for further investigations, for example, in the area of network power consumption. Currently the prevailing HPC networks provide no advanced power saving modes. Also, the influence of the application communication pattern on the system power consumption is not well understood and requires further investigation .
In Pillar 3 the operating system (OS) is an important part of the energy saving possibilities. As most HPC systems use a Linux kernel, which has an active community providing updates, the HPC managers do not have to focus on developing new energy saving features (i.e. the OS updates become a pipeline for these features). The batch scheduling system, which controls the system resource allocation and workload management, has the highest customization potential in Pillar 3. For example, many available scheduling systems can be optimized via pre-defined policies. Additionally, one could implement software based support for specific energy saving policies in the scheduling system itself or as a higher level tool on top of the scheduler. This higher level tool could, for example, validate the workload’s energy requirements in order to ensure that certain energy-driven policies are not violated before the workload is actually scheduled.
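Such a higher level validation tool could be structured as a pre-submission check. The following is a minimal sketch under the assumption that per-node power and runtime estimates are available for each job; the class names, fields, and limits are hypothetical and do not correspond to any particular scheduler.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRequest:
    name: str
    nodes: int
    est_power_per_node_w: float  # assumed to come from earlier measurements
    est_runtime_h: float


@dataclass
class EnergyPolicy:
    max_job_power_w: float     # hypothetical limit on instantaneous job power
    max_job_energy_kwh: float  # hypothetical limit on energy per job


def validate(job: JobRequest, policy: EnergyPolicy) -> List[str]:
    """Return the list of policy violations; an empty list means the job may run."""
    violations = []
    power_w = job.nodes * job.est_power_per_node_w
    energy_kwh = power_w * job.est_runtime_h / 1000.0
    if power_w > policy.max_job_power_w:
        violations.append(f"power {power_w:.0f} W exceeds {policy.max_job_power_w:.0f} W")
    if energy_kwh > policy.max_job_energy_kwh:
        violations.append(
            f"energy {energy_kwh:.1f} kWh exceeds {policy.max_job_energy_kwh:.1f} kWh")
    return violations


if __name__ == "__main__":
    job = JobRequest("cfd_run", nodes=100, est_power_per_node_w=300, est_runtime_h=10)
    policy = EnergyPolicy(max_job_power_w=50_000, max_job_energy_kwh=250)
    print(validate(job, policy) or "job accepted")
```

Keeping the check outside the scheduler means the policy can change without modifying the scheduler itself, at the cost of relying on estimates that have to be supplied from Pillars 2 and 4.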
In Pillar 4 the user applications are performance optimized for a specific system; this could include increasing the scalability of the code or just the use of specific compiler flags for the new architecture. In some cases, hand-tuned libraries and compiler intrinsics are used to further optimize the applications for the given platform.
Optimization which focuses on a single Pillar is a valid approach. Effort can then be focused on the Pillars that are the easiest to optimize for individual data centers. However, current results and experiences at BAdW-LRZ show that this approach is hitting an upper boundary in terms of achievable energy efficiency. In most cases further improvements are only possible if optimization tools can use insight gained from other pillars which will result in more prudent decisions.
Another Pillar-spanning concept is heat re-use. Assuming that, for example, a water temperature of 40 °C is required to heat an office building, the outlet temperature of the cooling system (Pillar 1) needs to be kept at or above 40 °C. However, the heat generated by the system correlates closely with the workload profiles. If the workload power draw goes down, one means of retaining the desired outlet temperature could be the deactivation of hardware energy saving features (Pillar 2). On the other hand, if there is less demand for the heat because of warmer weather, the power saving features of the system should be re-enabled.
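The heat re-use scenario amounts to a small control rule that couples Pillar 1 measurements to a Pillar 2 setting. The sketch below is one possible formulation; the hysteresis band is an added assumption to avoid rapid toggling and is not part of the scenario described in the text.

```python
def energy_saving_enabled(outlet_temp_c: float, heat_demand: bool,
                          currently_enabled: bool,
                          target_c: float = 40.0,
                          hysteresis_c: float = 2.0) -> bool:
    """Decide whether hardware energy saving features should be active.

    target_c is the 40 degree example from the text; hysteresis_c is a
    hypothetical margin added for this sketch.
    """
    if not heat_demand:
        # Nobody needs the heat (e.g. warm weather): always save energy.
        return True
    if outlet_temp_c < target_c:
        # Outlet water is too cold for re-use: let the hardware run hotter.
        return False
    if outlet_temp_c >= target_c + hysteresis_c:
        # Comfortable margin above the target: saving energy is safe again.
        return True
    # Inside the hysteresis band: keep the current setting.
    return currently_enabled


if __name__ == "__main__":
    print(energy_saving_enabled(38.5, heat_demand=True, currently_enabled=True))
    print(energy_saving_enabled(38.5, heat_demand=False, currently_enabled=False))
```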
A last example for a Pillar-spanning scenario is power capping. Here the goal is to control the workload and system resource usage depending on some restrictive data center policies (an example can be found in ). These restrictions could be infrastructure limitations (e.g. during a cooling tower maintenance) or business related constraints (e.g. a predefined monthly allowance for power costs). The resource scheduling system (Pillar 3) would require information from Pillar 1 (e.g. available cooling capacity), Pillar 2 (e.g. power consumption of the HPC system hardware with different energy saving techniques) and Pillar 4 (e.g. estimated consumption of applications). Only if this information is available can the scheduler provide support for a specific power capping goal.
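The information flow in the power capping scenario can be sketched as an admission check inside the scheduler. This is a simplified illustration and not the method of the cited example: it assumes that cooling capacity (Pillar 1), idle and running system power (Pillar 2), and per-job power estimates (Pillar 4) are all available in kW, and it admits queued jobs in order as long as they fit under the cap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    est_power_kw: float  # Pillar 4: estimated application power draw


def effective_cap_kw(cooling_capacity_kw: float, policy_cap_kw: float) -> float:
    """The tighter of the infrastructure limit (Pillar 1) and the business limit."""
    return min(cooling_capacity_kw, policy_cap_kw)


def admit_jobs(queue: List[Job], running_power_kw: float,
               idle_power_kw: float, cap_kw: float) -> List[Job]:
    """Admit queued jobs in order while the remaining power budget allows it."""
    budget = cap_kw - idle_power_kw - running_power_kw
    admitted = []
    for job in queue:
        if job.est_power_kw <= budget:
            admitted.append(job)
            budget -= job.est_power_kw
    return admitted


if __name__ == "__main__":
    # Hypothetical values: reduced cooling capacity during maintenance.
    cap = effective_cap_kw(cooling_capacity_kw=900, policy_cap_kw=1000)
    queue = [Job("A", 250), Job("B", 200), Job("C", 100)]
    started = admit_jobs(queue, running_power_kw=300, idle_power_kw=200, cap_kw=cap)
    print([job.name for job in started])
```

If any of the three inputs is missing, the budget cannot be computed, which mirrors the point that the scheduler can only support a power capping goal when information from all involved pillars is available.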
4 Application of the 4 Pillar Framework at BAdW-LRZ
For heat re-use, BAdW-LRZ currently uses two different technologies. The waste heat from the SuperMUC supercomputer is used to heat a new office building, whereas a smaller HPC system, the so-called CooLMUC prototype, uses an adsorption chiller to indirectly cool an additional rack using energy from the system’s waste heat.
In order to measure the energy consumed by a specific scientific application, a tool was needed that could collect power data from the HPC system hardware and correlate them with job information from the batch scheduling system. To address this, the PowerDAM V1.0 software was developed which collects and correlates data from pillars 2 and 3 .
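The correlation of power data with job information can be illustrated with a small calculation. The sketch below is not PowerDAM's implementation; it assumes power samples of the form (timestamp in seconds, node name, watts) and a job record consisting of a node set and a start and end time, and it integrates the samples that fall inside the job window with the trapezoidal rule.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

JOULES_PER_KWH = 3.6e6

Sample = Tuple[float, str, float]  # (timestamp_s, node, power_w)


def job_energy_kwh(samples: Iterable[Sample], nodes: Set[str],
                   start_s: float, end_s: float) -> float:
    """Energy used by the given nodes between start_s and end_s, in kWh.

    Only samples inside the window are used; no interpolation is done at
    the window edges, so sparse sampling underestimates the result.
    """
    per_node = defaultdict(list)
    for t, node, watts in samples:
        if node in nodes and start_s <= t <= end_s:
            per_node[node].append((t, watts))

    joules = 0.0
    for series in per_node.values():
        series.sort()
        # Trapezoidal rule between consecutive samples of the same node.
        for (t0, p0), (t1, p1) in zip(series, series[1:]):
            joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / JOULES_PER_KWH


if __name__ == "__main__":
    # Hypothetical measurements for two nodes over two hours.
    samples = [
        (0, "n1", 100), (3600, "n1", 100), (7200, "n1", 200),
        (0, "n2", 50), (3600, "n2", 50), (7200, "n2", 50),
    ]
    print(job_energy_kwh(samples, {"n1", "n2"}, 0, 7200))
```

The result is an energy-to-solution value per job, which is the quantity needed to compare application runs against each other across Pillars 2 and 3.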
For improved system tuning, DVFS support was enabled for the Slurm resource management system. Also, BAdW-LRZ is currently working with IBM to enable advanced energy aware scheduling techniques in the LoadLeveler resource manager. BAdW-LRZ is also actively involved in application performance improvements through participation in research projects such as AutoTune .
5 Mapping of future efforts at BAdW-LRZ to the 4 Pillar Framework
6 Practical application
“Fit4Green will introduce an optimization layer on top of existing data centre automation frameworks—integrating them as a modular “plug-in”—to guide the allocation decisions based on the optimized energy model based policies.”—Usually the scheduling system makes the allocation decisions which is part of Pillar 3.
“Energy consumption models will be developed, and validated with real cases, for all ICT components in an IT solution chain, including the effects due to hosting data centres in sites with particular energy related characteristics, like alternative power availability and energy waste/recycle options, etc.”—‘all ICT components’ could be interpreted as data center infrastructure and IT equipment which are part of Pillar 1 and 2, but because the emphasis is on the development of models and not tools there is no impact on pillar 1 and 2 from a practical (solution) point of view.
“Optimizations, based on policy modeling descriptions able to capture the variety of deployment that are possible for a given application or a set of applications, integrated with semantic attributes supporting the evaluation of energy consumption models, will guide the deployments decisions on which / when equipments needs to stay on, where/when applications should be deployed/relocated, also capitalizing on the intrinsic non linear behaviors of energy consumption grow with respect to the load of ICT components.”—‘Applications’ relate to Pillar 4, but the main goal of Fit4Green is a model for describing all possible deployments for an application. Therefore, this item will not impact Pillar 4 from a practical point of view. The deployment decisions relate to the scheduling system which is part of Pillar 3.
“Scenarios will be developed for several IT solutions deployed through there major categories of computing styles: Service/Enterprise Portal (a set of public services and/or private enterprise applications, typically accessible through Web interface, for a large variety of user agents); Grid (a supercomputer system accessible through the Grid middleware); and Cloud (a lab with cloud computing infrastructure based on open systems frameworks)”—This statement means that any solution developed by Fit4Green is only applicable to data centers that work in Service/Enterprise Portal, Grid, or Cloud computing style.
“There will be one pilot site for each computing style. Service/Enterprise Portal, Grid and Cloud pilots will support both single site and federated sites scenarios: multiple collaborating data centres inside the same organization for the Portal pilot; federation of supercomputer systems for Grid and open cloud federation of multiple labs for the Cloud.”—The main target for the project seems to be a federation of data centers.
Table: Example project classification using the 4 Pillar Framework (project names and pillar columns were not preserved; the surviving cell entries are listed in their original order)
- covered (air cooled)
- covered (air cooled custom hardware)
- covered (cloud based and service/enterprise portal oriented)
- federation of data centers (multiple 4 pillar framework for each associated data center)
- covered (connects with utility provider)
- builds on Fit4Green, data center federation (multiple 4 pillar framework for each associated data center)
- cloud computing paradigm
This classification can help to find projects that are interesting for a specific data center. If, for example, one is looking for improvements in Pillar 1, one can remove Fit4Green from the list of projects to look into. Similarly, if one is not in a federation of data centers, one can focus on CoolEMAll and GAMES.
The 4 Pillar Framework can be used to:
- Identify data center specific areas of improvement
- Identify required company resources and stakeholders
- Guide current energy efficiency efforts
- Plan future work
- Present results and plans to stakeholders
- Classify current research efforts

It is applicable to:
- Federation of data centers
- Individual data centers
- Individual data center parts and systems
- Individual data center goals and policies
The 4 Pillar Framework has already shown its practical applicability and continues to play a significant role in the strategic planning of research activities and modernization efforts at BAdW-LRZ.
The work presented here has been carried out within the PRACE project, which has received funding from the European Community’s Seventh Framework Program (FP7/2007-2013) under grant agreements Nos. RI-261557 and RI-283493, and at the Leibniz Supercomputing Centre (BAdW-LRZ) with support of the State of Bavaria, Germany, and the Gauss Centre for Supercomputing (GCS). The authors would like to thank Jeanette Wilde (BAdW-LRZ) and Prof. Dr. Arndt Bode (BAdW-LRZ, TUM) for their valuable comments and support.
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.