Abstract
Over the last decades, supercomputers have become a necessity in science and industry. Huge data centers consume enormous amounts of electricity, and we have reached a point where newer, faster computers must no longer draw more power than their predecessors. Because user demand for compute capability has not declined, the feasibility of exaflop systems is now being studied. Heterogeneous clusters with highly efficient accelerators such as GPUs are one approach to higher efficiency. We present the new L-CSC cluster, a commodity hardware compute cluster dedicated to Lattice QCD (LQCD) simulations at the GSI research facility. L-CSC features a multi-GPU design with four FirePro S9150 GPUs per node, each providing 320 GB/s memory bandwidth and 2.6 TFLOPS peak performance. The high bandwidth makes it ideally suited for memory-bound LQCD computations, while the multi-GPU design ensures superior power efficiency. The November 2014 Green500 list ranked L-CSC as the most power-efficient supercomputer in the world, with 5270 MFLOPS/W in the Linpack benchmark. This paper presents optimizations to our Linpack implementation HPL-GPU and other power efficiency improvements that helped L-CSC achieve this result. It describes our approach to an accurate Green500 power measurement and reveals some problems with the current measurement methodology. Finally, it gives an overview of the Lattice QCD application on L-CSC.
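To make the "memory-bound" argument concrete, the following minimal sketch works only from the per-GPU figures quoted in the abstract. The function names and the example arithmetic intensity of 1 FLOP per byte are illustrative assumptions and do not come from the paper; the roofline-style bound is a generic textbook estimate, not the authors' performance model.

```python
# Figures quoted in the abstract (per FirePro S9150 GPU).
PEAK_TFLOPS_PER_GPU = 2.6
MEM_BANDWIDTH_GBS_PER_GPU = 320.0
GPUS_PER_NODE = 4
GREEN500_MFLOPS_PER_WATT = 5270.0


def machine_balance(peak_tflops, bandwidth_gbs):
    """FLOP per byte of memory traffic at which compute and memory limits meet."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)


def attainable_gflops(intensity_flop_per_byte, peak_tflops, bandwidth_gbs):
    """Roofline-style upper bound: limited by either peak compute or memory bandwidth."""
    compute_limit = peak_tflops * 1e3                       # GFLOPS
    memory_limit = intensity_flop_per_byte * bandwidth_gbs  # GFLOPS
    return min(compute_limit, memory_limit)


balance = machine_balance(PEAK_TFLOPS_PER_GPU, MEM_BANDWIDTH_GBS_PER_GPU)
node_peak_tflops = GPUS_PER_NODE * PEAK_TFLOPS_PER_GPU
node_bandwidth_gbs = GPUS_PER_NODE * MEM_BANDWIDTH_GBS_PER_GPU
gflops_per_watt = GREEN500_MFLOPS_PER_WATT / 1e3

# Hypothetical kernel with 1 FLOP per byte (assumed value, for illustration only).
example_bound = attainable_gflops(1.0, PEAK_TFLOPS_PER_GPU, MEM_BANDWIDTH_GBS_PER_GPU)

print(f"Machine balance per GPU: {balance:.3f} FLOP/byte")
print(f"Aggregate per node: {node_peak_tflops:.1f} TFLOPS, {node_bandwidth_gbs:.0f} GB/s")
print(f"Green500 efficiency: {gflops_per_watt:.2f} GFLOPS/W")
print(f"Bound for a 1 FLOP/byte kernel: {example_bound:.0f} GFLOPS")
```

Any kernel whose arithmetic intensity lies below the machine balance is limited by memory bandwidth rather than peak compute in this simple model, which is why bandwidth is the figure of interest for memory-bound workloads.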
Notes
1. See http://www.gsi.de.
2. See e.g. http://lattice.github.io/quda/.
3. Compute Abstraction Layer is the assembler language of former AMD GPUs.
Acknowledgments
We would like to thank Advanced Micro Devices, Inc. (AMD) and ASUSTeK Computer Inc. (Asus) for their support.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Rohr, D., Bach, M., Nešković, G., Lindenstruth, V., Pinke, C., Philipsen, O. (2015). Lattice-CSC: Optimizing and Building an Efficient Supercomputer for Lattice-QCD and to Achieve First Place in Green500. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_14
DOI: https://doi.org/10.1007/978-3-319-20119-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1
eBook Packages: Computer Science (R0)