PMBS 2013: High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, pp. 217–238
Tuning HipGISAXS on Multi and Many Core Supercomputers
Abstract
With the continual evolution of multi- and many-core architectures, there is a constant need for architecture-specific tuning of application codes in order to achieve computational performance and energy efficiency close to the theoretical peaks of these architectures. In this paper, we present the optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on several massively parallel, state-of-the-art supercomputers based on multi- and many-core processors. In particular, we target clusters of general-purpose multi-core processors, such as Intel Sandy Bridge and AMD Magny-Cours, as well as many-core accelerators, such as Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies for these platforms, and cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters, dynamically selecting their optimal values to ensure high performance and high efficiency.
Keywords
Thread Block · Many Integrated Core · Strong Scaling · OpenMP Thread · Kernel Fusion
References
- 1. Tesla Kepler GPU Accelerators. Datasheet (2012)
- 2. Intel Xeon Phi Coprocessor: Developer's Quick Start Guide, Version 1.5. White Paper (2013)
- 3. Performance Application Programming Interface (PAPI) (2013), http://icl.cs.utk.edu/papi
- 4. Top500 Supercomputers (June 2013), http://www.top500.org
- 5. Chourou, S., Sarje, A., Li, X., Chan, E., Hexemer, A.: HipGISAXS: A High Performance Computing Code for Simulating Grazing Incidence X-Ray Scattering Data. Submitted to the Journal of Applied Crystallography (2013)
- 6. Intel Corp.: Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual (September 2012)
- 7. Kim, C., Satish, N., Chhugani, J., et al.: Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology. Tech. Rep. (2011)
- 8. Pommier, J.: SIMD Implementation of sin, cos, exp and log. Tech. Rep. (2007), http://gruntthepeon.free.fr/ssemath
- 9. Sarje, A., Li, X., Chourou, S., Chan, E., Hexemer, A.: Massively Parallel X-ray Scattering Simulations. In: Supercomputing (SC 2012) (2012)
- 10. Satish, N., Kim, C., Chhugani, J., et al.: Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? SIGARCH Computer Architecture News 40(3), 440–451 (2012), http://doi.acm.org/10.1145/2366231.2337210