Porting and scaling OpenACC applications on massively-parallel, GPU-accelerated supercomputers
- First Online:
- Cite this article as:
- Hart, A., Ansaloni, R. & Gray, A. Eur. Phys. J. Spec. Top. (2012) 210: 5. doi:10.1140/epjst/e2012-01634-y
- 449 Downloads
An increasing number of massively-parallel supercomputers are based on heterogeneous node architectures combining traditional, powerful multicore CPUs with energy-efficient GPU accelerators. Such systems offer high computational performance with modest power consumption. As the industry trend of closer integration of CPU and GPU silicon continues, these architectures are a possible template for future exascale systems. Given the longevity of large-scale parallel HPC applications, it is important that there is a mechanism for easy migration to such hybrid systems. The OpenACC programming model offers a directive-based method for porting existing codes to run on hybrid architectures. In this paper, we describe our experiences in porting the Himeno benchmark to run on the Cray XK6 hybrid supercomputer. We describe the OpenACC programming model and the changes needed in the code, both to port the functionality and to tune the performance. Despite the additional PCIe-related overheads when transferring data from one GPU to another over the Cray Gemini interconnect, we find the application gives very good performance and scales well. Of particular interest is the facility to launch OpenACC kernels and data transfers asynchronously, which speeds the Himeno benchmark by 5%–10%. Comparing performance with an optimised code on a similar CPU-based system (using 32 threads per node), we find the OpenACC GPU version to be just under twice the speed in a node-for-node comparison. This speed-up is limited by the computational simplicity of the Himeno benchmark and is likely to be greater for more complicated applications.