Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation

  • Ada Sedova
  • Andreas F. Tillack
  • Arnold Tharrington
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11381)

Abstract

Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important initiative due to large differences in emerging supercomputer architectures. Here we test key kernels from molecular dynamics (MD) to determine whether the OpenACC directive-based programming model can deliver performance within an acceptable range for these types of programs in the HPC setting. We find that for easily parallelizable kernels, performance on the GPU remains within this range. On the CPU, OpenACC-parallelized pairwise distance kernels would not meet the required performance standards on AMD Opteron "Interlagos" processors, but on IBM Power 9 processors performance remains within an acceptable range for small batch sizes. These kernels provide a test for achieving performance portability with compiler directives for problems with memory-intensive components, as are often found in scientific applications.
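
To make the kernel concrete, the sketch below shows a pairwise (Euclidean) distance computation parallelized with OpenACC directives in C. This is a minimal illustration under assumed conventions, not the code used in the study: the row-major N x 3 coordinate layout, the data clauses, and the function name pairwise_distances are choices made for the example only.

/*
 * Minimal sketch (not the authors' code): pairwise Euclidean distances
 * over an assumed N x 3 row-major coordinate array, parallelized with
 * OpenACC. Compile with an OpenACC compiler, e.g. nvc -acc -lm.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void pairwise_distances(const float *restrict coords, /* N x 3, row-major */
                        float *restrict dist,         /* N x N output     */
                        int n)
{
    /* Collapse the doubly nested loop and move data explicitly. */
    #pragma acc parallel loop collapse(2) \
        copyin(coords[0:3*n]) copyout(dist[0:n*n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float dx = coords[3*i + 0] - coords[3*j + 0];
            float dy = coords[3*i + 1] - coords[3*j + 1];
            float dz = coords[3*i + 2] - coords[3*j + 2];
            dist[i*n + j] = sqrtf(dx*dx + dy*dy + dz*dz);
        }
    }
}

int main(void)
{
    const int n = 1024;
    float *coords = malloc(3 * (size_t)n * sizeof *coords);
    float *dist   = malloc((size_t)n * n * sizeof *dist);

    /* Fill with arbitrary coordinates for the illustration. */
    for (int i = 0; i < 3 * n; ++i)
        coords[i] = (float)rand() / RAND_MAX;

    pairwise_distances(coords, dist, n);
    printf("d(0,1) = %f\n", dist[1]);

    free(coords);
    free(dist);
    return 0;
}

The same loop nest compiles for multicore CPUs or GPUs by changing only the compiler target, which is the portability property the kernels in this paper are used to probe.
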

Keywords

Performance portability · OpenACC · Compiler directives · Pairwise distance · Molecular simulation


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Scientific Computing Group, National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, USA
