Using C++ AMP to Accelerate HPC Applications on Multiple Platforms

  • M. Graham Lopez
  • Christopher Bergstrom
  • Ying Wai Li
  • Wael Elwasif
  • Oscar Hernandez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)


Many high-end HPC systems support accelerators in their compute nodes to target a variety of workloads, including high-performance computing simulations, big data / data analytics codes, and visualization. To program both the CPU cores and the attached accelerators, users now have multiple programming models available, such as CUDA, OpenMP 4, OpenACC, and C++14, but some of these models fall short in their support for C++ on accelerators because they have difficulty supporting advanced C++ features, e.g., templates, class members, loops with iterators, lambdas, and deep copy. Usually, they either rely on unified memory, or the programming language is not aware of accelerators (e.g., C++14). In this paper, we explore a base-language solution called C++ Accelerated Massive Parallelism (AMP), which was developed by Microsoft and implemented by the PathScale ENZO compiler to program GPUs on a variety of HPC architectures, including OpenPOWER and Intel Xeon. We report preliminary, in-progress results using C++ AMP to accelerate a matrix multiplication and a quantum Monte Carlo application kernel, examining its expressiveness and performance using NVIDIA GPUs and the PathScale ENZO compiler. We hope that this preliminary report will provide a data point that informs the functionality needed for future C++ standards to support accelerators with discrete memory spaces.


Keywords: HPC · C++ for Accelerators · C++ AMP · Accelerator programming



This material is based upon work supported by the U.S. Department of Energy, Office of Science. This research used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • M. Graham Lopez (1)
  • Christopher Bergstrom (2)
  • Ying Wai Li (3)
  • Wael Elwasif (1)
  • Oscar Hernandez (1)

  1. Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, USA
  2. PathScale Inc., Wilmington, USA
  3. National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, USA
