GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models

  • Tom Deakin
  • James Price
  • Matt Martineau
  • Simon McIntosh-Smith
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)


Many scientific codes consist of memory-bandwidth-bound kernels: their runtime is dominated by the speed at which data can be loaded from memory into the Arithmetic Logic Units before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance.

The choice of one programming model over another should ideally not limit the performance that can be achieved on a device. GPU-STREAM has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such this tool can be used as a kind of Rosetta Stone which provides both a cross-platform and cross-programming model array of results of achievable memory bandwidth.
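The idea of achievable memory bandwidth can be made concrete with a small sketch. The code below assumes a Triad-style kernel (a[i] = b[i] + scalar * c[i]) and the byte-counting convention of McCalpin's original STREAM benchmark, where one pass moves three arrays' worth of data. It is not taken from GPU-STREAM itself: the function names, array size, and repeat count are hypothetical, and because the loop runs in interpreted Python the figure it reports reflects interpreter overhead rather than hardware bandwidth. It is meant only to show how a bandwidth figure is derived from bytes moved and elapsed time.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """Triad-style kernel (assumed here): a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes_moved(n, bytes_per_element=8):
    """Bytes counted for one Triad pass: two arrays read, one array written."""
    return 3 * n * bytes_per_element


def measure_triad_bandwidth(n=100_000, repeats=5, scalar=0.4):
    """Return (best observed GB/s, result array) over several timed repeats.

    The sizes and repeat count are illustrative assumptions, not the
    settings used by any particular benchmark.
    """
    a = array("d", [0.0]) * n  # contiguous 8-byte doubles
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, scalar)
        best = min(best, time.perf_counter() - start)
    # Bandwidth = bytes moved / fastest time, reported in GB/s.
    return triad_bytes_moved(n) / best / 1e9, a


if __name__ == "__main__":
    gbps, _ = measure_triad_bandwidth()
    print(f"Triad: {gbps:.3f} GB/s (interpreter-bound, illustrative only)")
```

Taking the fastest of several repeats, rather than the mean, is a common choice for this kind of measurement because it reduces the influence of one-off system noise on the reported figure.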


Keywords: Programming Model; Application Program Interface; Memory Bandwidth; Parallel Programming Model; General Purpose Graphic Processing Unit



We would like to thank Cray Inc. for providing access to the Cray XC40 supercomputer, Swan, and the Cray CS cluster, Falcon. Our thanks to Codeplay for access to the ComputeCpp SYCL compiler and to Douglas Miles at PGI (NVIDIA) for access to the PGI compiler. We would also like to thank the University of Bristol Intel Parallel Computing Center (IPCC). This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol. Thanks also go to the University of Oxford for access to the Power 8 system.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Tom Deakin (1), corresponding author
  • James Price (1)
  • Matt Martineau (1)
  • Simon McIntosh-Smith (1)

  1. Department of Computer Science, University of Bristol, Bristol, UK
