Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

  • Peter Thoman
  • Klaus Kofler
  • Heiko Studt
  • John Thomson
  • Thomas Fahringer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6853)


The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for complying applications and platforms, performance portability on such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as type, number and complexity of computational units. To characterize and compare the OpenCL performance of existing and future devices we propose a suite of microbenchmarks, uCLbench.

We present measurements for eight hardware architectures (four GPUs, three CPUs and one accelerator) and illustrate how the results accurately reflect the unique characteristics of each platform. In addition to measuring quantities traditionally benchmarked on CPUs, such as arithmetic throughput and the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL, such as the dynamic branching penalties prevalent on GPUs. Using an example kernel that represents the key computation of a linear multigrid solver, we demonstrate how our results can guide algorithm design and optimization for any given platform. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.
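To make the notion of a dynamic branching penalty concrete, the following is a minimal toy model, not the uCLbench code and not a measurement. It assumes that work items run in lockstep groups of a fixed width and that a group must execute every branch path taken by at least one of its members. The function names, the group width of 32 and the three access patterns are illustrative assumptions, not values taken from the paper.

```python
def divergent_passes(takes_branch, simd_width):
    """Count serialized path executions under a simple lockstep model.

    takes_branch: one boolean per work item (True = 'if' path taken).
    simd_width:   number of work items assumed to execute in lockstep.
    """
    passes = 0
    for start in range(0, len(takes_branch), simd_width):
        group = takes_branch[start:start + simd_width]
        # A group pays once for each distinct path taken by any member.
        passes += len(set(group))
    return passes


def branch_penalty(takes_branch, simd_width):
    """Ratio of modeled passes to the divergence-free minimum (one per group)."""
    if not takes_branch:
        raise ValueError("need at least one work item")
    groups = -(-len(takes_branch) // simd_width)  # ceiling division
    return divergent_passes(takes_branch, simd_width) / groups


# 64 work items; the lockstep width of 32 is an assumption for illustration.
patterns = {
    "uniform": [True] * 64,                               # all take the branch
    "by_group": [(i // 32) % 2 == 0 for i in range(64)],  # differs per group
    "alternating": [i % 2 == 0 for i in range(64)],       # differs per item
}

for name, pattern in patterns.items():
    print(name, branch_penalty(pattern, 32))
```

In this model only the alternating pattern is penalized, because both paths occur inside every lockstep group. With a width of 1, a rough stand-in for a scalar CPU core, the modeled penalty disappears for all three patterns. This is the kind of device-dependent behavior that a branching microbenchmark is meant to expose. Real devices may behave differently from this simplified model.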


Keywords (machine-generated, not supplied by the authors): Local Memory, Global Memory, Memory Bandwidth, Work Item, Very Long Instruction Word





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Peter Thoman (1)
  • Klaus Kofler (1)
  • Heiko Studt (1)
  • John Thomson (1)
  • Thomas Fahringer (1)

  1. University of Innsbruck, Austria
