Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

Thoman, Peter; Kofler, Klaus; Studt, Heiko; Thomson, John; Fahringer, Thomas

doi:10.1007/978-3-642-23397-5_43

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6853)

Included in the following conference series: European Conference on Parallel Processing

Abstract

The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for complying applications and platforms, performance portability on such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as type, number and complexity of computational units. To characterize and compare the OpenCL performance of existing and future devices we propose a suite of microbenchmarks, uCLbench.

We present measurements for eight hardware architectures – four GPUs, three CPUs and one accelerator – and illustrate how the results accurately reflect unique characteristics of the respective platform. In addition to measuring quantities traditionally benchmarked on CPUs like arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL like the dynamic branching penalties prevalent on GPUs. We demonstrate how our results can be used to guide algorithm design and optimization for any given platform on an example kernel that represents the key computation of a linear multigrid solver. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.
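The dynamic branching penalty mentioned in the abstract can be illustrated with a toy cost model. The sketch below is not taken from uCLbench; it only assumes the widely documented SIMT behaviour in which a group of threads executing in lockstep (a warp or wavefront) runs every distinct branch path taken by its threads one after another. The function name `warp_cost`, the warp width of 32 and the cycle counts are hypothetical values chosen for illustration.

```python
from typing import Mapping, Sequence


def warp_cost(branch_ids: Sequence[int],
              path_cost: Mapping[int, int],
              warp_width: int = 32) -> int:
    """Abstract cycle count under a simple SIMT divergence model.

    branch_ids[i] is the branch path chosen by thread i.
    Threads are grouped into consecutive warps of `warp_width`;
    each warp pays for every distinct path that any of its threads takes.
    """
    total = 0
    for start in range(0, len(branch_ids), warp_width):
        warp = branch_ids[start:start + warp_width]
        # Divergent paths within one warp are serialized, so their costs add up.
        total += sum(path_cost[p] for p in set(warp))
    return total


# Hypothetical cost of each branch path, in abstract cycles.
costs = {0: 10, 1: 10}

uniform = [0] * 64                       # every thread takes path 0
divergent = [i % 2 for i in range(64)]   # neighbouring threads disagree
aligned = [0] * 32 + [1] * 32            # divergence only between warps

print(warp_cost(uniform, costs))
print(warp_cost(divergent, costs))
print(warp_cost(aligned, costs))
```

In this model the interleaved pattern costs twice as much as the uniform one, while the warp-aligned pattern costs the same as the uniform one. A microbenchmark that varies the branching pattern across work-items can expose this kind of difference on a real device, where the effective group width and the penalty are hardware-specific and have to be measured.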

Author information

Authors and Affiliations

University of Innsbruck, Austria
Peter Thoman, Klaus Kofler, Heiko Studt, John Thomson & Thomas Fahringer

Editor information

Editors and Affiliations

Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Emmanuel Jeannot & Raymond Namyst
Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Jean Roman

About this paper

Cite this paper

Thoman, P., Kofler, K., Studt, H., Thomson, J., Fahringer, T. (2011). Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_43

DOI: https://doi.org/10.1007/978-3-642-23397-5_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5
eBook Packages: Computer Science, Computer Science (R0)
