Abstract
The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for compliant applications and platforms, performance portability across such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as in the type, number and complexity of their computational units. To characterize and compare the OpenCL performance of existing and future devices, we propose a suite of microbenchmarks, uCLbench.
We present measurements for eight hardware architectures (four GPUs, three CPUs and one accelerator) and illustrate how the results accurately reflect unique characteristics of the respective platforms. In addition to measuring quantities traditionally benchmarked on CPUs, such as arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL, such as the dynamic branching penalties prevalent on GPUs. We demonstrate how our results can be used to guide algorithm design and optimization for any given platform, using an example kernel that represents the key computation of a linear multigrid solver. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.
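To make the notion of a dynamic branching penalty concrete, the following is a minimal sketch of a simplified cost model, not the uCLbench implementation. It assumes that work-items run in lockstep groups and that a group executes, one after another, every distinct branch taken by any of its members. The function name `divergence_cost`, the group width of 32 and the unit cost per branch are illustrative assumptions, not values from the paper.

```python
def divergence_cost(branch_ids, group_width, path_cost=1):
    """Modeled cost of running work-items in lockstep groups.

    branch_ids[i] is the branch taken by work-item i. Each group of
    `group_width` consecutive work-items executes every distinct branch
    that any of its members takes, one after another (simplified model).
    """
    if group_width <= 0:
        raise ValueError("group_width must be positive")
    total = 0
    for start in range(0, len(branch_ids), group_width):
        group = branch_ids[start:start + group_width]
        # A group pays once for each distinct branch its members take.
        total += len(set(group)) * path_cost
    return total


# 64 work-items with a hypothetical lockstep width of 32.
uniform = [0] * 64                         # every work-item takes branch 0
per_group = [0] * 32 + [1] * 32            # branches differ only between groups
interleaved = [i % 2 for i in range(64)]   # neighbours take different branches

base = divergence_cost(uniform, 32)
print(divergence_cost(per_group, 32) / base)    # 1.0: no divergence inside a group
print(divergence_cost(interleaved, 32) / base)  # 2.0: both branches run in each group
```

Under this model, the same branch mix costs nothing extra when it is aligned to group boundaries and doubles the cost when it alternates between neighbouring work-items. A microbenchmark that measures such penalties on real hardware would time kernels with access patterns of this kind rather than compute the cost analytically.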
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thoman, P., Kofler, K., Studt, H., Thomson, J., Fahringer, T. (2011). Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_43
DOI: https://doi.org/10.1007/978-3-642-23397-5_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5
eBook Packages: Computer Science, Computer Science (R0)