Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design
The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees functional portability for compliant applications and platforms, performance portability across such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as in the type, number and complexity of their computational units. To characterize and compare the OpenCL performance of existing and future devices we propose a suite of microbenchmarks, uCLbench.
We present measurements for eight hardware architectures – four GPUs, three CPUs and one accelerator – and illustrate how the results accurately reflect unique characteristics of the respective platform. In addition to measuring quantities traditionally benchmarked on CPUs, such as arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL, such as the dynamic branching penalties prevalent on GPUs. Using an example kernel that represents the key computation of a linear multigrid solver, we demonstrate how our results can guide algorithm design and optimization for any given platform. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.
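To make the notion of a dynamic branching penalty concrete, the following is a minimal sketch of a simplified cost model, not the measurement code used by uCLbench. It assumes that work-items execute in lockstep groups of `width` lanes and that a group pays for every branch path taken by at least one of its lanes. The function name `modeled_time`, its parameters, and the branch condition `(gid // granularity) % 2 == 0` are all hypothetical and chosen only for illustration.

```python
def modeled_time(n_items, width, granularity, cost_then, cost_else):
    """Modeled cost of running `if ((gid // granularity) % 2 == 0) A else B`
    over `n_items` work-items.

    Assumption (simplified model): work-items run in lockstep groups of
    `width` lanes, and a group serializes the branch paths, paying for
    each path that at least one of its lanes takes.
    """
    total = 0
    for start in range(0, n_items, width):
        lanes = range(start, min(start + width, n_items))
        # Which of the two paths (0 = then, 1 = else) occur in this group?
        paths = {(gid // granularity) % 2 for gid in lanes}
        if 0 in paths:
            total += cost_then
        if 1 in paths:
            total += cost_else
    return total


def branching_penalty(n_items, width, granularity, cost_then, cost_else):
    """Modeled slowdown relative to a branch pattern aligned to group width,
    where every group takes a single path."""
    uniform = modeled_time(n_items, width, width, cost_then, cost_else)
    actual = modeled_time(n_items, width, granularity, cost_then, cost_else)
    return actual / uniform


if __name__ == "__main__":
    # Sweep the branch granularity for a hypothetical 32-lane device.
    for g in (1, 8, 16, 32, 64):
        print(g, branching_penalty(128, 32, g, 10, 10))
```

In this model the penalty disappears once the branch granularity is a multiple of the group width, which is why sweeping the granularity on real hardware is a plausible way to expose the lockstep width of a device. Real devices may deviate from this model, for example through predication or per-path scheduling overheads.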