Abstract
Performance portability has rapidly become one of the key concerns for application developers targeting modern computer architectures. Although there are various programming models that can offer functional portability when moving application code between different devices, it remains an open research question as to whether it is possible to guarantee some degree of performance portability in these situations. Automatic performance tuning approaches have been shown to be effective tools for removing the burden of code optimization from the developer, but somewhat sidestep the issue of performance portability by enabling an environment where code is repeatedly optimized for each architecture individually.
In this work, we present an in-depth analysis of the performance portability of code that has been highly optimized for specific devices via auto-tuning. We perform this analysis across a wide range of modern, many-core architectures from multiple hardware vendors, examining performance portability both across different vendors and between devices from the same vendor. We then demonstrate how the auto-tuning process can be modified to bring performance portability into the equation, in order to automatically generate a single implementation that achieves high efficiency across many different devices.
Acknowledgment
We would like to thank Imagination Technologies for providing funding for this work. We also give thanks to Tom Deakin from the University of Bristol for providing valuable feedback on this paper.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Price, J., McIntosh-Smith, S. (2017). Exploiting Auto-tuning to Analyze and Improve Performance Portability on Many-Core Architectures. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_38
DOI: https://doi.org/10.1007/978-3-319-67630-2_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67629-6
Online ISBN: 978-3-319-67630-2
eBook Packages: Computer Science, Computer Science (R0)