Exploiting Auto-tuning to Analyze and Improve Performance Portability on Many-Core Architectures

Price, James; McIntosh-Smith, Simon

doi:10.1007/978-3-319-67630-2_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Performance portability has rapidly become one of the key concerns for application developers targeting modern computer architectures. Although there are various programming models that can offer functional portability when moving application code between different devices, it remains an open research question as to whether it is possible to guarantee some degree of performance portability in these situations. Automatic performance tuning approaches have been shown to be effective tools for removing the burden of code optimization from the developer, but somewhat sidestep the issue of performance portability by enabling an environment where code is repeatedly optimized for each architecture individually.

In this work, we present an in-depth analysis of the performance portability of code that has been highly optimized for specific devices via auto-tuning. We perform this analysis across a wide range of modern, many-core architectures from multiple hardware vendors, examining performance portability both across different vendors and between devices from the same vendor. We then demonstrate how the auto-tuning process can be modified to bring performance portability into the equation, in order to automatically generate a single implementation that achieves high efficiency across many different devices.
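The idea of bringing performance portability into the tuning objective can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the configuration names, the device names, the runtimes, the definition of efficiency as the best known runtime on a device divided by the runtime of a given configuration, and the choice of worst-case efficiency as the objective. The abstract does not state which objective the authors use.

```python
# Hypothetical runtimes in seconds, indexed as runtimes[config][device].
# Names and numbers are illustrative only, not measurements from the paper.
runtimes = {
    "wg64_unroll1": {"gpu_a": 1.0, "gpu_b": 2.0, "cpu_c": 5.0},
    "wg128_unroll2": {"gpu_a": 1.25, "gpu_b": 1.25, "cpu_c": 5.0},
    "wg16_unroll4": {"gpu_a": 4.0, "gpu_b": 5.0, "cpu_c": 4.0},
}


def best_per_device(runtimes):
    """Conventional auto-tuning: pick the fastest configuration per device."""
    devices = next(iter(runtimes.values())).keys()
    return {d: min(runtimes, key=lambda c: runtimes[c][d]) for d in devices}


def efficiencies(runtimes, config):
    """Efficiency of one configuration on each device (assumed definition):
    best known runtime on that device divided by this configuration's runtime."""
    best = {d: min(r[d] for r in runtimes.values()) for d in runtimes[config]}
    return {d: best[d] / runtimes[config][d] for d in runtimes[config]}


def most_portable(runtimes):
    """Portability-aware tuning (assumed objective): pick the single
    configuration whose worst-case efficiency across devices is highest."""
    return max(runtimes, key=lambda c: min(efficiencies(runtimes, c).values()))


per_device = best_per_device(runtimes)
portable = most_portable(runtimes)
print(per_device)
print(portable, efficiencies(runtimes, portable))
```

With this toy data, each device prefers a different configuration when tuned individually, while the portability-aware objective selects one configuration that is not the fastest everywhere but avoids a large slowdown on any device. Other aggregates, such as a mean over devices, would be equally plausible choices.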

Notes

1. https://github.com/jrprice/jacobi-ocl

Acknowledgment

We would like to thank Imagination Technologies for providing funding for this work. We also give thanks to Tom Deakin from the University of Bristol for providing valuable feedback on this paper.

Author information

Authors and Affiliations

Department of Computer Science, University of Bristol, Bristol, UK
James Price & Simon McIntosh-Smith

Corresponding author

Correspondence to James Price.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum (DKRZ), Hamburg, Hamburg, Germany
Julian M. Kunkel
TITECH, Tokyo, Japan
Rio Yokota
Department of Computer Science, University of Delaware, Newark, Delaware, USA
Michela Taufer
Lawrence Berkeley National Laboratory, Berkeley, California, USA
John Shalf

About this paper

Cite this paper

Price, J., McIntosh-Smith, S. (2017). Exploiting Auto-tuning to Analyze and Improve Performance Portability on Many-Core Architectures. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_38

DOI: https://doi.org/10.1007/978-3-319-67630-2_38
Published: 20 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67629-6
Online ISBN: 978-3-319-67630-2
eBook Packages: Computer Science, Computer Science (R0)
