Exploiting Auto-tuning to Analyze and Improve Performance Portability on Many-Core Architectures

  • Conference paper
High Performance Computing (ISC High Performance 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Abstract

Performance portability has rapidly become one of the key concerns for application developers targeting modern computer architectures. Although there are various programming models that can offer functional portability when moving application code between different devices, it remains an open research question as to whether it is possible to guarantee some degree of performance portability in these situations. Automatic performance tuning approaches have been shown to be effective tools for removing the burden of code optimization from the developer, but somewhat sidestep the issue of performance portability by enabling an environment where code is repeatedly optimized for each architecture individually.

In this work, we present an in-depth analysis of the performance portability of code that has been highly optimized for specific devices via auto-tuning. We perform this analysis across a wide range of modern, many-core architectures from multiple hardware vendors, examining performance portability both across different vendors and between devices from the same vendor. We then demonstrate how the auto-tuning process can be modified to bring performance portability into the equation, in order to automatically generate a single implementation that achieves high efficiency across many different devices.
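One way to read "bringing performance portability into the equation" is as a change of tuning objective: instead of picking the fastest configuration for each device separately, the tuner scores each configuration by how close it comes to the per-device best on every device and keeps the one with the best overall score. The abstract does not say which objective the authors use, so the sketch below is only an illustration under assumptions: it takes worst-case efficiency as the score, and the configuration names, device names, and runtimes are hypothetical placeholders rather than measurements.

```python
# Hypothetical runtimes in seconds for three tuning configurations on two
# devices. All names and numbers are placeholders, not measured results.
runtimes = {
    "config_a": {"device_x": 1.0, "device_y": 4.0},
    "config_b": {"device_x": 2.0, "device_y": 2.0},
    "config_c": {"device_x": 1.25, "device_y": 2.5},
}


def efficiencies(runtimes):
    """Return each configuration's efficiency on each device.

    Efficiency is the best runtime seen on that device divided by the
    configuration's own runtime, so 1.0 means "as fast as the best".
    """
    devices = list(next(iter(runtimes.values())))
    best = {d: min(r[d] for r in runtimes.values()) for d in devices}
    return {
        config: {d: best[d] / r[d] for d in devices}
        for config, r in runtimes.items()
    }


def most_portable(runtimes):
    """Pick the configuration with the highest worst-case efficiency.

    Assumption: worst-case efficiency is the portability score. A mean
    or harmonic mean across devices would be an equally plausible choice.
    """
    eff = efficiencies(runtimes)
    return max(eff, key=lambda config: min(eff[config].values()))


if __name__ == "__main__":
    print(most_portable(runtimes))
```

With these placeholder numbers, config_a and config_b are each the fastest on one device but reach only half of the best performance on the other, while config_c is the fastest on neither yet stays within 80% of the best on both, so it is the one selected.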

Notes

  1. https://github.com/jrprice/jacobi-ocl

Acknowledgment

We would like to thank Imagination Technologies for providing funding for this work. We also give thanks to Tom Deakin from the University of Bristol for providing valuable feedback on this paper.

Author information

Corresponding author

Correspondence to James Price.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Price, J., McIntosh-Smith, S. (2017). Exploiting Auto-tuning to Analyze and Improve Performance Portability on Many-Core Architectures. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_38

  • DOI: https://doi.org/10.1007/978-3-319-67630-2_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67629-6

  • Online ISBN: 978-3-319-67630-2

  • eBook Packages: Computer Science, Computer Science (R0)
