Towards an Achievable Performance for the Loop Nests

  • Aniket Shivam
  • Neftali Watkinson
  • Alexandru Nicolau
  • David Padua
  • Alexander V. Veidenbaum
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11882)

Abstract

Numerous code optimization techniques, including loop nest optimizations, have been developed over the last four decades. Loop optimization techniques transform loop nests to improve the performance of the code on a target architecture, including by exposing parallelism. Finding and evaluating an optimal, semantics-preserving sequence of transformations is a complex problem. The search for such a sequence is guided by heuristics and/or analytical models, and there is no way of knowing how close the resulting code comes to optimal performance or whether any headroom for improvement remains.

This paper makes two contributions. First, it uses a comparative analysis of loop optimizations/transformations across multiple compilers to determine how much headroom may exist for each compiler. Second, it presents an approach that characterizes loop nests by their hardware performance counter values and uses Machine Learning to predict which compiler will generate the fastest code for a given loop nest. The prediction is made both for auto-vectorized, serial compilation and for auto-parallelization. Based on these Machine Learning predictions, the results show that the headroom for state-of-the-art compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to 1.71x for the auto-parallelized code.
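
As a rough illustration of the prediction step described above, the following Python sketch trains a classifier on hardware performance counter features and predicts which compiler is likely to produce the fastest code for an unseen loop nest. This is not the authors' implementation; the counter set, compiler labels, and all numbers are hypothetical placeholders.

```python
# Minimal sketch, assuming each loop nest is summarized by a fixed vector of
# hardware performance counter values collected from a baseline run.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per loop nest: hypothetical features such as instructions per cycle,
# L1-miss rate, LLC-miss rate, and branch-miss rate.
X_train = np.array([
    [2.10, 0.030, 0.0010, 0.010],
    [0.85, 0.250, 0.1200, 0.004],
    [1.60, 0.100, 0.0300, 0.020],
    [0.40, 0.400, 0.3000, 0.001],
    [1.90, 0.050, 0.0050, 0.012],
    [0.70, 0.300, 0.1500, 0.002],
])
# Label: the compiler whose generated code was fastest for that loop nest
# in the training measurements (placeholder labels).
y_train = np.array(["icc", "gcc", "clang", "gcc", "icc", "clang"])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the best compiler for an unseen loop nest's counter profile.
new_nest = np.array([[1.40, 0.120, 0.0400, 0.015]])
print(model.predict(new_nest))  # e.g., ['icc']
```

In the setting described in the abstract, one such model would be trained for the auto-vectorized, serial case and another for the auto-parallelized case.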

Acknowledgments

This work was supported by NSF award XPS 1533926.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Aniket Shivam (1)
  • Neftali Watkinson (1)
  • Alexandru Nicolau (1)
  • David Padua (2)
  • Alexander V. Veidenbaum (1)
  1. Department of Computer Science, University of California, Irvine, Irvine, USA
  2. Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, USA
