A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

Abstract

Today’s compilers have a plethora of optimizations-transformations to choose from, and the correct choice, order as well parameters of transformations have a significant/large impact on performance; choosing the correct order and parameters of optimizations has been a long standing problem in compilation research, which until now remains unsolved; the separate sub-problems optimization gives a different schedule/binary for each sub-problem and these schedules cannot coexist, as by refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude and thus an efficient solution is now capable to be found. The transformations are the following: loop tiling (including the number of the levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated over iterative compilation and gcc/icc compilers, on both embedded and general purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

References

  1. 1.

    Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, CGO ’03. IEEE Computer Society, Washington, DC, pp 204–215. http://dl.acm.org/citation.cfm?id=776261.776284

  2. 2.

    Cooper KD, Subramanian D, Torczon L (2001) Adaptive optimizing compilers for the 21st Century. J Supercomput 23:2002

    MATH  Google Scholar 

  3. 3.

    Kisuki T, Knijnenburg PMW, O’Boyle MFP, Bodin F, Wijshoff HAG (1999) A feasibility study in iterative compilation. In: Proceedings of the Second International Symposium on High Performance Computing, ISHPC ’99. Springer, London, UK, pp 121–132. http://dl.acm.org/citation.cfm?id=646347.690219

  4. 4.

    Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1–36. doi:10.1145/1509864.1509865

    Article  Google Scholar 

  5. 5.

    Kulkarni P, Hines S, Hiser J, Whalley D, Davidson J, Jones D (2004) Fast searches for effective optimization phase sequences. SIGPLAN Not 39(6):171–182. doi:10.1145/996893.996863

    Article  Google Scholar 

  6. 6.

    Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems, CASES ’11. ACM, New York, NY, pp 65–74. doi:10.1145/2038698.2038711

  7. 7.

    Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA ’02. Springer, London, pp 41–50. http://dl.acm.org/citation.cfm?id=646053.677574

  8. 8.

    Stephenson M, Amarasinghe S, Martin M, O’Reilly UM (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90. doi:10.1145/780822.781141

    Article  Google Scholar 

  9. 9.

    Tartara M, Crespi Reghizzi S (2013) Continuous learning of compiler heuristics. ACM Trans Archit Code Optim 9(4):46:1–46:25. doi:10.1145/2400682.2400705

  10. 10.

    Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’06. IEEE Computer Society, Washington, DC, pp 295–305. doi:10.1109/CGO.2006.37

  11. 11.

    Almagor L, Cooper KD, Grosul A, Harvey TJ, Reeves SW, Subramanian D, Torczon L, Waterman T (2004) Finding effective compilation sequences. SIGPLAN Not 39(7):231–239. doi:10.1145/998300.997196

    Article  Google Scholar 

  12. 12.

    Cooper KD, Schielke PJ, Subramanian D (1999) Optimizing for reduced code space using genetic algorithms. SIGPLAN Not 34(7):1–9. doi:10.1145/315253.314414

    Article  Google Scholar 

  13. 13.

    Cooper KD, Grosul A, Harvey TJ, Reeves S, Subramanian D, Torczon L, Waterman T (2005) ACME: adaptive compilation made efficient. SIGPLAN Not 40(7):69–77. doi:10.1145/1070891.1065921

    Article  Google Scholar 

  14. 14.

    Cooper KD, Grosul A, Harvey TJ, Reeves S, Subramanian D, Torczon L, Waterman T (2006) Exploring the structure of the space of compilation sequences using randomized search algorithms. J Supercomput 36(2):135–151. doi:10.1007/s11227-006-7954-5

    Article  Google Scholar 

  15. 15.

    Kulkarni PA, Whalley DB, Tyson GS (2007) Evaluating heuristic optimization phase order search algorithms. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07. IEEE Computer Society, Washington, DC, pp 157–169. doi:10.1109/CGO.2007.9

  16. 16.

    Chen C, Chame J, Hall M (2005) Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp 111–122. doi:10.1109/CGO.2005.10

  17. 17.

    Haneda M, Khnijnenburg PMW, Wijshoff HAG (2005) Automatic selection of compiler options using non-parametric inferential statistics. In: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT ’05. IEEE Computer Society, Washington, DC, pp 123–132. doi:10.1109/PACT.2005.9

  18. 18.

    Cavazos J, Fursin G, Agakov F, Bonilla E, O’Boyle MFP, Temam O (2007) Rapidly selecting good compiler optimizations using performance counters. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07. IEEE Computer Society, Washington, DC, pp 185–197. doi:10.1109/CGO.2007.32

  19. 19.

    de Mesmay F, Voronenko Y, Püschel M (2010) Offline library adaptation using automatically generated heuristics. In: International Parallel and Distributed Processing Symposium (IPDPS), pp 1–10

  20. 20.

    Dubach C, Jones TM, Bonilla EV, Fursin G, O’Boyle MFP (2009) Portable compiler optimisation across embedded programs and microarchitectures using machine learning. In: Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42. ACM, New York, NY, pp 78–88. doi:10.1145/1669112.1669124

  21. 21.

    Knijnenburg PMW, Kisuki T, Gallivan K, O’Boyle MFP (2004) The effect of cache models on iterative compilation for combined tiling and unrolling. J CCPE 16(2–3):247–270

    Google Scholar 

  22. 22.

    Kim D, Renganarayanan L, Rostron D, Rajopadhye S, Strout MM (2007) Multi-level Tiling: M for the Price of One. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC ’07. ACM, New York, NY, pp 51:1–51:12. doi:10.1145/1362622.1362691

  23. 23.

    Renganarayanan L, Kim D, Rajopadhye S, Strout MM (2007) Parameterized tiled loops for free. SIGPLAN Not 42(6):405–414. doi:10.1145/1273442.1250780

    Article  Google Scholar 

  24. 24.

    Fursin G, O’Boyle MFP, Knijnenburg PMW (2002) Evaluating iterative compilation. In: Languages and Compilers for Parallel Computing, 15th Workshop, (LCPC), Revised Papers. College Park, MD, , pp 362–376

  25. 25.

    Leather H, Bonilla E, O’Boyle M (2009) Automatic feature generation for machine learning based optimizing compilation. In: Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’09. IEEE Computer Society, Washington, DC, pp 81–91. doi:10.1109/CGO.2009.21

  26. 26.

    Stephenson M, Amarasinghe S (2005) Predicting Unroll Factors Using Supervised Classification. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’05. IEEE Computer Society, Washington, DC, pp 123–134. doi:10.1109/CGO.2005.29

  27. 27.

    Park E, Pouche LN, Cavazos J, Cohen A, Sadayappan P (2011) Predictive Modeling in a Polyhedral Optimization Space. In: Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’11. IEEE Computer Society, Washington, DC, pp 119–129. http://dl.acm.org/citation.cfm?id=2190025.2190059

  28. 28.

    Jantz MR, Kulkarni PA (2013) Exploiting phase inter-dependencies for faster iterative compiler optimization phase order searches. In: Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES ’13. IEEE Press, Piscataway, NJ, pp 7:1–7:10. http://dl.acm.org/citation.cfm?id=2555729.2555736

  29. 29.

    Kulkarni S, Cavazos J (2012) Mitigating the compiler optimization phase-ordering problem using machine learning. SIGPLAN Not 47(10):147–162. doi:10.1145/2398857.2384628

    Article  Google Scholar 

  30. 30.

    Parello D, Temam O, Cohen A, Verdun JM (2004) Towards a Systematic, Pragmatic and Architecture-Aware Program Optimization Process for Complex Processors. In: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC ’04. IEEE Computer Society, Washington, DC, p 15. doi:10.1109/SC.2004.61

  31. 31.

    Rong H, Douillet A, Gao GR (2005) Register allocation for software pipelined multi-dimensional loops. SIGPLAN Not 40(6):154–167. doi:10.1145/1064978.1065030

    Article  Google Scholar 

  32. 32.

    Hack S, Goos G (2006) Optimal register allocation for SSA-form Programs in polynomial time. Inf Process Lett 98(4):150–155. doi:10.1016/j.ipl.2006.01.008

    MathSciNet  Article  MATH  Google Scholar 

  33. 33.

    Nagarakatte SG, Govindarajan R (2007) Register Allocation and Optimal Spill Code Scheduling in Software Pipelined Loops Using 0-1 Integer Linear Programming Formulation. In: Proceedings of the 16th International Conference on Compiler Construction, CC’07. Springer, Berlin, Heidelberg, pp 126–140. http://dl.acm.org/citation.cfm?id=1759937.1759949

  34. 34.

    Sarkar V, Barik R (2007) Extended Linear Scan: An Alternate Foundation for Global Register Allocation. Compiler Construction, 16th International Conference. Proceedings. Braga, Portugal, pp 141–155

  35. 35.

    Smith MD, Ramsey N, Holloway G (2004) A generalized algorithm for graph-coloring register allocation. SIGPLAN Not 39(6):277–288. doi:10.1145/996893.996875

    Article  Google Scholar 

  36. 36.

    Wang L, Yang X, Xue J, Deng Y, Yan X, Tang T, Nguyen QH (2008) Optimizing scientific application loops on stream processors. SIGPLAN Not 43(7):161–170. doi:10.1145/1379023.1375679

    Article  Google Scholar 

  37. 37.

    Baradaran N, Diniz PC A register allocation algorithm in the presence of scalar replacement for fine-grain configurable architectures. arXiv:0710.4702

  38. 38.

    Lozano RC, Schulte C Survey on Combinatorial Register Allocation and Instruction Scheduling. arXiv:1409.7628

  39. 39.

    Sherwood T, Calder B, Emer J (1999) Reducing cache misses using hardware and software page placement. In: Proceedings of the 13th international conference on Supercomputing, ICS ’99. ACM, New York, NY, pp 155–164. doi:10.1145/305138.305189

  40. 40.

    Cacaval C, Padua DA (2003) Estimating cache misses and locality using stack distances. In: Proceedings of the 17th annual international conference on Supercomputing, ICS ’03. ACM, New York, NY, pp 150–159. doi:10.1145/782814.782836

  41. 41.

    Ghosh S, Martonosi M, Malik S (1997) Cache miss equations: an analytical representation of cache misses. In: Proceedings of the 11th international conference on Supercomputing, ICS ’97. ACM, New York, NY, pp 317–324. doi:10.1145/263580.263657

  42. 42.

    Song F, Moore S, Dongarra J (2007) L2 cache modeling for scientific applications on chip multi-processors. In: Parallel Processing, International Conference on 51. doi:10.1109/ICPP.2007.52

  43. 43.

    Hankins RA, Patel JM (2003) Data morphing: an adaptive, cache-conscious storage technique. In: Proceedings of the 29th international conference on Very large data bases—vol 29, VLDB ’2003, VLDB Endowment, 2003, pp 417–428. http://dl.acm.org/citation.cfm?id=1315451.1315488

  44. 44.

    Franz M, Kistler T (1998) Splitting data objects to increase cache utilization. Department of Information and Computer Science University of California, Tech. rep

  45. 45.

    Rubin S, Bodík R, Chilimbi T (2002) An efficient profile-analysis framework for data-layout optimizations. SIGPLAN Not 37:140–153. doi:10.1145/565816.503287

    Article  Google Scholar 

  46. 46.

    Chang SK (2003) Data structures and algorithms, vol 13 of series on software engineering and knowledge engineering. World Scientific, Singapore

  47. 47.

    Kelefouras V, Kritikakou A, Goutis C (2015) A methodology for speeding up loop kernels by exploiting the software information and the memory architecture. Comput Lang Syst Struct 41:21–41. http://dblp.uni-trier.de/db/journals/cl/cl41.html

  48. 48.

    Whaley RC, Petitet A, Dongarra JJ (2001) Automated Empirical Optimization of Software and the ATLAS Project. Parallel Comput 27(1–2):3–35

    Article  MATH  Google Scholar 

  49. 49.

    Kelefouras V, Kritikakou A, Mporas I, Vasileios K (2016) A high-performance matrix-matrix multiplication methodology for CPU and gpu architectures. J Supercomput 72(3):804–844. doi:10.1007/s11227-015-1613-7

  50. 50.

    Kelefouras VI, Kritikakou A, Papadima E, Goutis CE (2015) A methodology for speeding up matrix vector multiplication for single/multi-core architectures. J Supercomput 71(7):2644–2667. http://dblp.uni-trier.de/db/journals/tjs/tjs71.html

  51. 51.

    O. S. University, Polybench/c benchmark suite (2012). http://web.cs.ucla.edu/~pouchet/software/polybench/

  52. 52.

    Kelefouras VI, Kritikakou A, Goutis C (2014) A methodology for speeding up edge and line detection algorithms focusing on memory architecture utilization. Supercomput Springer. doi:10.1007/s11227-013-1049-x

    Google Scholar 

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Vasilios Kelefouras.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kelefouras, V. A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details. Computing 99, 865–888 (2017). https://doi.org/10.1007/s00607-016-0535-4

Download citation

Keywords

  • Loop unroll
  • Loop tiling
  • Scalar replacement
  • Register allocation
  • Data reuse
  • Cache
  • Loop transformations
  • Iterative compilation

Mathematics Subject Classification

  • 68N20 Compilers and interpreters