
Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies

Published in: International Journal of Parallel Programming

Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semi-automatic optimization approach to demonstrate that current compilers often suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architecture-aware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, and (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operations research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. This representation extends the scalability of polyhedral dependence analysis and allows the (automatic) legality checks to be delayed until the end of a transformation sequence. Our work builds on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler.
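The abstract's central idea, that a polyhedral representation turns sequences of loop transformations into compositions of affine schedules whose legality can be checked once at the end, can be illustrated with a small sketch. The following Python code is purely illustrative and is not the paper's implementation; the names (`INTERCHANGE`, `REVERSE_J`, `transformed_order`) are hypothetical, and a real framework would operate on symbolic polyhedra rather than enumerated points.

```python
# Illustrative sketch: in the polyhedral model, each statement instance is a
# point of an iteration domain, and its execution order is given by an affine
# schedule. Composing loop transformations then reduces to multiplying
# schedule matrices; the combined schedule is applied (and checked) once.

import itertools

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Two elementary transformations on a 2-deep loop nest, as 2x2 matrices.
INTERCHANGE = [[0, 1], [1, 0]]     # swap loops i and j
REVERSE_J   = [[1, 0], [0, -1]]    # reverse the (new) inner loop

def transformed_order(domain, schedule):
    """Enumerate the iteration domain in the order induced by the schedule:
    each point is mapped to its (multidimensional) logical date, and points
    execute in lexicographic order of those dates."""
    def date(point):
        return tuple(sum(row[k] * point[k] for k in range(len(point)))
                     for row in schedule)
    return sorted(domain, key=date)

# A tiny 2x2 iteration domain: {(i, j) | 0 <= i, j < 2}.
domain = list(itertools.product(range(2), range(2)))

# Compose two transformations into one schedule, applied only at the end.
schedule = matmul(REVERSE_J, INTERCHANGE)
print(transformed_order(domain, schedule))
```

The point of the sketch is the composition step: rather than rewriting the loop nest after each transformation (and re-running dependence analysis each time), the transformations accumulate into a single affine schedule, mirroring the deferred legality checking the paper advocates.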



Author information

Corresponding author

Correspondence to Albert Cohen.


Cite this article

Girbal, S., Vasilache, N., Bastoul, C. et al. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies. Int J Parallel Prog 34, 261–317 (2006). https://doi.org/10.1007/s10766-006-0012-3
