
Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies

Published in: International Journal of Parallel Programming

Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semi-automatic optimization approach to demonstrate that current compilers often suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architecture-aware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, and (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operations research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. This representation extends the scalability of polyhedral dependence analysis and allows the (automatic) legality checks to be delayed until the end of a transformation sequence. Our work builds on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler.
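The abstract's central idea, that a polyhedral representation turns sequences of loop transformations into compositions of affine schedules whose legality can be checked once at the end, can be illustrated with a small sketch. The following Python code is purely illustrative and is not the paper's implementation; the names (`INTERCHANGE`, `REVERSE_J`, `transformed_order`) are hypothetical, and a real framework would operate on symbolic polyhedra rather than enumerated points.

```python
# Illustrative sketch: in the polyhedral model, each statement instance is a
# point of an iteration domain, and its execution order is given by an affine
# schedule. Composing loop transformations then reduces to multiplying
# schedule matrices; the combined schedule is applied (and checked) once.

import itertools

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Two elementary transformations on a 2-deep loop nest, as 2x2 matrices.
INTERCHANGE = [[0, 1], [1, 0]]     # swap loops i and j
REVERSE_J   = [[1, 0], [0, -1]]    # reverse the (new) inner loop

def transformed_order(domain, schedule):
    """Enumerate the iteration domain in the order induced by the schedule:
    each point is mapped to its (multidimensional) logical date, and points
    execute in lexicographic order of those dates."""
    def date(point):
        return tuple(sum(row[k] * point[k] for k in range(len(point)))
                     for row in schedule)
    return sorted(domain, key=date)

# A tiny 2x2 iteration domain: {(i, j) | 0 <= i, j < 2}.
domain = list(itertools.product(range(2), range(2)))

# Compose two transformations into one schedule, applied only at the end.
schedule = matmul(REVERSE_J, INTERCHANGE)
print(transformed_order(domain, schedule))
```

The point of the sketch is the composition step: rather than rewriting the loop nest after each transformation (and re-running dependence analysis each time), the transformations accumulate into a single affine schedule, mirroring the deferred legality checking the paper advocates.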



Author information

Corresponding author

Correspondence to Albert Cohen.


Cite this article

Girbal, S., Vasilache, N., Bastoul, C. et al. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies. Int J Parallel Prog 34, 261–317 (2006). https://doi.org/10.1007/s10766-006-0012-3
