Optimized Unrolling of Nested Loops
- Vivek Sarkar
Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
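To make the terminology concrete, the sketch below applies the classical unroll-and-jam transformation to matrix multiply, the kernel named in the abstract. It is only an illustration of the baseline transformation, not the paper's code generation algorithm or its cost model. The function names and the unroll factors (2 for the `i` loop, 2 for the `j` loop, 1 for the `k` loop) are assumptions chosen for the example, and Python is used purely for readability; a compiler would apply the transformation to lower-level code.

```python
def matmul_reference(a, b, n):
    """Plain triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_and_jam(a, b, n):
    """Same computation with the i and j loops each unrolled by 2 and jammed.

    The unroll factors (2, 2, 1) are an assumption for illustration only.
    """
    c = [[0.0] * n for _ in range(n)]
    n2 = n - n % 2  # largest even bound; leftover iterations are handled below
    for i in range(0, n2, 2):
        for j in range(0, n2, 2):
            # Four accumulators stand in for values a compiler could keep
            # in registers across the whole k loop.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Each operand is read once and used twice in the jammed body.
                a0, a1 = a[i][k], a[i + 1][k]
                b0, b1 = b[k][j], b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    # Remainder iterations: the last row and/or column when n is odd.
    for i in range(n):
        for j in range(n):
            if i >= n2 or j >= n2:
                for k in range(n):
                    c[i][j] += a[i][k] * b[k][j]
    return c
```

In the jammed body, four multiply-adds share four operand reads, whereas the plain loop reads two operands per multiply-add. That reuse is the register-locality benefit, and the growth of the loop body together with the separate remainder code is the code-size cost that a choice of unroll factors has to balance.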
International Journal of Parallel Programming
Volume 29, Issue 5, pp. 545-581
- Kluwer Academic Publishers-Plenum Publishers
- loop transformations
- loop unrolling
- unroll factors
- Vivek Sarkar (1)
- Author Affiliations
- 1. IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, New York, 10598