F. E. Allen and J. Cocke, A catalogue of optimizing transformations, in Design and Optimization of Compilers, Prentice-Hall, pp. 1-30 (1972).
J. J. Dongarra and A. R. Hinds, Unrolling Loops in Fortran, Software-Practice and Experience 9(3):219-226 (March 1979).
J. A. Fisher, J. R. Ellis, J. C. Ruttenberg, and A. Nicolau, Parallel Processing: A Smart Compiler and a Dumb Machine, Proc. ACM Symp. Compiler Construction, pp. 37-47 (June 1984).
D. F. Bacon, S. L. Graham, and O. J. Sharp, Compiler Transformations for High-Performance Computing, ACM Computing Surveys 26(4):345-420 (December 1994).
Steve Carr and Ken Kennedy, Scalar Replacement in the Presence of Conditional Control Flow, Software-Practice and Experience (1):51-77 (January 1994).
Michael J. Alexander, Mark W. Bailey, Bruce R. Childers, Jack W. Davidson, and Sanjay Jinturkar, Memory bandwidth optimizations for wide-bus machines, Proc. 26th Hawaii Int'l. Conf. Syst. Sci., Wailea, Hawaii, pp. 466-475 (January 1993).
T. C. Mowry, Tolerating Latency Through Software-Controlled Data Prefetching, Ph.D. thesis, Stanford University (March 1994).
Mauricio Breternitz, Michael Lai, Vivek Sarkar, and Barbara Simons, Compiler Solutions for the Stale-Data and False-Sharing Problems, Technical report, TR 03.466, IBM Santa Teresa Laboratory (April 1993).
Steve Carr and Ken Kennedy, Improving the Ratio of Memory Operations to Floating-Point Operations in Loops, ACM TOPLAS 16(6) (November 1994).
Jack W. Davidson and Sanjay Jinturkar, Aggressive Loop Unrolling in a Retargetable, Optimizing Compiler, in Compiler Construction, Proc. Sixth Int'l. Conf., Linkoping, Sweden, Vol. 1060, Lecture Notes in Computer Science, Springer-Verlag, New York (April 1996).
David Callahan, Steve Carr, and Ken Kennedy, Improving Register Allocation for Subscripted Variables, Proc. ACM SIGPLAN Conf. Prog. Lang. Design and Implementation, White Plains, New York, pp. 53-65 (June 1990).
S. Carr and Y. Guan, Unroll-and-Jam Using Uniformly Generated Sets, Proc. MICRO-30, pp. 349-357 (December 1997).
Allan K. Porterfield, Software Methods for Improvement of Cache Performance on Supercomputer Applications, Ph.D. thesis, Rice University, Rice COMP TR89-93 (May 1989).
Michael E. Wolf and Monica S. Lam, A Data Locality Optimization Algorithm, Proc. ACM SIGPLAN Symp. Progr. Lang. Design and Implementation, pp. 30-44 (June 1991).
Vivek Sarkar, Automatic Selection of High Order Transformations in the IBM XL Fortran Compilers, IBM J. Res. Dev. 41(3) (May 1997).
Michael J. Wolfe, Optimizing Supercompilers for Supercomputers, Pitman, London and The MIT Press, Cambridge, Massachusetts (1989). In the series, Research Monographs in Parallel and Distributed Computing.
Vivek Sarkar and Radhika Thekkath, A General Framework for Iteration-Reordering Loop Transformations, Proc. ACM SIGPLAN Conf. Prog. Lang. Design and Implementation, pp. 175-187 (June 1992).
Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash, On Estimating and Enhancing Cache Effectiveness, Lecture Notes in Computer Science (589):328-343 (1991). Proc. Fourth Int'l. Workshop Lang. Compilers for Parallel Computing, Santa Clara, California (August 1991).
B. Ramakrishna Rau, Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops, Proc. 27th Ann. Int'l. Symp. Microarchitecture, San Jose, California, pp. 63-74 (November 1994).
Vivek Sarkar and Barbara Simons, Don't Waste Those Cycles: An In-Depth Look at Scheduling Instructions in Basic Blocks and Loops, Video Lecture in University Video Communication's Distinguished Lecture Series IX (August 1994).
Vivek Sarkar, Determining Average Program Execution Times and their Variance, Proc. SIGPLAN Conf. Prog. Lang. Design and Implementation 24(7):298-312 (July 1989).
Vivek Sarkar, Automatic Partitioning of a Program Dependence Graph into Parallel Tasks, IBM J. Res. Dev.
The Standard Performance Evaluation Corporation, SPEC CPU95 Benchmarks, http://open.specbench.org/osg/cpu95/ (1997).
IBM Corporation, POWER2 and PowerPC, Special issue of IBM J. Res. Dev. 38(5):489-648 (September 1994).
Barbara Simons, Vivek Sarkar, Mauricio Breternitz Jr., and Michael Lai, An Optimal Asynchronous Scheduling Algorithm for Software Cache Consistency, Proc. Hawaii Int'l. Conf. Syst. Sci. (January 1994).
Max Hailperin, Improving the Ratio of Memory Operations to Floating-Point Operations in Loops, Computing Reviews. A copy of the review can be found in the ACM digital library at http://www.acm.org/pubs/citations/journals/toplas/1994-16-6/p1768-carr/.
S. Weiss and J. E. Smith, A Study of Scalar Compilation Techniques for Pipelined Supercomputers, Proc. Second Int'l. Conf. Architectural Support Progr. Lang. Oper. Syst. (ASPLOS), pp. 105-109 (October 1987).
Reese B. Jones and Vicki H. Allan, Software Pipelining: An Evaluation of Enhanced Pipelining, Proc. 24th Ann. Int'l. Symp. Microarchitecture, pp. 82-92 (December 1990).
Bogong Su, Shiyuan Ding, Jian Wang, and Jinshi Xia, GURPR-A Method for Global Software Pipelining, Proc. 20th Ann. Int'l. Symp. Microarchitecture, pp. 88-96 (December 1986).
Daniel M. Lavery and Wen-mei W. Hwu, Unrolling-Based Optimizations for Modulo Scheduling, Proc. MICRO-28, pp. 327-337 (December 1995).