The Journal of Supercomputing, Volume 24, Issue 1, pp 43–67

Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation

  • P. M. W. Knijnenburg
  • T. Kisuki
  • M. F. P. O'Boyle

Abstract

Loop tiling and unrolling are two important program transformations that exploit locality and expose instruction level parallelism, respectively. However, these transformations are not independent, and each can adversely affect the goal of the other. Furthermore, the best combination varies dramatically from one processor to the next. In this paper, we therefore address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and select the best one by actually executing them and measuring their execution time. We evaluate several iterative strategies based on genetic algorithms, random sampling, and simulated annealing. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of architectures. Finally, we show how to quantitatively trade off the number of profiles needed against the level of optimization that can be reached. In this way, we can reach high levels of optimization within 50 iterations.

Keywords: program optimization, adaptive compilation, program transformation, locality optimization, instruction level parallelism
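
The iterative approach described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a matrix-multiply kernel as the program being tuned, uses random sampling as the only search strategy, and generates each program version as Python source. The names `generate_kernel`, `time_kernel`, and `random_search`, and the `budget` parameter, are hypothetical. Timings in an interpreter mostly reflect interpreter overhead rather than cache behavior, so only the structure of the search carries over.

```python
import random
import time


def generate_kernel(tile, unroll):
    """Generate and compile one program version for a (tile, unroll) pair."""
    indent = " " * 24
    # Unrolled loop body: `unroll` copies of the multiply-add statement.
    body = "\n".join(
        f"{indent}acc += a_i[k + {u}] * b[k + {u}][j]" for u in range(unroll)
    )
    src = f"""
def kernel(a, b, n):
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, {tile}):
        k_end = min(kk + {tile}, n)
        for jj in range(0, n, {tile}):
            j_end = min(jj + {tile}, n)
            for i in range(n):
                a_i, c_i = a[i], c[i]
                for j in range(jj, j_end):
                    acc = c_i[j]
                    k = kk
                    while k + {unroll} <= k_end:
{body}
                        k += {unroll}
                    while k < k_end:  # remainder iterations
                        acc += a_i[k] * b[k][j]
                        k += 1
                    c_i[j] = acc
    return c
"""
    namespace = {}
    exec(src, namespace)
    return namespace["kernel"]


def time_kernel(kernel, a, b, n):
    """Execute one version and return its measured wall-clock time."""
    start = time.perf_counter()
    kernel(a, b, n)
    return time.perf_counter() - start


def random_search(n, tile_sizes, unroll_factors, budget, seed=0):
    """Sample up to `budget` (tile, unroll) pairs and keep the fastest."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    space = [(t, u) for t in tile_sizes for u in unroll_factors]
    best = None
    for tile, unroll in rng.sample(space, min(budget, len(space))):
        elapsed = time_kernel(generate_kernel(tile, unroll), a, b, n)
        if best is None or elapsed < best[0]:
            best = (elapsed, tile, unroll)
    return best


if __name__ == "__main__":
    elapsed, tile, unroll = random_search(
        n=32, tile_sizes=[4, 8, 16, 32], unroll_factors=[1, 2, 4, 8], budget=10
    )
    print(f"best tile={tile} unroll={unroll} time={elapsed:.4f}s")
```

The `budget` argument plays the role of the number of profiles: raising it explores more of the combined space at the cost of more executions, which is the trade-off the abstract refers to. A genetic algorithm or simulated annealing would replace the sampling loop in `random_search` while reusing the same generate-and-measure step.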

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • P. M. W. Knijnenburg (1)
  • T. Kisuki (1)
  • M. F. P. O'Boyle (2)
  1. Leiden Institute of Advanced Computer Science, Leiden University, Leiden, the Netherlands
  2. Institute for Computing Systems Architecture, Edinburgh University, Edinburgh, UK
