Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

Abstract

The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.

memory hierarchy optimization data reordering computation reordering space-filling curves multi-level blocking 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

REFERENCES

  1. 1.
    D. Callahan, S. Carr, and K. Kennedy, Improving Register Allocation for Subscripted Variables, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 53-65 (June 1990).Google Scholar
  2. 2.
    D. Gannon, W. Jalby, and K. Gallivan, Strategies for Cache and Local Memory Management by Global Program Transformation, J. Parallel Distributed Computing, 5:587-616 (1988).Google Scholar
  3. 3.
    M. S. Lam, E. E. Rothberg, and M. E. Wolf, The Cache Performance and Optimizations of Blocked Algorithms, Proc. Fourth Int'l. Conf. Architectural Support Progr. Lang. Oper. Syst., pp. 63-74 (April 1991).Google Scholar
  4. 4.
    A. K. Porterfield, Software Methods for Improvement of Cache Performance on Super-computer Applications, Ph.D. Dissertation, Rice University, Houston, Texas (May 1989).Google Scholar
  5. 5.
    M. E. Wolf and M. S. Lam, A Data Locality Optimizing Algorithm, Proc. SIGPLAN Conf. Progr. Lang. Design and Implementation, pp. 30-44 (June 1991).Google Scholar
  6. 6.
    J. Ferrante, V. Sarkar, and W. Thrash, On Estimating and Enhancing Cache Effective-ness, Proc. Fourth Workshop on Lang. Compilers for Parallel Computing (August 1991).Google Scholar
  7. 7.
    D. M. Tullsen and S. J. Eggers, Effective Cache Prefetching on Bus-Based Multipro-cessors, ACM Trans. Computer Syst., 13(1):57-88 (February 1995).Google Scholar
  8. 8.
    T. C. Mowry, M. S. Lam, and A. Gupta, Design and Evaluation of a Compiler Algorithm for Prefetching, Proc. Fifth Int'l. Conf. Architectural Support Progr. Lang. Oper. Syst., pp. 62-73 (October 1992).Google Scholar
  9. 9.
    A. C. McKeller and E. G. Coffman, The Organization of Matrices and Matrix Operations in a Paged Multiprogramming Environment, Commun. ACM, 12(3):153-165 (1969).Google Scholar
  10. 10.
    W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, Automatic Program Transformations for Virtual Memory Computers, Proc. Nat'l. Computer Conf., pp. 969-974 (June 1979).Google Scholar
  11. 11.
    J. J. Navarro, E. Garcia, and J. R. Herrero, Proc. Tenth ACM Int'l. Conf. Supercomputing (ICS) (1996).Google Scholar
  12. 12.
    I. Kodukula, N. Ahmed, and K. Pingali, Data-Centric Multi-level Blocking, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 346-357 (June 1997).Google Scholar
  13. 13.
    J. R. Allen and K. Kennedy, Automatic Loop Interchange, Proc. SIGPLAN Symp. Compiler Construction SIGPLAN Notices, 19(6):233-246 (June 1984).Google Scholar
  14. 14.
    K. S. McKinley, S. Carr, and C.-W. Tseng, Improving Data Locality with Loop Transformations, ACM Trans. Progr. Lang. Syst., 18(4):424-453 (July 1996).Google Scholar
  15. 15.
    C. Ding and K. Kennedy, Improving Cache Performance of Dynamic Applications with Computation and Data Layout Transformations, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 229-241 (May 1999).Google Scholar
  16. 16.
    R. Das, D. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy, The Design and Implemen-tation of a Parallel Unstructured Euler Solver Using Software Primitives, AIAA J., 32:489-496 (1994).Google Scholar
  17. 17.
    H. Sagan, Space-Filling Curves, Springer-Verlag, New York (1994).Google Scholar
  18. 18.
    H. Samet, Applications of Spatial Data Structures: Computer Graphics, Image Processing and GIS, Addison-Wesley, New York (1989).Google Scholar
  19. 19.
    J. P Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy, Load Balancing and Data Locality in Adaptive Hierarhcical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity, J. Parallel Distributed Computing (June 1995).Google Scholar
  20. 20.
    M. S. Warren and J. K. Salmon, A Parallel Hashed Oct-Tree N-Body Algorithm, Proc. Supercomputing (November 1993).Google Scholar
  21. 21.
    C. Ou, M. Gunwani, and S. Ranka, Architecture-Independent Locality-Improving Transformations of Computational Graphs Embedded in k-Dimensions, Proc. Int'l. Conf. Supercomputing (1995).Google Scholar
  22. 22.
    M. Parashar and J. C. Browne, On Partitioning Dynamic Adaptive Grid Hierarchies, Proc. Hawaii Conf. Syst. Sci. (January 1996).Google Scholar
  23. 23.
    M. Thottethodi, S. Chatterjee, and A. R. Lebeck, Tuning Strassen's Matrix Multiplication Algorithm for Memory Efficiency, Proc. SC98: High Performance Computing and Networking (November 1998).Google Scholar
  24. 24.
    J. Frens and D. Wise, Auto-blocking Matrix Multiplication or Tracking BLAS3 Performance from Source Code, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 206-216 (June 1997).Google Scholar
  25. 25.
    I. Al-Furaih and S. Ranka, Memory Hierarchy Management for Iterative Graph Structures, Proc. Int'l. Parallel Processing Symp. (March 1998).Google Scholar
  26. 26.
    A. George and G. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall, Englewood Cliffs, New Jersey (1981).Google Scholar
  27. 27.
    E. Cuthill and J. McKee, Reducing the Bandwidth of Sparse Symmetric Matrices, Proc. ACM National Conf., Association of Computing Machinery (1969).Google Scholar
  28. 28.
    S. Sloan, An Algorithm for Profile and Wavefront Reduction of Sparse Matrices, Int'l. J. Numerical Methods Engng., 23:239-251 (1986).Google Scholar
  29. 29.
    N. Mitchell, L. Carter, and J. Ferrante, Localizing Nonaffine Array References, Proc. Parallel Architectures and Compilation Techniques (October 1999).Google Scholar
  30. 30.
    J. Mellor-Crummey, D. Whalley, and K. Kennedy, Improving Memory Hierarchy Performance for Irregular Applications, Proc. ACM Int'l. Conf. Supercomputing, pp. 425-433 (June 1999).Google Scholar
  31. 31.
    H. Prokop, Cache-Oblivious Algorithms, Master's thesis, MIT Department of Electrical Engineering and Computer Science (June 1999).Google Scholar
  32. 32.
    D. Knuth, The Art of Computer Programming Volume 3: Sorting and Searching, Addison-Wesley, New York (1973).Google Scholar
  33. 33.
    B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, CHARMM: A Program for Macromolecular Energy, Minimization and Dynamics Calculations, J. Computational Chemistry, 4:187-217 (1983).Google Scholar
  34. 34.
    G. Karypis and V. Kumar, Parallel Multilevel k-way Partition Scheme for Irregular Graphs, SIAM Review, 41: 278-300 (1999).Google Scholar
  35. 35.
    R. Robey, Personal Communication (September 2000).Google Scholar
  36. 36.
    Y. C. Hu, A. Cox, and W. Zwaenepoel, Improving Fine-Grained Irregular Shared-Memory Benchmarks by Data Reordering, Proc. Supercomputing (November 2000).Google Scholar
  37. 37.
    V. Pai and S. Adve, Code Transformations to Improve Memory Parallelism, Proc. MICRO-32 (November 1999).Google Scholar

Copyright information

© Plenum Publishing Corporation 2001

Authors and Affiliations

  • John Mellor-Crummey
    • 1
  • David Whalley
    • 2
  • Ken Kennedy
    • 1
  1. 1.Department of Computer Science, MS 132Rice UniversityHouston
  2. 2.Computer Science DepartmentFlorida State UniversityTallahassee

Personalised recommendations