Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

Abstract

The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize the multi-level memory hierarchies that help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves, and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For the two particle codes we studied, the most effective reorderings reduced overall execution time by factors of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
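As a rough illustration of what coordinated data and computation reordering along a space-filling curve can look like, the sketch below sorts 2D particles by a Morton (Z-order) key and then renumbers and sorts an interaction list to match. It is a minimal sketch under stated assumptions, not the authors' implementation: the Morton curve is chosen here only for brevity, and the function names (`interleave_bits`, `morton_permutation`, `reorder`), the unit-square coordinate range, and the pair-list representation are all hypothetical.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) key: interleave the low `bits` bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


def morton_permutation(positions, bits=16):
    """Return perm with perm[new_index] = old_index, sorted along the curve.

    Assumes each position is an (x, y) pair of floats in [0, 1).
    """
    scale = 1 << bits

    def key(i):
        x, y = positions[i]
        return interleave_bits(int(x * scale), int(y * scale), bits)

    return sorted(range(len(positions)), key=key)


def reorder(positions, pairs, bits=16):
    """Reorder particle data, then reorder the interaction list to match."""
    perm = morton_permutation(positions, bits)
    new_index = [0] * len(positions)
    for new, old in enumerate(perm):
        new_index[old] = new

    # Data reordering: particles that are close in space become close in memory.
    new_positions = [positions[old] for old in perm]

    # Computation reordering: renumber each pair, then visit pairs in index
    # order so consecutive iterations touch nearby particles.
    new_pairs = sorted(
        (min(new_index[a], new_index[b]), max(new_index[a], new_index[b]))
        for a, b in pairs
    )
    return new_positions, new_pairs


if __name__ == "__main__":
    pts = [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]
    print(reorder(pts, [(0, 1), (2, 3), (0, 3)], bits=4))
```

Sorting the pair list after renumbering is what couples the two reorderings: the data layout alone improves spatial locality, while the iteration order determines how soon each particle is reused.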

Corresponding author

Correspondence to John Mellor-Crummey.

Cite this article

Mellor-Crummey, J., Whalley, D. & Kennedy, K. Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings. International Journal of Parallel Programming 29, 217–247 (2001). https://doi.org/10.1023/A:1011119519789

Keywords

  • memory hierarchy optimization
  • data reordering
  • computation reordering
  • space-filling curves
  • multi-level blocking