
Restructuring Computations for Temporal Data Cache Locality

International Journal of Parallel Programming

Abstract

Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it is useful to trade increased computation costs for reduced memory costs. The contributions of this paper are threefold: we provide a detailed analysis of the memory performance of seven memory-intensive benchmarks; we describe Computation Regrouping, a source-level approach to improving the performance of memory-bound applications by increasing temporal locality to eliminate cache and TLB misses; and we demonstrate significant performance improvement by applying Computation Regrouping to our suite of seven benchmarks. Using Computation Regrouping, we observe a geometric mean speedup of 1.90, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.
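The general idea of reordering work so that computations touching the same data run close together in time can be illustrated with a small sketch. This is not the paper's implementation: the block size `BLOCK`, the function names, and the "block switch" counter (a crude stand-in for cache or TLB misses) are all assumptions made for illustration. The bucketing step is the extra computation traded for fewer trips to distant memory.

```python
from collections import defaultdict

BLOCK = 4  # hypothetical number of table entries that fit in one "cache-sized" block


def process_in_arrival_order(table, queries):
    """Baseline: handle each query as it arrives.

    Consecutive queries may touch different blocks, so we count every
    change of block as a proxy for a cache miss.
    """
    results = []
    switches = 0
    last_block = None
    for q in queries:
        block = q // BLOCK
        if block != last_block:
            switches += 1
            last_block = block
        results.append(table[q] * 2)  # stand-in for the real per-query work
    return results, switches


def process_regrouped(table, queries):
    """Regrouped: bucket queries by block, then finish one block before the next.

    The bucketing pass is additional computation; in exchange each block
    is visited exactly once while it is still "hot".
    """
    buckets = defaultdict(list)
    for pos, q in enumerate(queries):
        buckets[q // BLOCK].append((pos, q))

    results = [None] * len(queries)
    switches = 0
    for block in sorted(buckets):
        switches += 1
        for pos, q in buckets[block]:
            results[pos] = table[q] * 2  # same work, written back in arrival order
    return results, switches


table = list(range(16))
queries = [0, 13, 1, 14, 2, 15, 3, 12]  # alternates between two distant blocks

base, base_switches = process_in_arrival_order(table, queries)
regrouped, regrouped_switches = process_regrouped(table, queries)
print(base == regrouped, base_switches, regrouped_switches)
```

Both versions return the same results in the same order; only the order in which the work is performed changes. On real hardware the benefit depends on whether each group's working set actually fits in cache, which this toy counter does not model.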




Cite this article

Pingali, V.K., McKee, S.A., Hsieh, W.C. et al. Restructuring Computations for Temporal Data Cache Locality. International Journal of Parallel Programming 31, 305–338 (2003). https://doi.org/10.1023/A:1024556711058
