Code and Data Transformations for Improving Shared Cache Performance on SMT Processors

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2858)

Abstract.

Simultaneous multithreaded processors use shared on-chip caches, which yield better cost-performance ratios. Sharing a cache between simultaneously executing threads causes excessive conflict misses. This paper proposes software solutions for dynamically partitioning the shared cache of an SMT processor, via the use of three methods originating in the optimizing compilers literature: dynamic tiling, copying and block data layouts. The paper presents an algorithm that combines these transformations and two runtime mechanisms to detect cache sharing between threads and react to it at runtime. The first mechanism uses minimal kernel extensions and the second mechanism uses information collected from the processor hardware counters. Our experimental results show that for regular, perfect loop nests, these transformations are very effective in coping with shared caches. When the caches are shared between threads from the same address space, performance is improved by 16-29% on average. Similar improvements are observed when the caches are shared between threads from different address spaces. To our knowledge, this is the first work to present an all-software approach for managing shared caches on SMT processors. It is also one of the first performance and program optimization studies conducted on a commercial SMT-based multiprocessor using Intel’s hyperthreading technology.
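The abstract names dynamic tiling and copying but gives no code, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a square-tile matrix multiply in which the tile edge is recomputed from the number of threads currently sharing the cache. The names `tile_edge` and `tiled_matmul`, the three-tiles-per-share sizing rule, and the way the sharer count is supplied are all hypothetical; the kernel extension and hardware-counter mechanisms that detect sharing are not modelled.

```python
import math


def tile_edge(cache_bytes, sharers, elem_bytes=8, arrays=3):
    """Hypothetical sizing rule: largest tile edge T such that `arrays`
    tiles of T x T elements fit in one thread's share of the cache."""
    share = cache_bytes // max(1, sharers)
    return max(1, math.isqrt(share // (arrays * elem_bytes)))


def tiled_matmul(a, b, n, t):
    """Multiply two n x n matrices (lists of lists) using t x t tiles.

    The current tile of `a` is copied into its own buffer before use,
    which stands in for the copying transformation; in a compiled
    language the copy would make the tile contiguous in memory.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            # Copy the (ii, kk) tile of a; slices clamp at the matrix edge.
            a_tile = [row[kk:kk + t] for row in a[ii:ii + t]]
            for jj in range(0, n, t):
                for i, a_row in enumerate(a_tile):
                    c_row = c[ii + i]
                    for k, a_ik in enumerate(a_row):
                        b_row = b[kk + k]
                        for j in range(jj, min(jj + t, n)):
                            c_row[j] += a_ik * b_row[j]
    return c
```

Under these assumptions, a thread that learns it is sharing the cache with one other thread would call `tile_edge` again with `sharers=2` and continue with the smaller tile, so that the combined working sets of both threads are more likely to fit in the cache.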

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nikolopoulos, D.S. (2003). Code and Data Transformations for Improving Shared Cache Performance on SMT Processors. In: Veidenbaum, A., Joe, K., Amano, H., Aiso, H. (eds) High Performance Computing. ISHPC 2003. Lecture Notes in Computer Science, vol 2858. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39707-6_5

  • DOI: https://doi.org/10.1007/978-3-540-39707-6_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20359-9

  • Online ISBN: 978-3-540-39707-6

  • eBook Packages: Springer Book Archive
