Code and Data Transformations for Improving Shared Cache Performance on SMT Processors

Nikolopoulos, Dimitrios S.

doi:10.1007/978-3-540-39707-6_5

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2858)

Abstract.

Simultaneous multithreaded processors use shared on-chip caches, which yield better cost-performance ratios. Sharing a cache between simultaneously executing threads causes excessive conflict misses. This paper proposes software solutions for dynamically partitioning the shared cache of an SMT processor, via the use of three methods originating in the optimizing compilers literature: dynamic tiling, copying and block data layouts. The paper presents an algorithm that combines these transformations and two runtime mechanisms to detect cache sharing between threads and react to it at runtime. The first mechanism uses minimal kernel extensions and the second mechanism uses information collected from the processor hardware counters. Our experimental results show that for regular, perfect loop nests, these transformations are very effective in coping with shared caches. When the caches are shared between threads from the same address space, performance is improved by 16-29% on average. Similar improvements are observed when the caches are shared between threads from different address spaces. To our knowledge, this is the first work to present an all-software approach for managing shared caches on SMT processors. It is also one of the first performance and program optimization studies conducted on a commercial SMT-based multiprocessor using Intel’s hyperthreading technology.
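
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind tiling under cache sharing, not the paper's method. It tiles a matrix multiplication and shrinks the tile when the thread assumes it owns only a fraction of the cache. The names `pick_tile_size` and `tiled_matmul`, the 512 KB cache size in the usage lines, and the rule that three tiles must fit in the thread's share are all assumptions made for illustration.

```python
import math


def pick_tile_size(cache_bytes, sharers, elem_bytes=8, tiles_in_flight=3):
    """Return a square tile edge for one thread (hypothetical heuristic).

    Assumption: the thread owns cache_bytes / sharers of the cache and
    wants `tiles_in_flight` tiles (one each of A, B and C) to fit there.
    """
    share = cache_bytes // sharers
    elems_per_tile = share // (tiles_in_flight * elem_bytes)
    return max(1, math.isqrt(elems_per_tile))


def tiled_matmul(a, b, n, tile):
    """Compute C = A * B for n x n lists of lists using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):              # tile rows of C
        for kk in range(0, n, tile):          # tile of the reduction index
            for jj in range(0, n, tile):      # tile columns of C
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


# Usage: a thread that detects one co-runner halves its assumed share,
# which yields a smaller tile than when it runs alone.
alone = pick_tile_size(512 * 1024, sharers=1)
shared = pick_tile_size(512 * 1024, sharers=2)
```

In this sketch the reaction to sharing is just a smaller tile edge. How sharing is detected (kernel extensions or hardware counters, per the abstract) is outside what the example models.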

Author information

Authors and Affiliations

Department of Computer Science, The College of William & Mary, McGlothlin-Street Hall, Williamsburg, VA 23187-8795, U.S.A.
Dimitrios S. Nikolopoulos

Editor information

Editors and Affiliations

Department of Computer Science, University of California (UCI), 3019 Donald Bren Hall, Irvine, CA 92697-3435, USA
Alex Veidenbaum
Department of Information and Computer Science, Faculty of Science, Nara Women’s University, Kitauoyanishi-machi, Nara-city, 630-8506, Nara, Japan
Kazuki Joe
Keio University, Hiyoshi, Kohoku, Yokohama, 223-8522, Kanagawa, Japan
Hideharu Amano
Tokyo University of Technology, 1404-1 Katakura, Hachioji, 192-0982, Tokyo, Japan
Hideo Aiso

About this paper

Cite this paper

Nikolopoulos, D.S. (2003). Code and Data Transformations for Improving Shared Cache Performance on SMT Processors. In: Veidenbaum, A., Joe, K., Amano, H., Aiso, H. (eds) High Performance Computing. ISHPC 2003. Lecture Notes in Computer Science, vol 2858. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39707-6_5

DOI: https://doi.org/10.1007/978-3-540-39707-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20359-9
Online ISBN: 978-3-540-39707-6
eBook Packages: Springer Book Archive