Exploring the Capacity of a Modern SMT Architecture to Deliver High Scientific Application Performance

  • Evangelia Athanasaki
  • Nikos Anastopoulos
  • Kornilios Kourtis
  • Nectarios Koziris
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4208)


Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that heterogeneity among simultaneously executed applications can yield significant performance gains under SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. In this paper, we explore these performance limits by evaluating the tradeoffs between ILP and thread-level parallelism (TLP) for various kinds of instruction streams. We evaluate and contrast speculative precomputation (SPR) and TLP techniques for a series of scientific codes executed on an SMT processor. We also examine the effect of thread synchronization mechanisms on multithreaded parallel applications executed on a single SMT processor. To support this evaluation, we also present results gathered from the processor's performance monitoring hardware.


Keywords: Multiple Thread; Data Chunk; Instruction Level Parallelism; Helper Thread; Instruction Stream
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Evangelia Athanasaki (1)
  • Nikos Anastopoulos (1)
  • Kornilios Kourtis (1)
  • Nectarios Koziris (1)
  1. School of Electrical and Computer Engineering, National Technical University of Athens
