Design and evaluation of a hierarchical decoupled architecture

Ro, Won W.; Crago, Stephen P.; Despain, Alvin M.; Gaudiot, Jean-Luc

doi:10.1007/s11227-006-8321-2

Published: December 2006

Volume 38, pages 237–259 (2006)

The Journal of Supercomputing

Abstract

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles to pipeline stalls. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely heavily on the predictability of future accesses and often fail when memory accesses exhibit little locality.

To address the long latency of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The design is motivated by the traditional decoupled architecture concept and its limits. The HiDISC approach adds a prefetching processor to a traditional access/execute architecture. Our design aims to provide low memory access latency by separating otherwise sequential code into three decoupled streams and executing each stream on its own dedicated processor. The three streams act in concert to mask long access latencies by supplying the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors.
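As a rough illustration of the three-stream idea, the toy Python sketch below splits a simple summation into a prefetching stream, an access stream, and a computation stream connected by a queue. It is a minimal sketch under stated assumptions, not the HiDISC hardware or the authors' compiler output: the names `MEMORY`, `prefetch_stream`, `access_stream`, `compute_stream`, and `run` are hypothetical, the cache is an unbounded dictionary, and the streams run one after another here, whereas the paper describes dedicated processors working concurrently.

```python
from collections import deque

# Toy main memory (hypothetical): address -> value.
MEMORY = {addr: addr * 10 for addr in range(8)}


def prefetch_stream(addresses, cache):
    """Lowest level: bring data into the cache ahead of its use."""
    for addr in addresses:
        cache[addr] = MEMORY[addr]


def access_stream(addresses, cache, load_queue):
    """Middle level: perform the loads and forward operands upward.

    Returns the number of cache misses this stream observed.
    """
    misses = 0
    for addr in addresses:
        if addr not in cache:
            misses += 1
            cache[addr] = MEMORY[addr]  # slow path: fetch from memory
        load_queue.append(cache[addr])
    return misses


def compute_stream(load_queue):
    """Top level: pure computation that only consumes queued operands."""
    total = 0
    while load_queue:
        total += load_queue.popleft()
    return total


def run(addresses, with_prefetch):
    """Run the three streams in sequence and return (result, misses)."""
    cache, load_queue = {}, deque()
    if with_prefetch:
        prefetch_stream(addresses, cache)
    misses = access_stream(addresses, cache, load_queue)
    return compute_stream(load_queue), misses


if __name__ == "__main__":
    trace = [3, 1, 4, 1, 5]
    print(run(trace, with_prefetch=False))  # access stream takes the misses
    print(run(trace, with_prefetch=True))   # prefetch stream hides them
```

In this sketch the computed result is the same with or without the prefetching stream; only the number of misses seen by the access stream changes, which is the effect the decoupling is meant to achieve.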

Detailed hardware design and performance evaluation are carried out through the development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model eliminates 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model that assumes a memory access latency of 200 CPU cycles, HiDISC improves performance by 17.2%.
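To put the IPC figure in terms of execution time, the short sketch below converts a relative IPC gain into the corresponding reduction in cycle count. It assumes the instruction count stays fixed, which the abstract does not state and which may not hold exactly for decoupled code; the function name `cycle_reduction` is hypothetical.

```python
def cycle_reduction(ipc_gain: float) -> float:
    """Fraction of cycles saved for a relative IPC gain.

    Assumes a fixed instruction count N: cycles = N / IPC, so raising
    IPC by a factor (1 + g) scales the cycle count by 1 / (1 + g).
    """
    return 1.0 - 1.0 / (1.0 + ipc_gain)


if __name__ == "__main__":
    # 15.8% is the overall IPC improvement reported in the abstract.
    print(f"{cycle_reduction(0.158):.1%}")  # prints 13.6%
```

Under that assumption, a 15.8% IPC improvement corresponds to roughly 13.6% fewer cycles for the same work.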

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, California State University, Northridge
Won W. Ro
Information Sciences Institute-East, University of Southern California, California
Stephen P. Crago
Department of Electrical Engineering, University of Southern California, California
Alvin M. Despain
Department of Electrical Engineering and Computer Science, University of California, Irvine
Jean-Luc Gaudiot

Corresponding author

Correspondence to Won W. Ro.

About this article

Cite this article

Ro, W.W., Crago, S.P., Despain, A.M. et al. Design and evaluation of a hierarchical decoupled architecture. J Supercomput 38, 237–259 (2006). https://doi.org/10.1007/s11227-006-8321-2

Issue Date: December 2006
DOI: https://doi.org/10.1007/s11227-006-8321-2
