Design and evaluation of a hierarchical decoupled architecture

The Journal of Supercomputing

Abstract

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles to pipeline stalls. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely strongly on the predictability of future accesses and often fail when memory accesses exhibit little locality.

To address the long-latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach adds a prefetching processor on top of a traditional access/execute architecture. Our design aims to provide low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on its own dedicated processor. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors.
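The three-stream idea can be illustrated with a small software model. The sketch below is only an assumption-laden toy, not the HiDISC hardware, ISA, or compiler: the function name `run_decoupled`, the `prefetch_distance` parameter, and the use of a Python set as a "cache" are all hypothetical, and no timing is modeled. It shows a single loop split into a prefetch stream that runs ahead, an access stream that performs the loads and forwards operands through a FIFO queue, and an execute stream that only computes.

```python
from collections import deque


def run_decoupled(memory, indices, prefetch_distance):
    """Toy model of three cooperating streams (hypothetical, untimed).

    Returns (result, misses), where misses counts accesses whose address
    was not already in the cache when the access stream reached it.
    """
    n = len(indices)
    cache = set()          # addresses currently cached (never evicted here)
    load_queue = deque()   # FIFO from the access stream to the execute stream
    prefetched = 0         # number of iterations the prefetch stream has covered
    misses = 0
    total = 0

    for i in range(n):
        # Prefetch stream: stay up to prefetch_distance iterations ahead.
        # A distance of 0 disables prefetching entirely.
        limit = min(n, i + prefetch_distance) if prefetch_distance > 0 else 0
        while prefetched < limit:
            cache.add(indices[prefetched])
            prefetched += 1

        # Access stream: perform the load and forward the operand.
        addr = indices[i]
        if addr not in cache:
            misses += 1
            cache.add(addr)
        load_queue.append(memory[addr])

        # Execute stream: pure computation on queued operands.
        total += load_queue.popleft() * 2

    return total, misses


memory = [10, 20, 30, 40]
indices = [3, 0, 2, 0]
print(run_decoupled(memory, indices, 0))  # no prefetching
print(run_decoupled(memory, indices, 2))  # prefetch stream runs two iterations ahead
```

In this toy the computed result is the same either way; only the number of misses seen by the access stream changes, which is the effect the decoupling is meant to achieve. The figures it prints say nothing about the performance results reported in the paper.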

The detailed hardware design and performance evaluation were carried out by developing an architectural simulator and compilation tools. Our performance results show that the proposed HiDISC model reduces cache misses by 19.7% and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model that assumes a memory access latency of 200 CPU cycles, HiDISC improves performance by 17.2%.

Correspondence to Won W. Ro.

About this article

Cite this article

Ro, W.W., Crago, S.P., Despain, A.M. et al. Design and evaluation of a hierarchical decoupled architecture. J Supercomput 38, 237–259 (2006). https://doi.org/10.1007/s11227-006-8321-2

  • DOI: https://doi.org/10.1007/s11227-006-8321-2
