The Journal of Supercomputing, Volume 38, Issue 3, pp 237–259

Design and evaluation of a hierarchical decoupled architecture

  • Won W. Ro
  • Stephen P. Crago
  • Alvin M. Despain
  • Jean-Luc Gaudiot

Abstract

The speed gap between the processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles to pipeline stalls. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely heavily on the predictability of future accesses and often fail when memory accesses exhibit little locality.

To address the long latency of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, HiDISC (Hierarchical Decoupled Instruction Stream Computer). The design is motivated by the traditional decoupled architecture concept and its limits. The HiDISC approach adds a prefetching processor on top of a traditional access/execute architecture. Our design aims to provide low memory access latency by separating and decoupling otherwise sequential code into three streams and executing each stream on its own dedicated processor. The three streams act in concert to mask long access latencies by delivering the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors.
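The three-stream organization described above can be illustrated with a toy software analogy. The sketch below is not the HiDISC hardware or its compiler output: the function names (`prefetch_stream`, `access_stream`, `compute_stream`, `run_decoupled`), the set-based cache model, and the queue are all assumptions made for illustration, and the streams run one after another here rather than concurrently on separate processors.

```python
from collections import deque


def prefetch_stream(indices, cache):
    """Hypothetical prefetching stream: runs furthest ahead and only
    touches addresses so that later loads find them in the cache."""
    for addr in indices:
        cache.add(addr)


def access_stream(memory, indices, cache, load_queue):
    """Hypothetical access stream: performs the loads and forwards the
    data to the computation stream through a queue. Returns the number
    of (modeled) cache misses it observed."""
    misses = 0
    for addr in indices:
        if addr not in cache:
            misses += 1
            cache.add(addr)
        load_queue.append(memory[addr])
    return misses


def compute_stream(load_queue, weights):
    """Hypothetical computation stream: consumes operands from the queue
    and never issues a memory access itself."""
    total = 0
    for w in weights:
        total += w * load_queue.popleft()
    return total


def run_decoupled(memory, indices, weights, use_prefetch=True):
    """Run the three streams in sequence (a simplification of running
    them concurrently) and return (result, misses seen by access stream)."""
    cache = set()          # toy cache: unbounded set of resident addresses
    load_queue = deque()   # toy queue between access and compute streams
    if use_prefetch:
        prefetch_stream(indices, cache)
    misses = access_stream(memory, indices, cache, load_queue)
    result = compute_stream(load_queue, weights)
    return result, misses
```

Calling `run_decoupled(memory, indices, weights)` with and without `use_prefetch` yields the same computed result; only the miss count seen by the access stream changes, which is the effect the additional prefetching stream is meant to have in this simplified model.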

The detailed hardware design and performance evaluation are carried out with an architectural simulator and compiling tools developed for this purpose. Our performance results show that the proposed HiDISC model reduces cache misses by 19.7% and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model that assumes a memory access latency of 200 CPU cycles, HiDISC improves performance by 17.2%.

Keywords

Decoupled architectures; Memory latency hiding; Multithreading; Parallel architecture; Instruction level parallelism; Data prefetching; Speculative execution

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Won W. Ro (1)
  • Stephen P. Crago (2)
  • Alvin M. Despain (3)
  • Jean-Luc Gaudiot (4)
  1. Department of Electrical and Computer Engineering, California State University, Northridge
  2. Information Sciences Institute-East, University of Southern California
  3. Department of Electrical Engineering, University of Southern California
  4. Department of Electrical Engineering and Computer Science, University of California, Irvine
