Skip to main content

Advertisement

Log in

PS directory: a scalable multilevel directory cache for CMPs

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

As the number of cores increases in current and future chip-multiprocessor (CMP) generations, coherence protocols must rely on novel hardware structures to scale in terms of performance, power, and area. Systems that use directory information for coherence purposes are currently the most scalable alternative. This paper studies the important differences between the directory behavior of private and shared blocks, which claim for a separate management of both types of blocks at the directory. We propose the PS directory, a two-level directory cache that keeps the reduced number of frequently accessed shared entries in a small and fast first-level cache, namely Shared cache, and uses a larger and slower second-level Private cache to track the large amount of private blocks. Entries in the Private cache do not implement the sharer vector, which allows important silicon area savings. Speed and area reasons suggest the use of eDRAM technology, much denser but slower than SRAM technology, for the Private cache, which in turn brings energy savings. Experimental results for a 16-core CMP show that, compared to a conventional directory, the PS directory improves performance by 14 \(\%\) while reducing silicon area and energy consumption by 34 and 27 \(\%\), respectively. Also, compared to the state-of-the-art Multi-Grain Directory, the PS directory apart from increasing performance, it reduces power by 18.7 \(\%\), and provides more scalability in terms of area.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

Notes

  1. Experimental conditions are defined in Sect. 6.

  2. Results for 0.125\(\times \) are not shown because CACTI is not able to provide results for so small caches.

References

  1. Acacio ME, González J, García JM, Duato J (2001) A new scalable directory architecture for large-scale multiprocessors. In: Proceedings of 7th international symposium on high-performance computer architecture (HPCA), pp 97–106

  2. Acacio ME, González J, García JM, Duato J (2005) A two-level directory architecture for highly scalable cc-NUMA multiprocessors. IEEE Trans Parallel Distrib Syst (TPDS) 16(1):67–79

    Article  Google Scholar 

  3. Agarwal N, Krishna T, Peh L-S, Jha NK (2009) GARNET: a detailed on-chip network model inside a full-system simulator. In: Proceedings of IEEE international symposium on performance analysis of systems and software (ISPASS), pp 33–42

  4. Barroso LA, Gharachorloo K, McNamara R et al (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In: Proceedings of 27th international symposium on computer architecture (ISCA), pp 12–14

  5. Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of 17th international conference on parallel architectures and compilation techniques (PACT), pp 72–81

  6. Chaiken D, Kubiatowicz J, Agarwal A (1991) LimitLESS directories: a scalable cache coherence scheme. In: 4th international conference on architectural support for programming language and operating systems (ASPLOS), pp 224–234

  7. Chen G (1993) Slid: a cost-effective and scalable limited-directory scheme for cache coherence. In: 5th international conference on parallel architectures and languages Europe (PARLE), pp 341–352

  8. Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B (2010) Cache hierarchy and memory subsystem of the AMD opteron processor. IEEE Micro 30(2):16–29

    Article  Google Scholar 

  9. Cuesta B, Ros A, Gómez ME, Robles A, Duato J (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In: Proceedings of 38th international symposium on computer architecture (ISCA), pp 93–103

  10. Ferdman M, Lotfi-Kamran P, Balet K, Falsafi B (2011) Cuckoo directory: a scalable directory for many-core systems. In: 17th international symposium on high-performance computer architecture (HPCA), pp 169–180

  11. Guo S-L, Wang H-X, Xue Y-B, Li C-M, Wang D-S (2010) Hierarchical cache directory for cmp. J Comput Sci Technol 25(2):246–256

    Article  Google Scholar 

  12. Gupta A, Weber W-D, Mowry TC (1990) Reducing memory traffic requirements for scalable directory-based cache coherence schemes. In: Proceedings of international conference on parallel processing (ICPP), pp 312–321

  13. Kalla R, Sinharoy B, Starke WJ, Floyd M (2010) POWER7: IBMs next-generation server processor. IEEE Micro 30(2):7–15

    Article  Google Scholar 

  14. Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In: Proceedings of 10th international conference on architectural support for programming language and operating systems (ASPLOS), pp 211–222

  15. Luk C-K, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi VJ, Hazelwood K (2005) Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of ACM SIGPLAN conference on programming language design and implementation (PLDI), June 2005, pp 190–200

  16. Magnusson PS, Christensson M, Eskilson J et al (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58

    Article  Google Scholar 

  17. Martin MM, Sorin DJ, Beckmann BM et al (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. Comput Archit News 33(4):92–99

    Article  Google Scholar 

  18. Marty MR, Hill MD (2007) Virtual hierarchies to support server consolidation. In: Proceedings of 34th international symposium on computer architecture (ISCA), pp 46–56

  19. Marty MR, Hill MD (2008) Virtual hierarchies. IEEE Micro 28(1):99–109

    Article  Google Scholar 

  20. Matick RE, Schuster SE (2005) Logic-based eDRAM: origins and rationale for use. IBM J Res Dev 49(1):145–165

    Article  Google Scholar 

  21. Muralimanohar N, Balasubramonian R, Jouppi NP (2009) Cacti 6.0, HP Labs, technical report HPL-2009-85

  22. O’Krafka BW, Newton AR (1990) An empirical evaluation of two memory-efficient directory methods. In: Proceedings of 17th international symposium on computer architecture (ISCA), pp 138–147

  23. Ros A, Acacio ME, García JM (2010) A scalable organization for distributed directories. J Syst Archit (JSA) 56(2–3):77–87

    Article  Google Scholar 

  24. Ros A, Cuesta B, Fernández-Pascual R, Gómez ME, Acacio ME, Robles A, García JM, Duato J (2012) Extending magny-cours cache coherence. IEEE Trans Comput (TC) 61(5):593–606

    Article  Google Scholar 

  25. Sanchez D, Kozyrakis C (2012) SCD: a scalable coherence directory with flexible sharer set encoding. In: Proceedings of 18th international sympoium on high-performance computer architecture (HPCA), pp 129–140

  26. Shah M, Barreh J, Brooks J et al (2007) UltraSPARC T2: a highly-threaded, power-efficient, SPARC SoC. In: Proceedings of IEEE Asian solid-state circuits conference, pp 22–25

  27. Sinharoy B, Kalla RN, Tendler JM, Eickemeyer RJ, Joyner JB (2005) Power5 system microarchitecture. IBM J Res Dev 49(4/5):505–521

    Article  Google Scholar 

  28. Tendler JM, Dodson JS, Fields JS, Le H, Sinharoy B (2002) POWER4 system microarchitecture. IBM J Res Dev 46(1):5–25

    Article  Google Scholar 

  29. Valero A, Sahuquillo J, Petit S, Lorente V, Canal R, López P, Duato J (2009) An hybrid eDRAM/SRAM macrocell to implement first-level data caches. In: Proceedings of 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 213–221

  30. Valls JJ, Ros A, Sahuquillo J, Gómez ME, Duato J (2012) PS-Dir: a scalable two-level directory cache. In: Proceedings of 21st international conference on parallel architectures and compilation techniques (PACT), pp 451–452

  31. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of 22nd international symposium on computer architecture (ISCA), pp 24–36

  32. Wu X, Li J, Zhang L, Speight E, Rajamony R, Xie Y (2009) Hybrid cache architecture with disparate memory technologies. In: Proceedings of 36th international symposium on computer architecture (ISCA), pp 34–45

  33. Zebchuk J, Falsafi B, Moshovos A (2013) Multi-grain coherence directories. In: Proceedings of 46th IEEE/ACM international symposium on microarchitecture (MICRO), pp 359–370

  34. Zebchuk J, Srinivasan V, Qureshi MK, Moshovos A (2009) A tagless coherence directory. In: Proceedings of 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 423–434

  35. Zhao H, Shriraman A, Dwarkadas S, Srinivasan V (2011) SPATL: Honey, I shrunk the coherence directory. In: Proceedings of 20th international conference on parallel architectures and compilation techniques (PACT), pp 148–157

Download references

Acknowledgments

This work has been jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2012-38341-C04-01 and the Fundación Seneca-Agencia de Ciencia y Tecnología de la Región de Murcia under the project Jóvenes Líderes en Investigación 18956/JLI/13.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Joan J. Valls.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Valls, J.J., Ros, A., Sahuquillo, J. et al. PS directory: a scalable multilevel directory cache for CMPs. J Supercomput 71, 2847–2876 (2015). https://doi.org/10.1007/s11227-014-1332-5

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-014-1332-5

Keywords

Navigation