An Energy-Efficient 3D Stacked STT-RAM Cache Architecture for CMPs



In this chapter, we show how to adopt spin-transfer torque random access memory (STT-RAM) as an on-chip L2 cache to achieve better performance and lower energy consumption than traditional L2 cache designs. STT-RAM is a promising memory technology for on-chip caches because of its fast read access, high density, and non-volatility. With 3D heterogeneous integration, it becomes feasible and cost-efficient to stack STT-RAM atop conventional chip multiprocessors (CMPs). However, STT-RAM has two main disadvantages: long write latency and high write energy. We first stack STT-RAM-based L2 caches directly atop CMPs and compare them against their SRAM counterparts in terms of performance and energy. We observe that direct STT-RAM stacking can harm chip performance because of the long write latency and high write energy. To solve this problem, we then propose two architectural techniques: a read-preemptive write buffer and an SRAM–STT-RAM hybrid L2 cache. Simulation results show that our optimized STT-RAM L2 cache improves performance by 4.91% and reduces power by 73.5% compared to a conventional SRAM L2 cache of similar area.
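The summary above names a read-preemptive write buffer but does not spell out its policy. As a rough illustration only, the Python sketch below assumes one plausible policy: writes wait in a buffer, reads are issued first, and a read that arrives during a write cancels that write, which is retried later. The latencies `READ_CYCLES` and `WRITE_CYCLES`, the single-bank model, and the function name `simulate` are all assumptions made for this example, not values or interfaces taken from the chapter.

```python
from collections import deque

READ_CYCLES = 2    # assumed read latency, illustrative only
WRITE_CYCLES = 10  # assumed (longer) write latency, illustrative only


def simulate(requests, preemptive):
    """Simulate one cache bank; return (read_latencies, finish_cycle).

    requests: list of (arrival_cycle, kind) sorted by arrival,
    where kind is "R" (read) or "W" (write).
    """
    pending = deque(requests)
    read_queue = deque()   # arrival cycles of reads waiting for the bank
    write_buffer = 0       # number of buffered writes not yet retired
    bank = None            # (kind, arrival_cycle, finish_cycle) in progress
    read_latencies = []
    cycle = 0
    while pending or read_queue or write_buffer or bank:
        # 1. Accept requests that have arrived by this cycle.
        while pending and pending[0][0] <= cycle:
            arrival, kind = pending.popleft()
            if kind == "R":
                read_queue.append(arrival)
            else:
                write_buffer += 1
        # 2. Retire the operation in progress once it has finished.
        if bank and bank[2] <= cycle:
            if bank[0] == "R":
                read_latencies.append(cycle - bank[1])
            bank = None
        # 3. Assumed preemption rule: a waiting read cancels an ongoing
        #    write, which returns to the buffer and restarts later.
        if preemptive and bank and bank[0] == "W" and read_queue:
            write_buffer += 1
            bank = None
        # 4. Issue the next operation; reads take priority over writes.
        if bank is None:
            if read_queue:
                bank = ("R", read_queue.popleft(), cycle + READ_CYCLES)
            elif write_buffer:
                write_buffer -= 1
                bank = ("W", cycle, cycle + WRITE_CYCLES)
        cycle += 1
    return read_latencies, cycle - 1


# A write arrives at cycle 0 and a read at cycle 1.
trace = [(0, "W"), (1, "R")]
print(simulate(trace, preemptive=False))  # read waits behind the write
print(simulate(trace, preemptive=True))   # read cancels the write
```

Under these assumed latencies, preemption shortens the read's wait, while the cancelled write restarts from scratch and the trace finishes slightly later. This sketch does not bound preemption, so a steady stream of reads could delay writes indefinitely; a practical design would need to limit that.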


Keywords: Cache Line; Data Migration; Leakage Power; NMOS Transistor; Phase Change Memory



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Guangyu Sun (1)
  • Xiangyu Dong (2)
  • Yiran Chen (3)
  • Yuan Xie (4)

  1. Peking University, Beijing, China
  2. Qualcomm Research Lab, San Diego, USA
  3. University of Pittsburgh, Pittsburgh, USA
  4. Pennsylvania State University, University Park, USA
