Influence of Stacked 3D Memory/Cache Architectures on GPUs

  • Ahmed Al Maashri
  • Guangyu Sun
  • Xiangyu Dong
  • Yuan Xie
  • Narayanan Vijaykrishnan
Part of the Integrated Circuits and Systems book series (ICIR)


This chapter investigates the architectural design of a 3D die-stacked Graphics Processing Unit (GPU). It discusses the design space of the system and presents empirical results that quantify the expected performance gain of such a system. The chapter also discusses the cost, power, and thermal aspects of the proposed designs.


Power Consumption, Graphic Processing Unit, Thermal Profile, Cache Size, Access Latency



The work presented in this chapter was supported in part by NSF grants 0903432 and 0702617.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Ahmed Al Maashri (1)
  • Guangyu Sun
  • Xiangyu Dong
  • Yuan Xie
  • Narayanan Vijaykrishnan
  1. The Pennsylvania State University, University Park, USA
