Toward multi-programmed workloads with different memory footprints: a self-adaptive last level cache scheduling scheme

  • Jingyu Zhang
  • Minyi Guo
  • Chentao Wu
  • Yuanyi Chen
Research Paper

Abstract

With the emergence of 3D-stacking technology, dynamic random-access memory (DRAM) can be stacked on chips to architect a DRAM last level cache (LLC). Compared with static random-access memory (SRAM), DRAM is larger but slower. Much existing work has been devoted to improving workload performance by using SRAM and stacked DRAM together, ranging from improving the SRAM structure to optimizing cache tag and data access. In contrast, little attention has been paid to designing an LLC scheduling scheme for multi-programmed workloads with different memory footprints. Motivated by this, we propose a self-adaptive LLC scheduling scheme that utilizes SRAM and 3D-stacked DRAM efficiently, achieving better workload performance. The scheme employs (1) an evaluation unit, which probes and evaluates cache information while programs execute; and (2) an implementation unit, which self-adaptively chooses SRAM or DRAM. To make the scheduling scheme work correctly, we develop a data migration policy. We conduct extensive experiments to evaluate the performance of the proposed scheme. Experimental results show that our method improves multi-programmed workload performance by up to 30% compared with state-of-the-art methods.
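The abstract describes an evaluation unit that probes cache statistics at run time and an implementation unit that routes each program to SRAM or stacked DRAM. The minimal sketch below illustrates that general idea only; it is not the authors' implementation, and the class name, thresholds, and footprint-based decision rule are all illustrative assumptions.

```python
# Hypothetical sketch of a footprint-driven LLC scheduler.
# All names and thresholds are assumptions, not the paper's design.

class SelfAdaptiveLLCScheduler:
    """Routes each program's cache fills to SRAM (small, fast)
    or 3D-stacked DRAM (large, slow) based on observed footprint."""

    def __init__(self, footprint_threshold_kb=1024):
        # Illustrative cutoff separating "small" and "large" footprints.
        self.footprint_threshold_kb = footprint_threshold_kb
        self.footprints = {}  # program id -> estimated footprint (KB)

    def observe(self, pid, footprint_kb):
        # "Evaluation unit": record cache information probed while
        # the program executes (here, just its memory footprint).
        self.footprints[pid] = footprint_kb

    def choose_level(self, pid):
        # "Implementation unit": small footprints fit in SRAM;
        # large footprints go to the bigger stacked-DRAM cache.
        fp = self.footprints.get(pid, 0)
        return "SRAM" if fp <= self.footprint_threshold_kb else "DRAM"


sched = SelfAdaptiveLLCScheduler()
sched.observe("small_job", 256)
sched.observe("big_job", 8192)
print(sched.choose_level("small_job"))  # SRAM
print(sched.choose_level("big_job"))    # DRAM
```

A real scheme would also need the data migration policy the abstract mentions, to move blocks between the two caches when a program's footprint estimate changes.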

Keywords

3D-stacking technology · cache architecture · cache scheduling · multi-programmed workloads · memory system · performance optimization

Notes

Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program) (Grant No. 2015CB352403), the National Natural Science Foundation of China (Grant Nos. 61261160502, 61272099, 61303012, 61572323, 61628208), the Scientific Innovation Act of STCSM (Grant No. 13511504200), the EU FP7 CLIMBER Project (Grant No. PIRSES-GA-2012-318939), and the CCF-Tencent Open Fund. We would like to thank the anonymous reviewers for their careful work and instructive suggestions. We also thank Dr. Zhi-Jie Wang for his warm help and advice.

Copyright information

© Science China Press and Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  • Jingyu Zhang¹
  • Minyi Guo¹
  • Chentao Wu¹
  • Yuanyi Chen¹

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
