Design Automation for Embedded Systems, Volume 20, Issue 1, pp 65–92

CaPPS: cache partitioning with partial sharing for multi-core embedded systems

  • Wei Zang
  • Ann Gordon-Ross

Abstract

As the number of cores in chip multi-processor systems increases, the contention over shared last-level cache (LLC) resources increases, thus making LLC optimization critical, especially for embedded systems with strict area/energy/power constraints. We propose cache partitioning with partial sharing (CaPPS), which reduces LLC contention using cache partitioning and improves utilization with sharing configuration. Sharing configuration enables the partitions to be privately allocated to a single core, partially shared with a subset of cores, or fully shared with all cores based on the co-executing applications’ requirements. CaPPS imposes low hardware overhead and affords an extensive design space to increase optimization potential. To facilitate fast design space exploration, we develop an analytical model to quickly estimate the miss rates of all CaPPS configurations using the applications’ isolated LLC access traces to predict runtime LLC contention. Experimental results demonstrate that the analytical model estimates cache miss rates with an average error of only 0.73% and with an average speedup of 3505× as compared to a cycle-accurate simulator. Due to CaPPS’s extensive design space, CaPPS can reduce the average LLC miss rate by as much as 25% as compared to baseline configurations and as much as 14–17% as compared to prior works.
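
The three sharing modes named in the abstract (private, partially shared, fully shared) can be illustrated with a minimal sketch. It is not the authors' implementation: the partition granularity (one cache way per partition), the 4-core, 8-way example, and the names `classify_partition`, `partitions_per_core`, and `example_config` are all assumptions introduced here for illustration.

```python
from typing import Dict, FrozenSet, List


def classify_partition(sharers: FrozenSet[int], num_cores: int) -> str:
    """Label a partition by how many cores are allowed to use it."""
    if not sharers or not all(0 <= c < num_cores for c in sharers):
        raise ValueError("sharers must be a non-empty set of valid core ids")
    if len(sharers) == 1:
        return "private"
    if len(sharers) == num_cores:
        return "fully shared"
    return "partially shared"


def partitions_per_core(config: List[FrozenSet[int]], num_cores: int) -> Dict[int, int]:
    """Count how many partitions each core may access under a configuration."""
    counts = {core: 0 for core in range(num_cores)}
    for sharers in config:
        for core in sharers:
            counts[core] += 1
    return counts


# Hypothetical configuration: 4 cores, 8 partitions (assumed to be cache ways).
# Each entry is the set of cores allowed to use that partition.
example_config: List[FrozenSet[int]] = [
    frozenset({0}), frozenset({0}),              # private to core 0
    frozenset({1}),                              # private to core 1
    frozenset({2, 3}), frozenset({2, 3}),        # partially shared by cores 2 and 3
    frozenset({1, 2}),                           # partially shared by cores 1 and 2
    frozenset({0, 1, 2, 3}), frozenset({0, 1, 2, 3}),  # fully shared
]

if __name__ == "__main__":
    for index, sharers in enumerate(example_config):
        print(index, sorted(sharers), classify_partition(sharers, 4))
    print(partitions_per_core(example_config, 4))
```

Representing each partition by its set of sharers makes the size of the design space visible: every partition can in principle be assigned any non-empty subset of cores, which is why a fast miss-rate estimate matters for exploring the configurations.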

Keywords

Cache memories · Modeling techniques · Optimization · Performance evaluation

Notes

Acknowledgments

This work was supported by the National Science Foundation (CNS-0953447). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. SK Hynix Memory Solution, San Jose, USA
  2. Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA