Fusion Coherence: Scalable Cache Coherence for Heterogeneous Kilo-Core System

  • Songwen Pei
  • Myoung-Seo Kim
  • Jean-Luc Gaudiot
  • Naixue Xiong
Part of the Communications in Computer and Information Science book series (CCIS, volume 451)

Abstract

Future heterogeneous systems will integrate CPUs and GPUs on a single chip to achieve both high computing performance and high throughput. Such systems are generally expected to abandon the current discrete design and build a unified shared memory system, avoiding explicit data movement among CPUs and GPUs connected by a high-throughput NoC.

We propose Fusion Coherence, a scalable cache coherence solution for a Heterogeneous Kilo-core System Architecture that integrates CPUs and GPUs on a single chip. It mitigates the coherence bandwidth side effects of GPU memory requests as well as the overhead of copying data between CPU and GPU memories. Fusion Coherence coalesces the L3 data caches of the CPUs and GPUs on top of a unified physical memory, and further integrates a region directory and a cuckoo directory into a two-level cache coherence directory without modifying the cache coherence protocol. Experimental results with a subset of the Rodinia benchmarks show that it effectively decreases the overhead of data transfer and achieves an average execution speedup of 2.4x. The highest speedup is approximately 4x for data-intensive applications.
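To make the idea of a two-level directory more concrete, the following is a minimal sketch of one plausible arrangement: a coarse region level that filters requests, backed by a cuckoo-hashed block level. It is not the design evaluated in the paper. The region and block sizes, the hash functions, the table sizes, and the names `CuckooDirectory`, `FusionDirectorySketch`, and `access` are all assumptions made for illustration.

```python
# Illustrative sketch only. All sizes, hash functions and names below are
# assumptions, not taken from the paper.

REGION_BITS = 10  # assumed: 1 KiB regions
BLOCK_BITS = 6    # assumed: 64-byte cache blocks


class CuckooDirectory:
    """Block-level directory using cuckoo hashing over two tables."""

    def __init__(self, slots_per_table=8, max_kicks=16):
        self.n = slots_per_table
        self.max_kicks = max_kicks
        self.tables = [[None] * self.n, [None] * self.n]

    def _slot(self, way, block):
        # Two simple, distinct hash functions (assumed for illustration).
        return block % self.n if way == 0 else (block // self.n) % self.n

    def lookup(self, block):
        """Return the sharer set for a block, or None if it is not tracked."""
        for way in (0, 1):
            entry = self.tables[way][self._slot(way, block)]
            if entry is not None and entry[0] == block:
                return entry[1]
        return None

    def add_sharer(self, block, core):
        """Record a sharer. Returns a displaced (block, sharers) victim or None."""
        sharers = self.lookup(block)
        if sharers is not None:
            sharers.add(core)
            return None
        entry, way = (block, {core}), 0
        for _ in range(self.max_kicks):
            idx = self._slot(way, entry[0])
            # Place the entry and pick up whatever occupied the slot.
            entry, self.tables[way][idx] = self.tables[way][idx], entry
            if entry is None:
                return None
            way ^= 1  # the displaced entry retries in the other table
        # Gave up: a real directory would invalidate the victim's cached copies.
        return entry


class FusionDirectorySketch:
    """Region level in front of the block-level cuckoo directory."""

    def __init__(self):
        self.regions = {}  # region id -> owning cluster, or "shared"
        self.blocks = CuckooDirectory()

    def access(self, addr, core, cluster):
        region, block = addr >> REGION_BITS, addr >> BLOCK_BITS
        owner = self.regions.setdefault(region, cluster)
        if owner == cluster:
            # Region touched by one cluster only: no block-level entry needed.
            return "region-private"
        # Region touched by both clusters: fall back to block-level tracking.
        # A real design would also have to recover the state of blocks that
        # were cached while the region was private; this sketch omits that.
        self.regions[region] = "shared"
        self.blocks.add_sharer(block, core)
        return "block-tracked"
```

In this sketch, calling `access(addr, core, cluster)` repeatedly from a single cluster stays at the region level, and the cuckoo tables are consulted only once a second cluster touches the same region. That filtering is the kind of coherence-traffic reduction the abstract attributes to combining the two directories.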

Keywords

Fusion Coherence; Fusion Directory; Two-level Cache Directories; Heterogeneous Kilo-core System; Cache Coherence

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Songwen Pei (1, 2, 3)
  • Myoung-Seo Kim (3)
  • Jean-Luc Gaudiot (3)
  • Naixue Xiong (4)

  1. Department of Computer Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
  2. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  3. Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
  4. School of Computer Science, Colorado Technical University, Colorado Springs, USA
