An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs

Abstract

The interconnect mechanisms (shared bus or crossbar) used in current chip multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to larger numbers of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and thus require alternative solutions to cache coherence. In this article, we investigate a novel, cost-effective mechanism for supporting shared-memory parallel applications that forgoes hardware-maintained cache coherence. The mechanism rests on two key ideas: the mapping of lines to physical caches is done at the page level with OS support, and the hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and by extending the system to support multi-level cache hierarchies. Results show that the choice of implementation of multi-level cache hierarchies can have a significant impact on performance.
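The two key ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PageHomeTable` and `access`, the 4 KiB page size, and the first-touch placement policy are all assumptions made for illustration. The sketch only shows the general shape of the scheme, on the reading that if every page is mapped to a single cache, any tile that needs a line from that page either hits its own cache or issues a remote cache access, so no line is ever replicated.

```python
from dataclasses import dataclass, field
from typing import Dict

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the article


@dataclass
class PageHomeTable:
    """Hypothetical OS-side table: each page maps to the one tile that may cache it."""
    num_tiles: int
    home: Dict[int, int] = field(default_factory=dict)

    def home_tile(self, addr: int, requester: int) -> int:
        page = addr // PAGE_SIZE
        # Assumed policy: the first tile to touch a page becomes its home.
        return self.home.setdefault(page, requester)


def access(table: PageHomeTable, addr: int, requester: int) -> str:
    """Classify an access as served by the local cache or by a remote cache."""
    if not 0 <= requester < table.num_tiles:
        raise ValueError("unknown tile id")
    home = table.home_tile(addr, requester)
    return "local" if home == requester else f"remote:{home}"


table = PageHomeTable(num_tiles=4)
print(access(table, 0x1000, requester=0))  # tile 0 touches page 1 first
print(access(table, 0x1008, requester=2))  # tile 2 reads the same page
print(access(table, 0x2000, requester=2))  # tile 2 touches page 2 first
```

In this sketch the mapping is kept at page granularity, so the table holds one entry per page rather than one per cache line, which is the kind of bookkeeping an OS can plausibly maintain alongside its page tables.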

Keywords

Many-core architectures, Cache coherence, Shared memory

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. School of Informatics, University of Edinburgh, Edinburgh, UK