Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs

Kavadias, Stamatis; Katevenis, Manolis; Zampetakis, Michail; Nikolopoulos, Dimitrios S.

doi:10.1007/s10766-011-0173-6

Open access
Published: 01 June 2011

Volume 40, pages 583–604, (2012)
International Journal of Parallel Programming

Abstract

Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds – the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, a technique that enables software-configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multiword transfer initiation, completion notification for software-selected sets of arbitrary-size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype and measured the logic overhead of basic NI functionality over a cache-only design to be less than 20%. We also evaluate on-chip communication performance on the prototype and, through simulation of CMPs with up to 128 cores, the performance of synchronization functions. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
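The idea of a completion notification for a software-selected set of transfers can be illustrated with a small software model. The sketch below is only a toy, not the hardware interface described in the article: the class name `CompletionCounter`, its methods, and the byte-counting scheme are assumptions introduced here for illustration. It assumes that software arms a counter with the total size of the transfer set, that acknowledgements may arrive in pieces and in any order, and that a single notification fires once all bytes are accounted for.

```python
class CompletionCounter:
    """Toy software model of a completion-notification counter.

    Hypothetical illustration only: names and behavior are assumptions,
    not the NI interface of the article.
    """

    def __init__(self, expected_bytes, on_complete):
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes   # bytes still outstanding for the whole set
        self.on_complete = on_complete    # stands in for the notification action
        self.fired = False

    def acknowledge(self, nbytes):
        """Record that nbytes belonging to some transfer in the set have arrived."""
        if self.fired:
            raise RuntimeError("counter already completed")
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("acknowledgement does not match outstanding bytes")
        self.remaining -= nbytes
        if self.remaining == 0:
            # Every transfer in the set is complete: notify exactly once.
            self.fired = True
            self.on_complete()


def demo():
    notifications = []
    # Three transfers of different sizes tracked as one set: 64 + 256 + 32 bytes.
    counter = CompletionCounter(64 + 256 + 32, lambda: notifications.append("done"))
    # Acknowledgements arrive in pieces that do not line up with transfer boundaries.
    for chunk in (32, 128, 64, 128):
        counter.acknowledge(chunk)
    return counter.remaining, notifications


if __name__ == "__main__":
    print(demo())
```

Counting bytes rather than transfers is one way a single counter could cover transfers of arbitrary size that are split into several packets; whether the actual design counts bytes, words, or packets is not stated in the abstract.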

Author information

Authors and Affiliations

Foundation for Research & Technology - Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Crete, Greece
Stamatis Kavadias, Manolis Katevenis, Michail Zampetakis & Dimitrios S. Nikolopoulos

Corresponding author

Correspondence to Stamatis Kavadias.

Additional information

All the authors are members of HiPEAC.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Kavadias, S., Katevenis, M., Zampetakis, M. et al. Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs. Int J Parallel Prog 40, 583–604 (2012). https://doi.org/10.1007/s10766-011-0173-6

Received: 02 November 2010
Accepted: 13 May 2011
Published: 01 June 2011
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10766-011-0173-6
