Abstract
Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, a technique that enables software-configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multiword transfer initiation, completion notification for software-selected sets of arbitrary-size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype and measured the logic overhead of basic NI functionality, relative to a cache-only design, at less than 20%. We also evaluate on-chip communication performance on the prototype, and use simulation of CMPs with up to 128 cores to evaluate the performance of synchronization functions. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
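To make the idea of a completion notification for a software-selected set of transfers more concrete, the following is a minimal software sketch, not the hardware design described in the paper. It assumes one plausible model: software arms a counter with the total number of bytes it expects across a group of transfers, each completed transfer is acknowledged against the counter, and a notification fires once when the count reaches zero. The names CompletionCounter, acknowledge, and on_complete are hypothetical and chosen only for illustration.

```python
class CompletionCounter:
    """Toy software model of a completion-notification counter.

    Hypothetical illustration only: the class, its methods, and the
    byte-counting scheme are assumptions, not the paper's interface.
    """

    def __init__(self, expected_bytes, on_complete):
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes   # bytes still outstanding
        self.on_complete = on_complete    # callback standing in for a notification
        self.fired = False

    def acknowledge(self, nbytes):
        """Record nbytes of completed transfer; notify once when none remain."""
        if self.fired:
            raise RuntimeError("counter already completed")
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("acknowledged size is out of range")
        self.remaining -= nbytes
        if self.remaining == 0:
            self.fired = True
            self.on_complete()


# Usage: three transfers of different sizes form one software-selected set.
notifications = []
counter = CompletionCounter(
    expected_bytes=256,
    on_complete=lambda: notifications.append("done"),
)
for size in (64, 128, 64):
    counter.acknowledge(size)
```

In this sketch the transfers may be of any size and may complete in any order; only the byte total matters, which is what lets a single notification cover the whole set.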
Additional information
All the authors are members of HiPEAC.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Kavadias, S., Katevenis, M., Zampetakis, M. et al. Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs. Int J Parallel Prog 40, 583–604 (2012). https://doi.org/10.1007/s10766-011-0173-6
DOI: https://doi.org/10.1007/s10766-011-0173-6