Abstract
Computer systems show a continuously increasing degree of parallelism in all areas. Stagnating single-thread performance and power constraints prevent a reversal of this trend; on the contrary, current projections indicate that the trend towards parallelism will accelerate. In cluster computing, scalability, and therefore the achievable degree of parallelism, is limited by the network interconnect and its characteristics: latency, message rate, overlap, and bandwidth. While most interconnection networks focus on improving bandwidth, many applications are also highly sensitive to latency, message rate, and overlap. We present an interconnection network called EXTOLL, which is specifically designed to improve these characteristics rather than focusing solely on bandwidth. The key techniques to achieve this are designing EXTOLL as an integral part of the HPC system, providing dedicated support for multi-core environments, and designing and optimizing EXTOLL from scratch for the needs of high-performance computing. The most important parts of EXTOLL are the network interface and the network switch, a crucial resource when scaling the network. EXTOLL's network interface provides dedicated support for small messages in the form of eager communication and for bulk transfers in the form of rendezvous communication. Support for small messages is optimized mainly for high message rates and low latencies, while bulk transfers are optimized for the amount of overlap achievable between communication and computation. EXTOLL is based entirely on FPGA technology, both for the network interface and for switching. In this work we present a case for accelerated communication: instead of using FPGAs to speed up computational processes, we employ them to speed up communication.
We show that, despite the inferior performance characteristics of FPGAs compared to ASIC solutions, communication tasks can be dramatically accelerated, thus reducing overall execution time.
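The eager/rendezvous split described in the abstract can be made concrete with a toy LogGP-style cost model. The parameters below (one-way latency, bandwidth, and a three-latency rendezvous handshake) are illustrative assumptions for the sketch, not EXTOLL measurements:

```python
# Illustrative cost model (invented parameters, not EXTOLL data) for the two
# protocols named above: eager for small messages, rendezvous for bulk
# transfers whose data movement can overlap with computation.

L = 1.0e-6      # assumed one-way latency in seconds
G = 1.0 / 1e9   # assumed seconds per byte (i.e. 1 GB/s bandwidth)

def eager_time(nbytes):
    # Eager: the sender pushes the payload immediately.
    # Cost is one latency plus serialization of the payload.
    return L + nbytes * G

def rendezvous_time(nbytes, compute=0.0):
    # Rendezvous: a request/ack handshake costs extra latencies up front,
    # but the bulk transfer can then proceed while the host computes, so
    # only the portion of the transfer not hidden by computation is visible.
    handshake = 3 * L
    transfer = nbytes * G
    visible = max(transfer - compute, 0.0)
    return handshake + visible

small, big = 64, 1_000_000
# Eager wins for small messages (no handshake overhead):
assert eager_time(small) < rendezvous_time(small)
# With enough overlappable computation, rendezvous hides the bulk transfer:
assert rendezvous_time(big, compute=1e-3) < eager_time(big)
```

The crossover behavior is what motivates optimizing the small-message path for latency and message rate while optimizing the bulk path for overlap.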
© 2013 Springer Science+Business Media, LLC
Nüssle, M., Fröning, H., Kapferer, S., Brüning, U. (2013). Accelerate Communication, not Computation!. In: Vanderbauwhede, W., Benkrid, K. (eds) High-Performance Computing Using FPGAs. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1791-0_17
Print ISBN: 978-1-4614-1790-3
Online ISBN: 978-1-4614-1791-0