Receive-Side Notification for Enhanced RDMA in FPGA Based Networks
Abstract
FPGAs are rapidly gaining traction in HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in how best to support their communications, since standard protocols are known to hinder their performance greatly, either by requiring CPU intervention or by consuming excessive FPGA logic. Hence, the community is moving towards custom-made solutions. This paper analyses an optimization to our custom, reliable interconnect with connectionless transport: a mechanism to register and track inbound RDMA communication at the receive side. In this way, it provides completion notifications directly to the remote node, which saves a round-trip latency. The entire mechanism is designed to sit within the fabric of the FPGA, requiring no software intervention. Our solution reduces the latency of a receive operation by around 20% for small message sizes (4 KB) over a single hop (longer distances would see an even greater improvement). Synthesis results over a wide parameter range confirm that this optimization is scalable in terms of both the number of concurrent outstanding RDMA operations and the maximum message size.
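The receive-side tracking idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `ReceiveTracker`, `register`, and `on_packet`, the `(src, op_id)` key, and the byte-counting completion rule are all hypothetical and chosen for illustration. It shows a bounded table of outstanding inbound operations that raises a local notification once all expected bytes have arrived, so the receiver does not have to wait for a separate message from the sender.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    expected: int      # total bytes announced for this operation (assumed known up front)
    received: int = 0  # bytes seen so far


class ReceiveTracker:
    """Toy model of a receive-side table of outstanding RDMA operations."""

    def __init__(self, max_outstanding):
        # A fixed capacity stands in for a bounded hardware table (assumption).
        self.max_outstanding = max_outstanding
        self.table = {}
        self.notifications = []

    def register(self, src, op_id, expected):
        """Record a new inbound operation; refuse duplicates or a full table."""
        key = (src, op_id)
        if key in self.table or len(self.table) >= self.max_outstanding:
            return False
        self.table[key] = Entry(expected)
        return True

    def on_packet(self, src, op_id, nbytes):
        """Account for one arriving packet; return True when the operation completes."""
        entry = self.table.get((src, op_id))
        if entry is None:
            return False  # unknown operation: ignored in this sketch
        entry.received += nbytes
        if entry.received < entry.expected:
            return False
        # All bytes are in: free the slot and notify locally.
        del self.table[(src, op_id)]
        self.notifications.append((src, op_id))
        return True


tracker = ReceiveTracker(max_outstanding=4)
tracker.register(src=1, op_id=7, expected=4096)
for _ in range(4):
    done = tracker.on_packet(src=1, op_id=7, nbytes=1024)
print(done, tracker.notifications)
```

In this model a 4 KB transfer split into four 1 KB packets completes on the fourth packet, at which point the entry is released and can be reused by another operation.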
Keywords
FPGA, Transport layer, Micro-architecture, Reliability