Receive-Side Notification for Enhanced RDMA in FPGA Based Networks

  • Joshua Lant
  • Andrew Attwood
  • Javier Navaridas
  • Mikel Lujan
  • John Goodacre
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11479)

Abstract

FPGAs are rapidly gaining traction in the domain of HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in terms of how best to support their communications, since standard protocols are known to hinder their performance greatly, either by requiring CPU intervention or by consuming excessive FPGA logic. Hence, the community is moving towards custom-made solutions. This paper analyses an optimization to our custom, reliable interconnect with connectionless transport: a mechanism to register and track inbound RDMA communication at the receive side. This way, it provides completion notifications directly to the remote node, which saves a round-trip latency. The entire mechanism is designed to sit within the fabric of the FPGA, requiring no software intervention. Our solution is able to reduce the latency of a receive operation by around 20% for small message sizes (4 KB) over a single hop (longer distances would see an even greater improvement). Results from synthesis over a wide parameter range confirm that this optimization is scalable both in terms of the number of concurrent outstanding RDMA operations and the maximum message size.
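The abstract does not give implementation details, so the following is only a minimal software sketch of the general idea of registering and tracking inbound RDMA transfers at the receive side. It is not the authors' FPGA design. The class `ReceiveTracker`, its methods, the `(source, transfer id)` key and the `max_outstanding` limit are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (source node, transfer id): an assumed way to identify a transfer


@dataclass
class ReceiveTracker:
    """Toy model: count down the bytes of each registered inbound transfer."""
    max_outstanding: int = 4  # assumed cap on concurrently tracked transfers
    pending: Dict[Key, int] = field(default_factory=dict)   # bytes still expected
    notifications: List[Key] = field(default_factory=list)  # completed transfers

    def register(self, src: int, transfer_id: int, total_bytes: int) -> bool:
        """Start tracking a transfer; refuse if invalid, duplicate, or table full."""
        key = (src, transfer_id)
        if total_bytes <= 0 or key in self.pending:
            return False
        if len(self.pending) >= self.max_outstanding:
            return False
        self.pending[key] = total_bytes
        return True

    def on_packet(self, src: int, transfer_id: int, payload_bytes: int) -> bool:
        """Account for one arriving packet; return True when the transfer completes."""
        key = (src, transfer_id)
        if key not in self.pending:
            return False  # unregistered traffic is ignored in this sketch
        remaining = self.pending[key] - payload_bytes
        if remaining > 0:
            self.pending[key] = remaining
            return False
        # All expected bytes have arrived: free the entry and raise a local
        # completion notification instead of waiting for a message from the sender.
        del self.pending[key]
        self.notifications.append(key)
        return True


# Example: a 4 KB transfer arriving as four 1 KB packets.
tracker = ReceiveTracker(max_outstanding=2)
tracker.register(src=7, transfer_id=1, total_bytes=4096)
for _ in range(4):
    done = tracker.on_packet(src=7, transfer_id=1, payload_bytes=1024)
```

In this model the completion is detected locally when the final packet arrives, which is the property the abstract credits with saving a round trip. Retransmission, packet ordering and the hardware table layout are outside the scope of the sketch.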

Keywords

FPGA · Transport layer · Micro-architecture · Reliability

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Joshua Lant (1), email author
  • Andrew Attwood (1)
  • Javier Navaridas (1)
  • Mikel Lujan (1)
  • John Goodacre (1)

  1. University of Manchester, Manchester, UK