ProOnE: a general-purpose protocol onload engine for multi- and many-core architectures

Special Issue Paper


Modern high-end computing systems use specialized offload engines to accelerate various aspects of their processing. For example, high-speed networks such as InfiniBand, Quadrics and Myrinet use specialized hardware to offload network processing and thereby improve performance. However, such hardware units are expensive, and their manufacturing complexity increases exponentially with the number and complexity of the tasks they offload. On the other hand, the proliferation of multi- and many-core processors into the general desktop and laptop markets is steadily driving their cost down through economies of scale. To take advantage of these benefits of multi/many-core architectures, we propose, design and evaluate ProOnE, a general-purpose Protocol Onload Engine. ProOnE dedicates a small subset of the available cores on a multi-core CPU to “onload” various tasks instead of “offloading” them to specialized hardware. The general-purpose processing capabilities of multi-core architectures allow ProOnE to be designed in a flexible, extensible and scalable manner, while benefiting from the falling costs of general-purpose CPUs. In this paper, we onload onto ProOnE several tasks relevant to communication sub-systems such as MPI that are too complex for current hardware offload engines to support, and demonstrate significant benefits in terms of overlap of computation and communication and improved application performance.
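The onload idea described in the abstract can be illustrated with a minimal sketch. This is not ProOnE's implementation: the names `OnloadEngine`, `OnloadRequest`, `simulated_send` and `run_demo` are hypothetical, a single worker thread stands in for a dedicated core, and a plain function stands in for protocol processing. Core pinning and any real MPI calls are deliberately left out.

```python
import queue
import threading


class OnloadRequest:
    """Hypothetical handle for a task handed to the onload worker."""

    def __init__(self):
        self._done = threading.Event()
        self._result = None
        self._error = None

    def wait(self):
        # Block until the worker has finished the task, then return its result.
        self._done.wait()
        if self._error is not None:
            raise self._error
        return self._result


class OnloadEngine:
    """Hypothetical sketch: one dedicated worker thread plays the onload core."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # The worker does nothing except drain the task queue.
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            func, args, request = item
            try:
                request._result = func(*args)
            except Exception as exc:  # report failures to the caller
                request._error = exc
            request._done.set()

    def submit(self, func, *args):
        # Hand the task to the worker and return immediately.
        request = OnloadRequest()
        self._tasks.put((func, args, request))
        return request

    def shutdown(self):
        self._tasks.put(None)
        self._worker.join()


def simulated_send(payload):
    # Stand-in for protocol processing; returns the number of bytes "sent".
    return len(payload)


def run_demo():
    engine = OnloadEngine()
    request = engine.submit(simulated_send, b"x" * 1024)
    # The application keeps computing while the worker handles the task.
    local = sum(i * i for i in range(1000))
    sent = request.wait()
    engine.shutdown()
    return local, sent
```

Calling `run_demo()` submits one task, performs a local computation while the worker handles it, and then waits for completion. In CPython the global interpreter lock limits true parallelism between threads, so the sketch shows the structure of onloading (submit, overlap, wait) rather than its performance.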


Protocol offload/onload, Many-core, Multi-core





Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Department of Computer Science and Engineering, Ohio State University, Columbus, USA
  2. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA
