Scalable NIC Architecture to Support Offloading of Large Scale MPI Barrier
MPI collective communication overhead dominates the communication cost on large-scale parallel computers, so the scalability and latency of collective operations are critical for next-generation machines. This paper proposes a fast and scalable barrier offload approach that supports millions of compute cores. In our approach, the host MPI driver packs the barrier operation sequence into a barrier "descriptor", which is pushed to the NIC (network interface). The NIC then completes the barrier on its own by following the algorithm encoded in the descriptor. Our approach leverages an enhanced dissemination algorithm suited to current large-scale networks. We show that our approach achieves both barrier performance and scalability, especially on large-scale computer systems. This paper also proposes an extensible and easy-to-implement NIC architecture that supports barrier offload as well as other communication patterns.
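To make the descriptor idea concrete, the following is a minimal sketch of the per-rank peer schedule for the classic dissemination barrier of Hensgen, Finkel and Manber: in round k, rank i signals rank (i + 2^k) mod P and waits on rank (i - 2^k) mod P. It is not the enhanced variant proposed in the paper, whose details the abstract does not give. The function names and the list-of-pairs layout are hypothetical, chosen for illustration only; they do not represent the paper's descriptor format.

```python
def dissemination_schedule(rank, nprocs):
    """Return one (send_to, recv_from) pair per round for `rank`.

    Hypothetical stand-in for the operation sequence a host driver
    might pack into a barrier descriptor (layout is an assumption).
    """
    if nprocs < 1 or not 0 <= rank < nprocs:
        raise ValueError("rank must satisfy 0 <= rank < nprocs")
    # Number of rounds is ceil(log2(nprocs)), computed with integers.
    rounds = 0
    while (1 << rounds) < nprocs:
        rounds += 1
    return [
        ((rank + (1 << k)) % nprocs, (rank - (1 << k)) % nprocs)
        for k in range(rounds)
    ]


def simulate_barrier(nprocs):
    """Check that after all rounds every rank has heard, directly or
    transitively, from every other rank."""
    schedules = [dissemination_schedule(r, nprocs) for r in range(nprocs)]
    known = [{r} for r in range(nprocs)]
    for k in range(len(schedules[0])):
        # Snapshot so that all ranks exchange state simultaneously.
        snapshot = [set(s) for s in known]
        for r in range(nprocs):
            _, recv_from = schedules[r][k]
            known[r] |= snapshot[recv_from]
    return all(len(s) == nprocs for s in known)
```

For 8 ranks, `dissemination_schedule(0, 8)` yields three rounds with distances 1, 2 and 4. The schedule depends only on the rank and the process count, so it can be computed once on the host and handed off as a fixed sequence, which is the property that makes this kind of algorithm a natural candidate for offload.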
Keywords: barrier offload, dissemination algorithm, MPI collective communication