Finepoints: Partitioned Multithreaded MPI Communication
The MPI multithreading model has historically been difficult to optimize because the interface it provides to threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide better performance than current approaches and even outperform single-threaded MPI.
In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication; we then introduce finepoints as a hybrid that combines the best features of each existing MPI communication model. In addition, "partitioned communication" created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this approach both innovative and useful, now and in the future.
To test our hypothesis, we implement a finepoints library and show improvements over a state-of-the-art, multithreading-optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate up to a 12× reduction in wait time for completion of send operations. We also demonstrate the new model on a nuclear reactor physics neutron-transport proxy application, where it provides up to a 26.1% improvement in communication time and up to a 4.8% improvement in runtime over the best-performing MPI communication mode, single-threaded MPI.
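The general idea behind partitioned communication can be illustrated with a toy model: one logical message buffer is divided into partitions, each thread fills its own partition and marks it ready independently, and the send counts as complete once every partition has been marked. The sketch below is not the finepoints library or any MPI API; `PartitionedSend`, `fill_and_mark_ready`, and `run_demo` are hypothetical names, no network transfer takes place, and the small lock around the readiness counter merely stands in for what a real implementation might do with an atomic counter or hardware support.

```python
import threading


class PartitionedSend:
    """Toy model of one logical message split into equal-size partitions.

    Hypothetical illustration only: nothing is actually sent anywhere.
    """

    def __init__(self, num_partitions, partition_size):
        self.num_partitions = num_partitions
        self.partition_size = partition_size
        self.buffer = [0] * (num_partitions * partition_size)
        self.ready_order = []             # partition indices in the order they became ready
        self.complete = threading.Event() # set once every partition is ready
        self._remaining = num_partitions
        self._lock = threading.Lock()     # guards only the readiness bookkeeping

    def fill_and_mark_ready(self, index, value):
        # Each thread writes only its own partition, so the writes to the
        # buffer itself need no lock.
        start = index * self.partition_size
        for i in range(start, start + self.partition_size):
            self.buffer[i] = value
        # Marking the partition ready is the only shared-state update.
        with self._lock:
            self.ready_order.append(index)
            self._remaining -= 1
            if self._remaining == 0:
                self.complete.set()


def run_demo(num_threads=4, partition_size=8):
    """Run one thread per partition and wait for the whole message."""
    send = PartitionedSend(num_threads, partition_size)
    threads = [
        threading.Thread(target=send.fill_and_mark_ready, args=(t, t + 1))
        for t in range(num_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    send.complete.wait()
    return send


if __name__ == "__main__":
    result = run_demo()
    print("partitions became ready in order:", result.ready_order)
```

The point of the sketch is that threads never contend on the message data itself, only on a small readiness record, which contrasts with treating every communication call as a lock-protected critical region.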