Measuring Multithreaded Message Matching Misery

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11014)


MPI usage patterns are changing as applications move toward fully multithreaded runtimes, yet the impact of these patterns on MPI message matching is not well studied. In particular, MPI’s mechanism for receiver-side data placement, message matching, can be stressed by the increased message volume and nondeterminism that multithreading introduces. While there has been significant developer interest in, and work toward, an efficient MPI interface for multithreaded access, no study has shown how these usage patterns affect communication and matching behavior. In this paper, we present a framework for studying the effects of multithreading on MPI message matching. The framework allows us to explore the implications of common communication patterns and thread-level decompositions. We present a study of these impacts on the architectures of two Top 10 supercomputers (NERSC’s Cori and LANL’s Trinity). The resulting data provides a baseline for evaluating reasonable matching-engine queue lengths, search depths, and queue drain times under the multithreaded model. The study also highlights surprising results on the challenge that message matching poses for multithreaded application performance.
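The matching behavior the abstract refers to can be illustrated with a minimal sketch. This is not the paper's framework or any real MPI implementation (those perform the scan in optimized C inside the runtime); it is a toy posted-receive queue, with hypothetical names, that shows why queue length and search depth are the metrics of interest: an arriving message is matched by a FIFO linear scan, and many concurrently posted receives (as under multithreading) deepen that scan.

```python
ANY_SOURCE = -1  # stands in for MPI_ANY_SOURCE
ANY_TAG = -1     # stands in for MPI_ANY_TAG


class PostedReceiveQueue:
    """Toy model of a receiver-side MPI posted-receive queue."""

    def __init__(self):
        self._entries = []          # posted receives, in post order
        self.max_search_depth = 0   # deepest linear scan observed

    def post(self, source, tag):
        """Post a receive; wildcards are allowed for source and tag."""
        self._entries.append((source, tag))

    def match(self, source, tag):
        """Match an arriving message against posted receives.

        FIFO linear scan: the first posted receive whose (source, tag)
        is compatible with the message wins, mirroring MPI's ordering
        rule. Returns the scan depth of the match, or None.
        """
        for depth, (s, t) in enumerate(self._entries, start=1):
            self.max_search_depth = max(self.max_search_depth, depth)
            if s in (ANY_SOURCE, source) and t in (ANY_TAG, tag):
                del self._entries[depth - 1]
                return depth
        return None


if __name__ == "__main__":
    q = PostedReceiveQueue()
    # Threads posting receives for distinct tags drive up queue length...
    for tag in range(8):
        q.post(ANY_SOURCE, tag)
    # ...and a message for the last-posted tag forces a full-depth scan.
    assert q.match(source=0, tag=7) == 8
    assert q.max_search_depth == 8
```

Under this model, a message that matches the last of N posted receives costs an N-entry scan, which is why the study measures queue lengths and search depths under multithreaded communication patterns.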



Copyright information

© National Technology & Engineering Solutions of Sandia, LLC 2018

Authors and Affiliations

  1. Sandia National Laboratories, Center for Computing Research, Albuquerque, USA
  2. Department of Computer Science, University of New Mexico, Albuquerque, USA
