Trace-Based Detection of Lock Contention in MPI One-Sided Communication

  • Marc-André Hermanns
  • Markus Geimer
  • Bernd Mohr
  • Felix Wolf
Conference paper


Performance analysis is an essential part of developing HPC applications. Developers need adequate tools to evaluate design and implementation decisions if they are to build efficient parallel applications. It is therefore crucial that tools support the available language and library features as completely as possible, so that design decisions are not negatively influenced by the level of available tool support. The Message Passing Interface (MPI) supports three basic communication paradigms: point-to-point, collective, and one-sided. Each targets, and excels at, a specific application scenario. While current performance tools support the first two quite well, one-sided communication is often neglected. In our earlier work, we narrowed this gap by showing how wait states in MPI one-sided communication using active-target synchronization can be detected at large scale with our trace-based message replay technique. Further extending our work on the detection of progress-related wait states in ARMCI, this paper presents an improved infrastructure that detects not only progress-related wait states but also wait states due to lock contention in MPI passive-target synchronization. We present an event-based definition of lock contention and the trace-based algorithm to detect it, as well as initial results with a micro-benchmark and an application kernel scaling up to 65,536 processes.


Keywords: Message Passing Interface, Target Process, Origin Process, Memory Window, Event Trace



This work has been partly funded by the Excellence Initiative of the German federal and state governments. The authors gratefully acknowledge the computing time granted by the JARA-HPC Vergabegremium and the VSR commission and provided on the JARA-HPC Partition part of the supercomputer JUQUEEN [9] at Forschungszentrum Jülich.


  1. Adhianto, L., Banerjee, S., Fagan, M.W., Krentel, M., Marin, G., Mellor-Crummey, J.M., Tallent, N.R.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurr. Comput.: Pract. Exper. 22(6), 685–701 (2010). doi:10.1002/cpe.1553
  2. Böhme, D., Geimer, M., Wolf, F., Arnold, L.: Identifying the root causes of wait states in large-scale parallel applications. In: Proceedings of the 39th International Conference on Parallel Processing (ICPP), San Diego, CA, pp. 90–100 (2010). doi:10.1109/ICPP.2010.18
  3. Böhme, D., de Supinski, B.R., Geimer, M., Schulz, M., Wolf, F.: Scalable critical-path based performance analysis. In: Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Shanghai (2012)
  4. Chapman, B.M., Curtis, A., Pophale, S., Poole, S.W., Kuehn, J.A., Koelbel, C., Smith, L.: Introducing OpenSHMEM: SHMEM for the PGAS community. In: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS ’10, pp. 2:1–2:3. ACM, New York, NY (2010). doi:10.1145/2020373.2020375
  5. Geimer, M., Wolf, F., Wylie, B.J.N., Mohr, B.: A scalable tool architecture for diagnosing wait states in massively parallel applications. Parallel Comput. 35(7), 375–388 (2009). doi:10.1016/j.parco.2009.02.003
  6. Hermanns, M.A., Geimer, M., Mohr, B., Wolf, F.: Scalable detection of MPI-2 remote memory access inefficiency patterns. Int. J. High Perform. Comput. Appl. 26(3), 227–236 (2012). doi:10.1177/1094342011406758
  7. Hermanns, M.A., Krishnamoorthy, S., Wolf, F.: A scalable infrastructure for the performance analysis of passive target synchronization. Parallel Comput. 39(3), 132–145 (2013). doi:10.1016/j.parco.2012.09.002
  8. Intel Corp.: Intel VTune Amplifier XE (2012)
  9. Jülich Supercomputing Centre: JUQUEEN: IBM Blue Gene/Q Supercomputer System at the Jülich Supercomputing Centre. J. Large-Scale Res. Facil. 1(A1) (2015). doi:10.17815/jlsrf-1-18
  10. Kühnal, A., Hermanns, M.A., Mohr, B., Wolf, F.: Specification of inefficiency patterns for MPI-2 one-sided communication. In: Proceedings of the 12th Euro-Par Conference, Dresden. Lecture Notes in Computer Science, vol. 4128, pp. 47–62. Springer, Berlin (2006)
  11. MPI Forum (ed.): MPI: A Message-Passing Interface Standard. Version 3.1. MPI Forum (2015)
  12. Nieplocha, J., Carpenter, B.: ARMCI: a portable remote memory copy library for distributed array libraries and compiler run-time systems. In: Proceedings of the 11 IPPS/SPDP’99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, vol. 1586, pp. 533–546. Springer, London (1999). doi:10.1007/BFb0097937
  13. Tallent, N.R., Mellor-Crummey, J.M., Porterfield, A.: Analyzing lock contention in multithreaded applications. SIGPLAN Not. 45(5), 269–280 (2010). doi:10.1145/1837853.1693489
  14. Tallent, N.R., Vishnu, A., Van Dam, H., Daily, J., Kerbyson, D.J., Hoisie, A.: Diagnosing the causes and severity of one-sided message contention. In: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pp. 130–139. ACM, New York, NY (2015). doi:10.1145/2688500.2688516
  15. Zounmevo, J.A., Zhao, X., Balaji, P., Gropp, W., Afsahi, A.: Nonblocking epochs in MPI one-sided communication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pp. 475–486. IEEE Press, Piscataway, NJ (2014). doi:10.1109/SC.2014.44

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marc-André Hermanns (1, corresponding author)
  • Markus Geimer (2)
  • Bernd Mohr (1)
  • Felix Wolf (3)

  1. JARA-HPC, Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany
  2. Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany
  3. Parallel Programming, TU Darmstadt, Darmstadt, Germany
