Skip to main content

Trace-Based Detection of Lock Contention in MPI One-Sided Communication

  • Conference paper
  • First Online:
Tools for High Performance Computing 2016

Abstract

Performance analysis is an essential part of the development process of HPC applications. Thus, developers need adequate tools to evaluate design and implementation decisions to effectively develop efficient parallel applications. Therefore, it is crucial that tools provide an as complete support as possible for the available language and library features to ensure that design decisions are not negatively influenced by the level of available tool support. The message passing interface (MPI) supports three basic communication paradigms: point-to-point, collective, and one-sided. Each of these targets and excels at a specific application scenario. While current performance tools support the first two quite well, one-sided communication is often neglected. In our earlier work, we were able to reduce this gap by showing how wait states in MPI one-sided communication using active-target synchronization can be detected at large scale using our trace-based message replay technique. Further extending our work on the detection of progress-related wait states in ARMCI, this paper presents an improved infrastructure that is capable of not only detecting progress-related wait states, but also wait states due to lock contention in MPI passive-target synchronization. We present an event-based definition of lock contention, the trace-based algorithm to detect it, as well as initial results with a micro-benchmark and an application kernel scaling up to 65,536 processes.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 109.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Adhianto, L., Banerjee, S., Fagan, M.W., Krentel, M., Marin, G., Mellor-Crummey, J.M., Tallent, N.R.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurr. Comput.: Pract. Exper. 22 (6), 685–701 (2010). doi:10.1002/cpe.1553. http://doi.wiley.com/10.1002/cpe.1553

    Google Scholar 

  2. Böhme, D., Geimer, M., Wolf, F., Arnold, L.: Identifying the root causes of wait states in large-scale parallel applications. In: Proceedings of the 39th International Conference on Parallel Processing (ICPP), San Diego, CA, pp. 90–100 (2010). doi:10.1109/ICPP.2010.18

    Google Scholar 

  3. Böhme, D., de Supinski, B.R., Geimer, M., Schulz, M., Wolf, F.: Scalable critical-path based performance analysis. In: Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Shanghai (2012)

    Google Scholar 

  4. Chapman, B.M., Curtis, A., Pophale, S., Poole, S.W., Kuehn, J.A., Koelbel, C., Smith, L., Curtis, T., Pophale, S., Poole, S.W., Kuehn, J.A., Koelbel, C., Smith, L., Curtis, A., Pophale, S., Poole, S.W., Kuehn, J.A., Koelbel, C., Smith, L.: Introducing OpenSHMEM: SHMEM for the PGAS community. In: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, no. c in PGAS ’10, pp. 2:1–2:3. ACM, New York, NY (2010). doi:10.1145/2020373.2020375. http://doi.acm.org/10.1145/2020373.2020375

  5. Geimer, M., Wolf, F., Wylie, B.J.N., Mohr, B.: A scalable tool architecture for diagnosing wait states in massively parallel applications. Parallel Comput. 35 (7), 375–388 (2009). doi:10.1016/j.parco.2009.02.003

    Article  Google Scholar 

  6. Hermanns, M.A., Geimer, M., Mohr, B., Wolf, F.: Scalable detection of MPI-2 remote memory access inefficiency patterns. Int. J. High Perform. Comput. Appl. 26 (3), 227–236 (2012). doi:10.1177/1094342011406758

    Article  Google Scholar 

  7. Hermanns, M.A., Krishnamoorthy, S., Wolf, F.: A scalable infrastructure for the performance analysis of passive target synchronization. Parallel Comput. 39 (3), 132–145 (2013). doi:10.1016/j.parco.2012.09.002. http://www.sciencedirect.com/science/article/pii/S0167819112000762

    Article  Google Scholar 

  8. Intel Corp.: Intel VTune Amplifier XE (2012). http://software.intel.com/en-us/intel-vtune-amplifier-xe

    Google Scholar 

  9. Jülich Supercomputing Centre: JUQUEEN: IBM Blue Gene/Q Supercomputer System at the Jülich Supercomputing Centre. J. Large-Scale Res. Facil. 1 (A1) (2015). doi:10.17815/jlsrf-1-18. http://dx.doi.org/10.17815/jlsrf-1-18

  10. Kühnal, A., Hermanns, M.A., Mohr, B., Wolf, F.: Specification of inefficiency patterns for MPI-2 one-sided communication. In: Proceedings of the 12th Euro-Par Conference, Dresden. Lecture Notes in Computer Science, vol. 4128, pp. 47–62. Springer, Berlin (2006)

    Google Scholar 

  11. MPI Forum (ed.): MPI: A Message-Passing Interface Standard. Version 3.1. MPI Forum (2015). http://www.mpi-forum.org/

  12. Nieplocha, J., Carpenter, B.: ARMCI: a portable remote memory copy library for distributed array libraries and compiler run-time systems. In: Proceedings of the 11 IPPS/SPDP’99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, vol. 1586, pp. 533–546. Springer, London (1999). doi:10.1007/BFb0097937. http://dl.acm.org/citation.cfm?id=645611.662053

  13. Tallent, N.R., Mellor-Crummey, J.M., Porterfield, A.: Analyzing lock contention in multithreaded applications. SIGPLAN Not. 45 (5), 269–280 (2010). doi:10.1145/1837853.1693489. http://doi.acm.org/10.1145/1837853.1693489

    Article  Google Scholar 

  14. Tallent, N.R., Vishnu, A., Van Dam, H., Daily, J., Kerbyson, D.J., Hoisie, A.: Diagnosing the causes and severity of one-sided message contention. In: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pp. 130–139. ACM, New York, NY (2015). doi:10.1145/2688500.2688516. http://doi.acm.org/10.1145/2688500.2688516

  15. Zounmevo, J.A., Zhao, X., Balaji, P., Gropp, W., Afsahi, A.: Nonblocking epochs in MPI one-sided communication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pp. 475–486. IEEE Press, Piscataway, NJ (2014). doi:10.1109/SC.2014.44. http://dx.doi.org/10.1109/SC.2014.44

Download references

Acknowledgements

This work has been partly funded by the Excellence Initiative of the German federal and state governments. The authors gratefully acknowledge the computing time granted by the JARA-HPC Vergabegremium and VSR commission provided on the JARA-HPC Partition part of the supercomputer JUQUEEN [9] at Forschungszentrum Jülich.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marc-André Hermanns .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hermanns, MA., Geimer, M., Mohr, B., Wolf, F. (2017). Trace-Based Detection of Lock Contention in MPI One-Sided Communication. In: Niethammer, C., Gracia, J., Hilbrich, T., Knüpfer, A., Resch, M., Nagel, W. (eds) Tools for High Performance Computing 2016. Springer, Cham. https://doi.org/10.1007/978-3-319-56702-0_6

Download citation

Publish with us

Policies and ethics