Fast Communication Trace Collection

Performance Analysis of Parallel Applications for HPC

Abstract

Communication patterns of parallel applications are important for optimizing application performance and designing better communication subsystems. Communication patterns can be extracted from communication traces. However, existing approaches are time-consuming and expensive because they generate communication traces by executing the entire parallel application on a full-scale system. We propose a novel technique, named Fact, which performs Fast Communication Trace collection for large-scale parallel applications on small-scale systems. Our idea is to reduce the original program to a program slice through static analysis and then execute the slice to acquire the communication traces. The approach is based on the observation that most computation and message contents in parallel applications are not relevant to their spatial and volume communication attributes and can therefore be removed for the purpose of communication trace collection. We have implemented Fact and evaluated it with the NPB programs and Sweep3D. The results show that Fact reduces resource consumption by two orders of magnitude in most cases. For example, Fact collects the communication traces of a 512-process Sweep3D run on a 4-node (32-core) platform in just 6.79 s, consuming 1.25 GB of memory, while the original program takes 256.63 s and consumes 213.83 GB of memory on a 32-node (512-core) platform.
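
To make the slicing idea concrete, consider a hand-written halo-exchange example of the kind the chapter targets. The sketch below is an illustration composed for this summary, not Fact's actual output, and the names nx, left, and right are invented. Its message size and communicating peers depend only on the rank, the process count, and a fixed problem size, so a slice can drop the numerical update entirely, keep the marked MPI calls and the statements they depend on, and still reproduce the spatial and volume attributes of the trace.

    /* Hypothetical halo exchange (illustration only, not Fact's output).
       A slice would keep the marked MPI call and the statements it
       depends on (nx, left, right) and remove the numeric kernel. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        int nx = 1024;                      /* volume attribute: message size */
        int left  = (rank - 1 + np) % np;   /* spatial attribute: peer ranks  */
        int right = (rank + 1) % np;

        double *sendbuf = calloc(nx, sizeof(double));  /* contents never read */
        double *recvbuf = calloc(nx, sizeof(double));

        for (int step = 0; step < 10; step++) {
            /* removed by the slice: the compute kernel updating the grid */

            /* kept by the slice: the call that defines volume and peers */
            MPI_Sendrecv(sendbuf, nx, MPI_DOUBLE, right, 0,
                         recvbuf, nx, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Because neither nx nor the peer computation reads the grid data, executing only these statements on a small system yields the same per-process send and receive records as the full run.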

Notes

  1. The communication volume is specified by the number of messages and the message size. The spatial attribute is characterized by the distribution of message sources and destinations. The temporal behavior is captured by the message generation rate (a minimal recording sketch for these attributes follows these notes).

  2. Marked MPI routines are executed at runtime, while unmarked ones are not. Their precise definitions are given in Sect. 2.4.
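
The attribute classes in Note 1 can be recorded with a standard PMPI interposition wrapper. The sketch below is a minimal, hypothetical example (it is not the collection mechanism described in this chapter, which obtains traces by running the slice): it logs one record per MPI_Send, covering volume (bytes), the spatial attribute (source and destination ranks), and a timestamp from which the generation rate can be derived.

    /* Minimal PMPI wrapper (illustrative assumption, not Fact's tracer):
       logs volume, spatial, and temporal attributes of every MPI_Send. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        int src, type_size;
        MPI_Comm_rank(comm, &src);           /* spatial: source rank       */
        MPI_Type_size(datatype, &type_size); /* volume: bytes per element  */
        fprintf(stderr, "SEND t=%.6f src=%d dst=%d bytes=%d\n",
                MPI_Wtime(), src, dest, count * type_size);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* forward */
    }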

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Fast Communication Trace Collection. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_2

  • DOI: https://doi.org/10.1007/978-981-99-4366-1_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4365-4

  • Online ISBN: 978-981-99-4366-1

  • eBook Packages: Computer Science, Computer Science (R0)
