Fast Communication Trace Collection

Performance Analysis of Parallel Applications for HPC

Abstract

Communication patterns of parallel applications are important for optimizing application performance and designing better communication subsystems. Communication patterns can be extracted from communication traces. However, existing approaches are time-consuming and expensive because they generate communication traces by executing the entire parallel application on a full-scale system. We propose a novel technique, named Fact, which performs Fast Communication Trace collection for large-scale parallel applications on small-scale systems. Our idea is to reduce the original program to a program slice through static analysis and then execute the slice to acquire the communication traces. The approach is based on the observation that most computation and message contents in parallel applications are not relevant to their spatial and volume communication attributes and can therefore be removed for the purpose of communication trace collection. We have implemented Fact and evaluated it with the NPB programs and Sweep3D. The results show that Fact reduces resource consumption by two orders of magnitude in most cases. For example, Fact collects the communication traces of a 512-process Sweep3D run on a 4-node (32-core) platform in just 6.79 s, consuming 1.25 GB of memory, while the original program takes 256.63 s and consumes 213.83 GB of memory on a 32-node (512-core) platform.
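
To make the slicing idea concrete, consider a hand-written halo-exchange example of the kind the chapter targets. The sketch below is an illustration composed for this summary, not Fact's actual output, and the names nx, left, and right are invented. Its message size and communicating peers depend only on the rank, the process count, and a fixed problem size, so a slice can drop the numerical update entirely, keep the marked MPI calls and the statements they depend on, and still reproduce the spatial and volume attributes of the trace.

    /* Hypothetical halo exchange (illustration only, not Fact's output).
       A slice would keep the marked MPI call and the statements it
       depends on (nx, left, right) and remove the numeric kernel. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        int nx = 1024;                      /* volume attribute: message size */
        int left  = (rank - 1 + np) % np;   /* spatial attribute: peer ranks  */
        int right = (rank + 1) % np;

        double *sendbuf = calloc(nx, sizeof(double));  /* contents never read */
        double *recvbuf = calloc(nx, sizeof(double));

        for (int step = 0; step < 10; step++) {
            /* removed by the slice: the compute kernel updating the grid */

            /* kept by the slice: the call that defines volume and peers */
            MPI_Sendrecv(sendbuf, nx, MPI_DOUBLE, right, 0,
                         recvbuf, nx, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Because neither nx nor the peer computation reads the grid data, executing only these statements on a small system yields the same per-process send and receive records as the full run.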

Notes

  1. The communication volume is specified by the number of messages and the message size. The spatial attribute is characterized by the distribution of message sources and destinations. The temporal behavior is captured by the message generation rate (a minimal recording sketch for these attributes follows these notes).

  2. Marked MPI routines are executed at runtime, while unmarked ones are not. Their precise definitions are given in Sect. 2.4.
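
The attribute classes in Note 1 can be recorded with a standard PMPI interposition wrapper. The sketch below is a minimal, hypothetical example (it is not the collection mechanism described in this chapter, which obtains traces by running the slice): it logs one record per MPI_Send, covering volume (bytes), the spatial attribute (source and destination ranks), and a timestamp from which the generation rate can be derived.

    /* Minimal PMPI wrapper (illustrative assumption, not Fact's tracer):
       logs volume, spatial, and temporal attributes of every MPI_Send. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        int src, type_size;
        MPI_Comm_rank(comm, &src);           /* spatial: source rank       */
        MPI_Type_size(datatype, &type_size); /* volume: bytes per element  */
        fprintf(stderr, "SEND t=%.6f src=%d dst=%d bytes=%d\n",
                MPI_Wtime(), src, dest, count * type_size);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* forward */
    }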

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Fast Communication Trace Collection. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_2

  • DOI: https://doi.org/10.1007/978-981-99-4366-1_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4365-4

  • Online ISBN: 978-981-99-4366-1

  • eBook Packages: Computer Science, Computer Science (R0)
