Abstract
In this book, we propose several performance analysis approaches for tasks such as communication analysis and memory monitoring. However, implementing each such analysis requires significant human effort and domain knowledge. To alleviate the burden of implementing specific performance analysis tasks, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance analysis sub-tasks, called passes, which can either be drawn from PerFlow’s built-in analysis library or be implemented by developers to meet their own requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph that represents the performance of a program execution, and we leverage various graph algorithms over it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks, and that it is able to perform analysis and locate performance bugs automatically and effectively.
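The core idea described above, composing an analysis task out of reusable passes that transform a performance graph, can be illustrated with a minimal sketch. This is a hypothetical illustration of the pass-composition concept, not PerFlow’s actual API; the names `Pass`, `run_dataflow`, and the toy graph are assumptions made for this example.

```python
# Hypothetical sketch of composing analysis passes into a dataflow
# (illustrative only; PerFlow's real interface may differ).

class Pass:
    """A named analysis sub-task that transforms a performance graph."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, graph):
        return self.fn(graph)

def run_dataflow(graph, passes):
    """Apply each pass in order, threading the graph through the chain."""
    for p in passes:
        graph = p(graph)
    return graph

# Toy stand-in for a Program Abstraction Graph:
# code regions mapped to measured execution time (seconds).
pag = {"main": 10.0, "mpi_allreduce": 6.0, "compute": 4.0}

# A simple "hotspot" pass keeps only regions above a time threshold.
hotspot = Pass("hotspot", lambda g: {k: v for k, v in g.items() if v >= 5.0})

result = run_dataflow(pag, [hotspot])
print(result)  # {'main': 10.0, 'mpi_allreduce': 6.0}
```

In this style, a built-in pass (such as the hotspot filter here) and a developer-written pass share the same interface, so they can be chained freely into larger dataflow graphs.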
Notes
1. PerFlow is available at https://github.com/thu-pacman/PerFlow.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Domain-Specific Framework for Performance Analysis. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4365-4
Online ISBN: 978-981-99-4366-1
eBook Packages: Computer Science (R0)