Abstract
In this book, we propose several performance analysis approaches for tasks such as communication analysis and memory monitoring. However, implementing each such analysis requires significant human effort and domain knowledge. To alleviate the burden of implementing specific performance analysis tasks, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance analysis sub-tasks, called passes, which can either be drawn from PerFlow’s built-in analysis library or be implemented by developers to meet their own requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph that represents the performance of a program execution, and we leverage various graph algorithms over it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks, and that it is able to perform analysis and locate performance bugs automatically and effectively.
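The core idea described above, composing an analysis task out of reusable passes that transform a performance graph, can be illustrated with a minimal sketch. This is a hypothetical illustration of the pass-composition concept, not PerFlow’s actual API; the names `Pass`, `run_dataflow`, and the toy graph are assumptions made for this example.

```python
# Hypothetical sketch of composing analysis passes into a dataflow
# (illustrative only; PerFlow's real interface may differ).

class Pass:
    """A named analysis sub-task that transforms a performance graph."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, graph):
        return self.fn(graph)

def run_dataflow(graph, passes):
    """Apply each pass in order, threading the graph through the chain."""
    for p in passes:
        graph = p(graph)
    return graph

# Toy stand-in for a Program Abstraction Graph:
# code regions mapped to measured execution time (seconds).
pag = {"main": 10.0, "mpi_allreduce": 6.0, "compute": 4.0}

# A simple "hotspot" pass keeps only regions above a time threshold.
hotspot = Pass("hotspot", lambda g: {k: v for k, v in g.items() if v >= 5.0})

result = run_dataflow(pag, [hotspot])
print(result)  # {'main': 10.0, 'mpi_allreduce': 6.0}
```

In this style, a built-in pass (such as the hotspot filter here) and a developer-written pass share the same interface, so they can be chained freely into larger dataflow graphs.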
Notes
1. PerFlow is available at https://github.com/thu-pacman/PerFlow.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Domain-Specific Framework for Performance Analysis. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4365-4
Online ISBN: 978-981-99-4366-1
eBook Packages: Computer Science (R0)