
Domain-Specific Framework for Performance Analysis

Performance Analysis of Parallel Applications for HPC

Abstract

In this book, we propose several performance analysis approaches, including communication analysis and memory monitoring. However, implementing each such analysis requires significant human effort and domain knowledge. To ease the burden of implementing specific performance analysis tasks, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph whose nodes are performance analysis sub-tasks, called passes; a pass can either be drawn from PerFlow's built-in analysis library or be implemented by developers to meet their own requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution, and we leverage various graph algorithms on it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks and is able to locate performance bugs automatically and effectively.
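To make the pass/dataflow idea above concrete, the following is a minimal sketch of a pass-based analysis pipeline. The class names, pass names, and toy vertex data are all hypothetical illustrations, not PerFlow's actual API: it shows only the general pattern of composing analysis sub-tasks into a dataflow over vertices of a program-abstraction-style graph.

```python
# Illustrative sketch of a pass-based dataflow pipeline (hypothetical API,
# not PerFlow's). Each pass transforms a set of graph vertices; a pipeline
# chains passes so that the output of one becomes the input of the next.

class Pass:
    """One performance-analysis sub-task over a set of graph vertices."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, vertices):
        return self.fn(vertices)

class Pipeline:
    """A linear dataflow of passes applied in sequence."""
    def __init__(self, passes):
        self.passes = passes

    def run(self, vertices):
        for p in self.passes:
            vertices = p.run(vertices)
        return vertices

# Toy vertices standing in for a Program Abstraction Graph:
# (function name, measured time, process rank).
vertices = [
    {"func": "compute",  "time": 9.0, "rank": 0},
    {"func": "compute",  "time": 3.0, "rank": 1},
    {"func": "mpi_wait", "time": 6.0, "rank": 1},
]

# Two illustrative passes: keep hotspots above a threshold, then rank them.
hotspot = Pass("hotspot", lambda vs: [v for v in vs if v["time"] > 5.0])
rank_by_time = Pass("rank", lambda vs: sorted(vs, key=lambda v: -v["time"]))

report = Pipeline([hotspot, rank_by_time]).run(vertices)
print([v["func"] for v in report])  # ['compute', 'mpi_wait']
```

A real framework in this style would replace the toy filter and sort with passes such as communication analysis or load-imbalance detection, and would run them over an actual graph representation of the execution rather than a flat vertex list.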


Notes

  1. PerFlow is available at https://github.com/thu-pacman/PerFlow.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Domain-Specific Framework for Performance Analysis. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_9

  • Print ISBN: 978-981-99-4365-4

  • Online ISBN: 978-981-99-4366-1
