Graph Analysis for Scalability Analysis

Abstract

Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl’s law, and resource contention. Performance analysis tools for finding scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependences needed for root cause analysis; tracing captures them, but at prohibitive overhead. In this chapter, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds: it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a program structure graph, which records the main computation and communication patterns as well as the program’s control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a program performance graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root causes of scaling loss. We evaluate ScalAna with real applications. The results show that our approach effectively locates the root causes of scaling loss while incurring only 1.73% overhead on average for up to 2,048 processes, and that fixing the root causes detected by ScalAna yields up to 11.11% performance improvement. (© 2020 IEEE. Reproduced, with permission, from Yuyang Jin, et al., ScalAna: Automating scaling loss detection with graph analysis, SC’20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.)
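
To make the backtracking step concrete, the short sketch below shows one way a root-cause pass over a program performance graph could look. It is a minimal illustration under stated assumptions, not ScalAna’s implementation: the Node layout, the scaling-ratio anomaly test, and the 1.5 threshold are invented for this example, and ScalAna’s actual graph is built with LLVM-based static analysis and uses more refined detection logic.

    # Minimal sketch of backtracking root cause detection on a program
    # performance graph. The Node layout, the scaling-ratio test, and the
    # 1.5 threshold are illustrative assumptions, not ScalAna's actual
    # data structures or detection logic.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass(eq=False)  # identity-based hashing so nodes can go in sets
    class Node:
        name: str                  # a program location, e.g. a loop or MPI call site
        time: Dict[int, float]     # process count -> measured time in seconds
        preds: List["Node"] = field(default_factory=list)  # incoming dependence edges

    def scaling_ratio(node: Node, small: int, large: int) -> float:
        """How much a region's time grew between two process counts.
        A well-scaling region stays near (or below) 1."""
        return node.time[large] / node.time[small]

    def backtrack_root_causes(anomalous: List[Node], small: int, large: int,
                              threshold: float = 1.5) -> set:
        """Walk dependence edges backwards from each poorly scaling vertex;
        report the earliest anomalous ancestors as root-cause candidates."""
        roots = set()
        for node in anomalous:
            stack, seen = [node], set()
            while stack:
                cur = stack.pop()
                if cur in seen:
                    continue
                seen.add(cur)
                bad_preds = [p for p in cur.preds
                             if scaling_ratio(p, small, large) > threshold]
                if bad_preds:
                    stack.extend(bad_preds)  # the loss propagates from earlier code
                else:
                    roots.add(cur)           # no anomalous predecessor: candidate root
        return roots

    # Toy graph: an imbalanced compute loop feeds an MPI_Allreduce. The
    # collective's wait time explodes at scale, but backtracking blames
    # the loop that causes the imbalance.
    loop = Node("loop@solver.c:42", {64: 10.0, 1024: 21.0})
    allreduce = Node("MPI_Allreduce@solver.c:57", {64: 1.0, 1024: 9.0}, preds=[loop])
    print([n.name for n in backtrack_root_causes([allreduce], 64, 1024)])
    # -> ['loop@solver.c:42']

In this toy run the collective’s time is what grows worst at scale, yet the pass walks the dependence edge backwards and reports the imbalanced loop as the root-cause candidate, mirroring the behavior described above.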

References

  1. TOP500 website (2020). http://top500.org/

  2. Shi, J. Y., et al. (2012). Program scalability analysis for HPC cloud: Applying Amdahl’s law to NAS benchmarks. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (pp. 1215–1225). IEEE.

  3. Liu, X., & Wu, B. (2015). ScaAnalyzer: A tool to identify memory scalability bottlenecks in parallel programs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (p. 47). ACM.

  4. Pearce, O., et al. (2019). Exploring dynamic load imbalance solutions with the CoMD proxy application. Future Generation Computer Systems, 92, 920–932.

  5. Schmidl, D., Müller, M. S., & Bischof, C. (2016). OpenMP scalability limits on large SMPs and how to extend them. Technical Report. Fachgruppe Informatik.

  6. Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.

  7. Tallent, N. R., Adhianto, L., & Mellor-Crummey, J. M. (2010). Scalable identification of load imbalance in parallel executions using call path profiles. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1–11). IEEE Computer Society.

  8. Tallent, N. R., et al. (2009). Diagnosing performance bottlenecks in emerging petascale applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (pp. 1–11). IEEE.

  9. Intel Trace Analyzer and Collector. https://software.intel.com/en-us/trace-analyzer

  10. Zhai, J., Chen, W., & Zheng, W. (2010). PHANTOM: Predicting performance of parallel applications on large-scale parallel machines using a single node. In PPoPP.

  11. Geimer, M., et al. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6), 702–719.

  12. Linford, J. C., et al. (2017). Performance analysis of OpenSHMEM applications with TAU Commander. In Workshop on OpenSHMEM and Related Technologies (pp. 161–179). Springer.

  13. Adhianto, L., et al. (2010). HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6), 685–701.

  14. Yin, H., et al. (2016). Discovering interpretable geo-social communities for user behavior prediction. In 2016 IEEE 32nd International Conference on Data Engineering (pp. 942–953). IEEE.

  15. Yin, H., et al. (2016). Joint modeling of user check-in behaviors for real-time point-of-interest recommendation. ACM Transactions on Information Systems, 35(2), 11.

  16. Bhattacharyya, A., Kwasniewski, G., & Hoefler, T. (2014). Using compiler techniques to improve automatic performance modeling. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation. San Francisco, CA, USA: ACM.

  17. Calotoiu, A., et al. (2013). Using automated performance modeling to find scalability bugs in complex codes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (p. 45). ACM.

  18. Wolf, F., et al. (2016). Automatic performance modeling of HPC applications. In Software for Exascale Computing-SPPEXA 2013–2015 (pp. 445–465). Springer.

  19. Beckingsale, D., et al. (2017). Apollo: Reusable models for fast, dynamic tuning of input-dependent code. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 307–316). IEEE.

  20. Linford, J. C., et al. (2009). Multi-core acceleration of chemical kinetics for simulation and prediction. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (pp. 1–11).

  21. Calotoiu, A., et al. (2016). Fast multi-parameter performance modeling. In 2016 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 172–181). IEEE.

  22. Jin, Y., et al. (2020). ScalAna: Automating scaling loss detection with graph analysis. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’20) (pp. 1–14).

  23. The LLVM Compiler Framework. http://llvm.org

  24. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.

  25. Nagel, W. E., et al. (1996). VAMPIR: Visualization and analysis of MPI resources.

  26. PAPI tools. http://icl.utk.edu/papi/software/

  27. Wu, X., & Mueller, F. (2011). ScalaExtrap: Trace-based communication extrapolation for SPMD programs. ACM SIGPLAN Notices, 46(8), 113–122. ACM.

  28. Noeth, M., et al. (2009). ScalaTrace: Scalable compression and replay of communication traces for high-performance computing. Journal of Parallel and Distributed Computing, 69(8), 696–710.

  29. Vetter, J. (2002). Dynamic statistical profiling of communication activity in distributed applications. ACM SIGMETRICS Performance Evaluation Review, 30(1), 240–250.

  30. Mohr, B. (2011). PMPI tools. In D. Padua (Ed.), Encyclopedia of parallel computing (pp. 1570–1575). Boston, MA: Springer US. ISBN: 978-0-387-09766-4. https://doi.org/10.1007/978-0-387-09766-4_57

  31. Barnes, B. J., et al. (2008). A regression-based approach to scalability prediction. In Proceedings of the 22nd annual international conference on Supercomputing (pp. 368–377). ACM.

  32. Tang, X., et al. (2018). vSensor: Leveraging fixed-workload snippets of programs for performance variance detection. ACM SIGPLAN Notices, 53(1), 124–136. ACM.

  33. Terpstra, D., et al. (2010). Collecting performance data with PAPI-C. In Tools for High Performance Computing 2009 (pp. 157–173). Springer.

  34. Hayes, J. C., et al. (2006). Simulating radiating and magnetized flows in multiple dimensions with ZEUS-MP. The Astrophysical Journal Supplement Series, 165(1), 188.

  35. Rodrigues, A. F., et al. (2011). The structural simulation toolkit. ACM SIGMETRICS Performance Evaluation Review, 38(4), 37–42.

  36. Fischer, P. F., Lottes, J. W., & Kerkemeier, S. G. (2008). nek5000 Web page.

  37. Mohr, B. (2014). Scalable parallel performance measurement and analysis tools-state-of-the-art and future challenges. Supercomputing Frontiers and Innovations, 1(2), 108–123.

  38. Knobloch, M., & Mohr, B. (2020). Tools for GPU computing–debugging and performance analysis of heterogenous HPC applications. Supercomputing Frontiers and Innovations, 7(1), 91–111.

  39. Biersdorf, S., et al. (2011). Score-P: A unified performance measurement system for petascale applications. In Competence in High Performance Computing 2010 (pp. 85–97). Springer.

  40. Score-P homepage. Score-P Consortium. http://www.score-p.org

  41. Shende, S. S., & Malony, A. D. (2006). The TAU parallel performance system. The International Journal of High Performance Computing Applications, 20(2), 287–311.

  42. TAU homepage. University of Oregon. http://tau.uoregon.edu

  43. Müller, M. S., et al. (2007). Developing scalable applications with Vampir, VampirServer and VampirTrace. In PARCO (Vol. 15, pp. 637–644). Citeseer.

  44. Knüpfer, A., et al. (2008). The Vampir performance analysis tool-set. In Tools for high performance computing (pp. 139–155). Springer.

  45. Vampir homepage. Technical University Dresden. http://www.vampir.eu

  46. Scalasca homepage. Julich Supercomputing Centre and German Research School for Simulation Sciences. http://www.scalasca.org

  47. Labarta, J., et al. (1996). DiP: A parallel program development environment. In European Conference on Parallel Processing (pp. 665–674). Springer.

  48. Servat, H., et al. (2009). Detailed performance analysis using coarse grain sampling. In European Conference on Parallel Processing (pp. 185–198). Springer.

  49. Paraver homepage. Barcelona Supercomputing Center. http://www.bsc.es/paraver

  50. Becker, D., et al. (2007). Automatic trace-based performance analysis of metacomputing applications. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.

  51. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.

  52. Krishnamoorthy, S., & Agarwal, K. (2010). Scalable communication trace compression. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (pp. 408–417). IEEE Computer Society.

  53. Knüpfer, A., & Nagel, W. E. (2005). Construction and compression of complete call graphs for post-mortem program trace analysis. In 2005 International Conference on Parallel Processing (pp. 165–172). IEEE.

  54. Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.

  55. January, C., et al. (2015). Allinea MAP: Adding energy and OpenMP profiling without increasing overhead. In Tools for High Performance Computing 2014 (pp. 25–35). Springer.

  56. Kaufmann, S., & Homer, B. (2003). CrayPat - Cray X1 performance analysis tool. In Cray User Group (May 2003).

  57. Wang, H., et al. (2018). Spindle: Informed memory access monitoring. In 2018 USENIX Annual Technical Conference (pp. 561–574).

  58. Weber, M., et al. (2016). Structural clustering: A new approach to support performance analysis at scale. In 2016 IEEE International Parallel and Distributed Processing Symposium (pp. 484–493). IEEE.

  59. Laguna, I., et al. (2015). Debugging high-performance computing applications at massive scales. Communications of the ACM, 58(9), 72–81.

  60. Zhou, B., Kulkarni, M., & Bagchi, S. (2011). Vrisha: Using scaling properties of parallel programs for bug detection and localization. In Proceedings of the 20th International Symposium on High Performance Distributed Computing (pp. 85–96). ACM.

  61. Laguna, I., et al. (2011). Large scale debugging of parallel tasks with AutomaDeD. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 50). ACM.

  62. Mitra, S., et al. (2014). Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices, 49(6), 193–203. ACM.

  63. Coarfa, C., et al. (2007). Scalability analysis of SPMD codes using expectations. In Proceedings of the 21st Annual International Conference on Supercomputing (pp. 13–22). ACM.

  64. Böhme, D., et al. (2010). Identifying the root causes of wait states in large-scale parallel applications. In 2010 39th International Conference on Parallel Processing (pp. 90–100). IEEE.

  65. Chen, J., & Clapp, R. M. (2015). Critical-path candidates: Scalable performance modeling for MPI workloads. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (pp. 1–10). IEEE.

  66. Yu, T., et al. (2020). COLAB: A collaborative multi-factor scheduler for asymmetric multicore processors. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (pp. 268–279).

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Graph Analysis for Scalability Analysis. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_5

  • DOI: https://doi.org/10.1007/978-981-99-4366-1_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4365-4

  • Online ISBN: 978-981-99-4366-1
