Production-Run Noise Detection

Chapter in Performance Analysis of Parallel Applications for HPC

Abstract

The performance variance detection approach in Chap. 7 relies on nontrivial source code analysis that is impractical for production-run parallel applications. In this chapter, we further propose Vapro, a performance variance detection and diagnosis framework for production-run parallel applications. Our approach is based on an important observation: most parallel applications contain code snippets that are repeatedly executed with a fixed workload, and these snippets can be used for performance variance detection. To identify these snippets efficiently at runtime, even without program source code, we introduce the state transition graph (STG) to track program execution and then conduct lightweight workload analysis on the STG to locate variance. To diagnose the detected variance, Vapro employs a progressive diagnosis method based on a hybrid model that combines variance breakdown and statistical analysis. Results show that the performance overhead of Vapro is only 1.38% on average. Vapro can detect variance in real applications caused by hardware bugs, memory, and I/O. After fixing the detected variance, the standard deviation of the execution time is reduced by up to 73.5%. Compared with the state-of-the-art variance detection tool based on source code analysis, Vapro achieves 30.0% higher detection coverage.
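
The sketch below is a minimal, hypothetical illustration of the detection idea summarized above, not Vapro's actual implementation: executions of the same code snippet (a transition in the STG) are grouped by workload, and an execution whose time deviates sharply from the typical time of its group is reported as variance. The record format, the exact-workload grouping, and the time_factor threshold are assumptions made purely for illustration; a production tool would likely cluster near-equal workloads and obtain the workload metric from hardware counters rather than require exact matches.

```python
from collections import defaultdict
from statistics import median

def detect_variance(records, time_factor=1.5):
    """Flag snippet executions that are slow relative to peers with the same workload.

    records: iterable of (transition, workload, elapsed) tuples, where
      - transition identifies a code snippet between two STG states,
      - workload is a lightweight work metric for that execution
        (e.g., an instruction count),
      - elapsed is the measured execution time of that instance.
    """
    # Group executions of the same snippet that performed the same amount of work.
    groups = defaultdict(list)
    for transition, workload, elapsed in records:
        groups[(transition, workload)].append(elapsed)

    anomalies = []
    for (transition, workload), times in groups.items():
        baseline = median(times)  # typical time for this fixed-workload snippet
        for elapsed in times:
            # A fixed-workload execution that is much slower than its peers
            # indicates performance variance (noise) rather than extra work.
            if elapsed > time_factor * baseline:
                anomalies.append((transition, workload, elapsed, baseline))
    return anomalies
```

For example, if one execution of a snippet takes 5 ms while peers with the same instruction count take about 2 ms, it is reported together with the transition that locates it in the STG.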


Notes

  1. It is the cgitmax loop in cg.f, lines 1170–1360.

  2. In this work, the computation noise is generated by executing stress [18] on the same CPU cores as the applications, and the memory noise is generated by executing stream [19] on the idle cores.

  3. On Intel Ivy Bridge CPUs, the time fraction of the frontend-bound category equals IDQ_UOPS_NOT_DELIVERED.CORE / (4 * CPU_CLK_UNHALTED.THREAD); see the sketch after these notes.

  4. The event name is CYCLE_ACTIVITY.STALLS_L2_MISS.
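
As a concrete companion to Note 3, the following minimal sketch computes the top-down frontend-bound fraction from the two counters named there. The counter readings in the usage line are hypothetical; in practice they would be obtained through a counter interface such as PAPI or Linux perf.

```python
def frontend_bound_fraction(idq_uops_not_delivered_core, cpu_clk_unhalted_thread):
    """Top-down frontend-bound fraction for a 4-wide Intel core (e.g., Ivy Bridge).

    Each unhalted core cycle offers 4 uop issue slots; slots left empty because
    the frontend delivered no uop are counted by IDQ_UOPS_NOT_DELIVERED.CORE.
    """
    total_slots = 4 * cpu_clk_unhalted_thread
    return idq_uops_not_delivered_core / total_slots

# Hypothetical counter readings for one measurement interval: 15% frontend bound.
print(frontend_bound_fraction(idq_uops_not_delivered_core=1_200_000,
                              cpu_clk_unhalted_thread=2_000_000))  # prints 0.15
```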

References

  1. Ballani, H., et al. (2011). Towards predictable datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference (pp. 242–253).

  2. Ferreira, K. B., et al. (2013). The impact of system design parameters on application noise sensitivity. Cluster Computing, 16(1), 117–129.

  3. Mondragon, O. H., et al. (2016). Understanding performance interference in next-generation HPC systems. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 384–395). IEEE.

  4. Schad, J., Dittrich, J., & Quiané-Ruiz, J.-A. (2010). Runtime measurements in the cloud: Observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment, 3(1–2), 460–471.

  5. Schwarzkopf, M., Murray, D. G., & Hand, S. (2012). The seven deadly sins of cloud computing research. In 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’12).

  6. Maricq, A., et al. (2018). Taming performance variability. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18) (pp. 409–425).

  7. Beckman, P., et al. (2006). The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing (pp. 1–12). IEEE.

  8. Hoefler, T., Schneider, T., & Lumsdaine, A. (2010). Characterizing the influence of system noise on large-scale applications by simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. SC’10 (pp. 1–11).

  9. Tang, X., et al. (2018). vSensor: Leveraging fixed-workload snippets of programs for performance variance detection. In ACM SIGPLAN Notices (Vol. 53, No. 1, pp. 124–136). ACM.

  10. McCalpin, J. D. (2018). HPL and DGEMM performance variability on the Xeon platinum 8160 processor. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 225–237). IEEE.

  11. Gong, Y., He, B., & Li, D. (2014). Finding constant from change: Revisiting network performance aware optimizations on IaaS clouds. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 982–993). IEEE.

  12. Gunawi, H. S., et al. (2018). Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS), 14(3), 23.

  13. Sherwood, T., Sair, S., & Calder, B. (2003). Phase tracking and prediction. ACM SIGARCH Computer Architecture News, 31(2), 336–349. ACM.

  14. Sherwood, T., Perelman, E., & Calder, B. (2001). Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT’01) (pp. 3–14). IEEE.

  15. Aho, A. V., et al. (2006). Compilers: Principles, techniques, and tools (2nd Ed.). USA: Addison-Wesley Longman Publishing. ISBN: 978-0-321-48681-3.

  16. Zheng, L., et al. (2022). Vapro: Performance variance detection and diagnosis for production-run parallel applications. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 150–162). https://doi.org/10.1145/3503221.3508411

  17. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.

  18. Stress. https://packages.debian.org/buster/stress

  19. McCalpin, J. (2018). Memory Bandwidth: STREAM Benchmark Performance Results. https://www.cs.virginia.edu/stream/ (Visited on March 20, 2018).

  20. Vetter, J. (2002). Dynamic statistical profiling of communication activity in distributed applications. ACM SIGMETRICS Performance Evaluation Review, 30(1), 240–250.

  21. Yu, T., et al. (2019). Large-scale automatic K-means clustering for heterogeneous many-core supercomputer. IEEE Transactions on Parallel and Distributed Systems (TPDS), 31, 997–1008.

  22. Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data: Recent advances in clustering, J. Kogan, C. Nicholas, & M. Teboulle (Eds.). Berlin, Heidelberg: Springer. ISBN: 978-3-540-28349-2.

  23. Inaba, M., Katoh, N., & Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry (pp. 332–339).

  24. Weaver, V. M., Terpstra, D., & Moore, S. (2013). Non-determinism and overcount on modern hardware performance counter implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 215–224). IEEE.

  25. Yasin, A. (2014). A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14) (pp. 35–44). IEEE.

  26. Farrar, D. E., & Glauber, R. R. (1967). Multicollinearity in regression analysis: The problem revisited. The Review of Economics and Statistics, 49, 92–107.

  27. Bernat, A. R., & Miller, B. P. (2011). Anywhere, any-time binary instrumentation. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools (pp. 9–16).

  28. Goodman, J., et al. (1988). Stability of binary exponential backoff. Journal of the ACM (JACM), 35(3), 579–602.

  29. Brim, M. J., et al. (2010). MRNet: A scalable infrastructure for the development of parallel tools and applications. In Cray user group.

  30. The cuBERT framework. https://github.com/zhihu/cuBERT

  31. The parallel PageRank program. https://github.com/nikos912000/parallel-pagerank

  32. The MapReduce framework. https://github.com/sysprog21/mapreduce

  33. Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1), 155–177.

  34. Kay, J. E., et al. (2015). The community earth system model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society, 96(8), 1333–1349.

  35. Bienia, C., et al. (2008). The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT’08) (pp. 72–81).

  36. Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 410–420).

  37. Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.

  38. Dongarra, J. J., Luszczek, P., & Petitet, A. (2003). The LINPACK benchmark: Past, present and future. Concurrency and Computation: Practice and Experience, 15(9), 803–820.

  39. Intel. Addressing Potential DGEMM/HPL Perf Variability on 24-Core Intel Xeon Processor Scalable Family. White Paper, Number 606269, Revision 1.0 (2018).

  40. De Melo, A. C. (2010). The new Linux perf tools. In Slides from Linux Kongress (Vol. 18, pp. 1–42).

  41. The Nekbone program. https://github.com/Nek5000/Nekbone

  42. Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21), 2688–2690.

  43. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.

  44. Wylie, B. J. N., Geimer, M., & Wolf, F. (2008). Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Scientific Programming, 16(2–3), 167–181.

  45. Jones, T. R., Brenner, L. B., & Fier, J. M. (2003). Impacts of operating systems on the scalability of parallel applications. In Lawrence Livermore National Laboratory, Technical Report UCRL-MI-202629.

  46. Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. SC’03. Phoenix, AZ, USA: ACM.

  47. Shah, A., Müller, M., & Wolf, F. (2018). Estimating the impact of external interference on application performance. In European Conference on Parallel Processing (pp. 46–58). Springer.

  48. Panda, B., et al. (2019). IASO: A fail-slow detection and mitigation framework for distributed storage services. In 2019 USENIX Annual Technical Conference (USENIX ATC’19) (pp. 47–62).

  49. Attariyan, M., Chow, M., & Flinn, J. (2012). X-ray: Automating root-cause diagnosis of performance anomalies in production software. In Presented as Part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12) (pp. 307–320).

  50. Dean, D. J., Nguyen, H., & Gu, X. (2012). Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. In Proceedings of the 9th International Conference on Autonomic Computing (pp. 191–200).

  51. Zhang, W., et al. (2016). Varcatcher: A framework for tackling performance variability of parallel workloads on multi-core. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(4), 1215–1228.

  52. Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.

  53. Laguna, I., et al. (2015). Diagnosis of performance faults in LargeScale MPI applications via probabilistic progress-dependence inference. IEEE Transactions on Parallel and Distributed Systems, 26(5), 1280–1289.

  54. Su, P., et al. (2019). Pinpointing performance inefficiencies via lightweight variance profiling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19) (pp. 1–19).

  55. Dean, D. J., et al. (2014). PerfScope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing (pp. 1–13).

  56. Sahoo, S. K., et al. (2013). Using likely invariants for automated software fault localization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 139–152).

  57. Dai, T., et al. (2018). Hytrace: A hybrid approach to performance bug diagnosis in production cloud infrastructures. IEEE Transactions on Parallel and Distributed Systems (TPDS), 30(1), 107–118.

  58. Mitra, S., et al. (2014). Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices, 49(6), 193–203. ACM.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Production-Run Noise Detection. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_8

  • DOI: https://doi.org/10.1007/978-981-99-4366-1_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4365-4

  • Online ISBN: 978-981-99-4366-1

  • eBook Packages: Computer Science, Computer Science (R0)
