Production-Run Noise Detection

Chapter in Performance Analysis of Parallel Applications for HPC

Abstract

The performance variance detection approach in Chap. 7 relies on nontrivial source code analysis that is impractical for production-run parallel applications. In this chapter, we further propose Vapro, a performance variance detection and diagnosis framework for production-run parallel applications. Our approach is based on an important observation: most parallel applications contain code snippets that are repeatedly executed with a fixed workload, and these snippets can be used for performance variance detection. To identify these snippets efficiently at runtime, even without program source code, we introduce the state transition graph (STG) to track program execution and then conduct lightweight workload analysis on the STG to locate variance. To diagnose the detected variance, Vapro employs a progressive diagnosis method based on a hybrid model that combines variance breakdown and statistical analysis. Results show that the performance overhead of Vapro is only 1.38% on average. Vapro can detect variance in real applications caused by hardware bugs, memory, and I/O. After fixing the detected variance, the standard deviation of the execution time is reduced by up to 73.5%. Compared with the state-of-the-art variance detection tool based on source code analysis, Vapro achieves 30.0% higher detection coverage.
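
The sketch below is a minimal, hypothetical illustration of the detection idea summarized above, not Vapro's actual implementation: executions of the same code snippet (a transition in the STG) are grouped by workload, and an execution whose time deviates sharply from the typical time of its group is reported as variance. The record format, the exact-workload grouping, and the time_factor threshold are assumptions made purely for illustration; a production tool would likely cluster near-equal workloads and obtain the workload metric from hardware counters rather than require exact matches.

```python
from collections import defaultdict
from statistics import median

def detect_variance(records, time_factor=1.5):
    """Flag snippet executions that are slow relative to peers with the same workload.

    records: iterable of (transition, workload, elapsed) tuples, where
      - transition identifies a code snippet between two STG states,
      - workload is a lightweight work metric for that execution
        (e.g., an instruction count),
      - elapsed is the measured execution time of that instance.
    """
    # Group executions of the same snippet that performed the same amount of work.
    groups = defaultdict(list)
    for transition, workload, elapsed in records:
        groups[(transition, workload)].append(elapsed)

    anomalies = []
    for (transition, workload), times in groups.items():
        baseline = median(times)  # typical time for this fixed-workload snippet
        for elapsed in times:
            # A fixed-workload execution that is much slower than its peers
            # indicates performance variance (noise) rather than extra work.
            if elapsed > time_factor * baseline:
                anomalies.append((transition, workload, elapsed, baseline))
    return anomalies
```

For example, if one execution of a snippet takes 5 ms while peers with the same instruction count take about 2 ms, it is reported together with the transition that locates it in the STG.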


Notes

  1. It is the cgitmax loop in cg.f, lines 1170–1360.

  2. In this work, the computation noise is generated by executing stress [18] on the same CPU cores as the applications, and the memory noise is generated by executing stream [19] on the idle cores.

  3. On Intel Ivy Bridge CPUs, the time fraction of the frontend-bound category equals IDQ_UOPS_NOT_DELIVERED.CORE / (4 * CPU_CLK_UNHALTED.THREAD); see the sketch after these notes.

  4. The event name is CYCLE_ACTIVITY.STALLS_L2_MISS.
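
As a concrete companion to Note 3, the following minimal sketch computes the top-down frontend-bound fraction from the two counters named there. The counter readings in the usage line are hypothetical; in practice they would be obtained through a counter interface such as PAPI or Linux perf.

```python
def frontend_bound_fraction(idq_uops_not_delivered_core, cpu_clk_unhalted_thread):
    """Top-down frontend-bound fraction for a 4-wide Intel core (e.g., Ivy Bridge).

    Each unhalted core cycle offers 4 uop issue slots; slots left empty because
    the frontend delivered no uop are counted by IDQ_UOPS_NOT_DELIVERED.CORE.
    """
    total_slots = 4 * cpu_clk_unhalted_thread
    return idq_uops_not_delivered_core / total_slots

# Hypothetical counter readings for one measurement interval: 15% frontend bound.
print(frontend_bound_fraction(idq_uops_not_delivered_core=1_200_000,
                              cpu_clk_unhalted_thread=2_000_000))  # prints 0.15
```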

References

  1. Ballani, H., et al. (2011). Towards predictable datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference (pp. 242–253).

  2. Ferreira, K. B., et al. (2013). The impact of system design parameters on application noise sensitivity. Cluster Computing, 16(1), 117–129.

  3. Mondragon, O. H., et al. (2016). Understanding performance interference in next-generation HPC systems. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 384–395). IEEE.

  4. Schad, J., Dittrich, J., & Quiané-Ruiz, J.-A. (2010). Runtime measurements in the cloud: Observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment, 3(1–2), 460–471.

  5. Schwarzkopf, M., Murray, D. G., & Hand, S. (2012). The seven deadly sins of cloud computing research. In 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’12).

  6. Maricq, A., et al. (2018). Taming performance variability. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18) (pp. 409–425).

  7. Beckman, P., et al. (2006). The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing (pp. 1–12). IEEE.

  8. Hoefler, T., Schneider, T., & Lumsdaine, A. (2010). Characterizing the influence of system noise on large-scale applications by simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. SC’10 (pp. 1–11).

  9. Tang, X., et al. (2018). vSensor: Leveraging fixed-workload snippets of programs for performance variance detection. In ACM SIGPLAN Notices (Vol. 53, No. 1, pp. 124–136). ACM.

  10. McCalpin, J. D. (2018). HPL and DGEMM performance variability on the Xeon platinum 8160 processor. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 225–237). IEEE.

  11. Gong, Y., He, B., & Li, D. (2014). Finding constant from change: Revisiting network performance aware optimizations on IaaS clouds. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 982–993). IEEE.

  12. Gunawi, H. S., et al. (2018). Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS), 14(3), 23.

  13. Sherwood, T., Sair, S., & Calder, B. (2003). Phase tracking and prediction. ACM SIGARCH Computer Architecture News, 31(2), 336–349. ACM.

  14. Sherwood, T., Perelman, E., & Calder, B. (2001). Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT’01) (pp. 3–14). IEEE.

  15. Aho, A. V., et al. (2006). Compilers: Principles, techniques, and tools (2nd Ed.). USA: Addison-Wesley Longman Publishing. ISBN: 978-0-321-48681-3.

  16. Zheng, L., et al. (2022). Vapro: Performance variance detection and diagnosis for production-run parallel applications. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 150–162). https://doi.org/10.1145/3503221.3508411

  17. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.

  18. Stress. https://packages.debian.org/buster/stress

  19. McCalpin, J. (2018). Memory Bandwidth: STREAM Benchmark Performance Results. https://www.cs.virginia.edu/stream/ (Visited on March 20, 2018).

  20. Vetter, J. (2002). Dynamic statistical profiling of communication activity in distributed applications. ACM SIGMETRICS Performance Evaluation Review, 30(1), 240–250.

  21. Yu, T., et al. (2019). Large-scale automatic K-means clustering for heterogeneous many-core supercomputer. IEEE Transactions on Parallel and Distributed Systems (TPDS), 31, 997–1008.

  22. Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data: Recent advances in clustering, J. Kogan, C. Nicholas, & M. Teboulle (Eds.). Berlin, Heidelberg: Springer. ISBN: 978-3-540-28349-2.

  23. Inaba, M., Katoh, N., & Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry (pp. 332–339).

  24. Weaver, V. M., Terpstra, D., & Moore, S. (2013). Non-determinism and overcount on modern hardware performance counter implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 215–224). IEEE.

  25. Yasin, A. (2014). A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14) (pp. 35–44). IEEE.

  26. Farrar, D. E., & Glauber, R. R. (1967). Multicollinearity in regression analysis: The problem revisited. The Review of Economics and Statistics, 49, 92–107.

  27. Bernat, A. R., & Miller, B. P. (2011). Anywhere, any-time binary instrumentation. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools (pp. 9–16).

  28. Goodman, J., et al. (1988). Stability of binary exponential backoff. Journal of the ACM (JACM), 35(3), 579–602.

  29. Brim, M. J., et al. (2010). MRNet: A scalable infrastructure for the development of parallel tools and applications. In Cray user group.

  30. The cuBERT framework. https://github.com/zhihu/cuBERT

  31. The parallel PageRank program. https://github.com/nikos912000/parallel-pagerank

  32. The MapReduce framework. https://github.com/sysprog21/mapreduce

  33. Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1), 155–177.

  34. Kay, J. E., et al. (2015). The community earth system model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society, 96(8), 1333–1349.

  35. Bienia, C., et al. (2008). The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT’08) (pp. 72–81).

  36. Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 410–420).

  37. Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.

  38. Dongarra, J. J., Luszczek, P., & Petitet, A. (2003). The LINPACK benchmark: Past, present and future. Concurrency and Computation: Practice and Experience, 15(9), 803–820.

  39. Intel. Addressing Potential DGEMM/HPL Perf Variability on 24-Core Intel Xeon Processor Scalable Family. White Paper, Number 606269, Revision 1.0 (2018).

  40. De Melo, A. C. (2010). The new Linux perf tools. In Slides from Linux Kongress (Vol. 18, pp. 1–42).

  41. The Nekbone program. https://github.com/Nek5000/Nekbone

  42. Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21), 2688–2690.

  43. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.

  44. Wylie, B. J. N., Geimer, M., & Wolf, F. (2008). Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Scientific Programming, 16(2–3), 167–181.

  45. Jones, T. R., Brenner, L. B., & Fier, J. M. (2003). Impacts of operating systems on the scalability of parallel applications. In Lawrence Livermore National Laboratory, Technical Report UCRL-MI-202629.

  46. Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. SC’03. Phoenix, AZ, USA: ACM.

  47. Shah, A., Müller, M., & Wolf, F. (2018). Estimating the impact of external interference on application performance. In European Conference on Parallel Processing (pp. 46–58). Springer.

  48. Panda, B., et al. (2019). IASO: A fail-slow detection and mitigation framework for distributed storage services. In 2019 USENIX Annual Technical Conference (USENIX ATC’19) (pp. 47–62).

  49. Attariyan, M., Chow, M., & Flinn, J. (2012). X-ray: Automating root-cause diagnosis of performance anomalies in production software. In Presented as Part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12) (pp. 307–320).

  50. Dean, D. J., Nguyen, H., & Gu, X. (2012). Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. In Proceedings of the 9th International Conference on Autonomic Computing (pp. 191–200).

  51. Zhang, W., et al. (2016). Varcatcher: A framework for tackling performance variability of parallel workloads on multi-core. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(4), 1215–1228.

  52. Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.

  53. Laguna, I., et al. (2015). Diagnosis of performance faults in LargeScale MPI applications via probabilistic progress-dependence inference. IEEE Transactions on Parallel and Distributed Systems, 26(5), 1280–1289.

  54. Su, P., et al. (2019). Pinpointing performance inefficiencies via lightweight variance profiling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19) (pp. 1–19).

  55. Dean, D. J., et al. (2014). PerfScope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing (pp. 1–13).

  56. Sahoo, S. K., et al. (2013). Using likely invariants for automated software fault localization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 139–152).

  57. Dai, T., et al. (2018). Hytrace: A hybrid approach to performance bug diagnosis in production cloud infrastructures. IEEE Transactions on Parallel and Distributed Systems (TPDS), 30(1), 107–118.

  58. Mitra, S., et al. (2014). Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices, 49(6), 193–203. ACM.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Production-Run Noise Detection. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_8

  • DOI: https://doi.org/10.1007/978-981-99-4366-1_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4365-4

  • Online ISBN: 978-981-99-4366-1

  • eBook Packages: Computer Science, Computer Science (R0)
