Abstract
Performance variance in parallel and distributed systems is becoming increasingly severe: even with a fixed number of computing nodes, the runtimes of different executions of the same program can vary greatly, and many HPC applications on supercomputers exhibit such variance. Efficient online detection of performance variance remains an open problem in HPC research. To address it, we propose vSensor, an approach for detecting the performance variance of systems. The key finding of this study is that a program's own source code can characterize runtime performance better than an external detector. Specifically, many HPC applications contain code snippets with fixed-workload execution patterns, e.g., a workload of invariant size or one that grows linearly. This observation allows us to automatically identify such workload-related snippets and use them to detect performance variance. We evaluated vSensor on the Tianhe-2A system with a large number of parallel applications; the results show that it efficiently identifies variations in system performance, with an average overhead below 6% at 4,096 processes for fixed-workload v-sensors. Using vSensor, we also identified a problematic node on Tianhe-2A whose slow memory and network issues degraded programs' performance by 21% and 3.37×, respectively. (© 2022 IEEE. Reproduced, with permission, from Jidong Zhai et al., Leveraging code snippets to detect variations in the performance of HPC systems, IEEE Transactions on Parallel and Distributed Systems, 2022.)
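The core idea of fixed-workload v-sensors can be illustrated with a minimal sketch. This is not vSensor's actual implementation (which identifies snippets via compiler analysis and instruments them at runtime); the function name, baseline window, and threshold below are illustrative assumptions. A snippet whose workload is invariant should take roughly constant time, so an execution whose runtime deviates substantially from a baseline signals a system-level performance problem, such as a node with degraded memory or network bandwidth.

```python
import statistics

def detect_variance(timings, baseline_count=5, threshold=0.2):
    """Flag iterations of a fixed-workload snippet whose runtime deviates
    from the baseline median by more than `threshold` (relative).

    timings        -- per-iteration runtimes of one fixed-workload snippet
    baseline_count -- number of initial iterations used to build the baseline
    threshold      -- relative deviation (0.2 = 20%) beyond which an
                      iteration is reported as anomalous
    Returns the indices of anomalous iterations.
    """
    baseline = statistics.median(timings[:baseline_count])
    return [i for i, t in enumerate(timings)
            if abs(t - baseline) / baseline > threshold]

# A fixed-workload snippet should take near-constant time; a slow node
# shows up as an outlier (index 6 here, ~35% above the baseline).
timings = [1.00, 1.02, 0.99, 1.01, 1.00, 1.00, 1.35, 1.01]
print(detect_variance(timings))  # -> [6]
```

In the actual system, such checks run online per process, so a single slow node in a 4,096-process job can be localized from the snippets executing on it.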
References
Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. SC’03. Phoenix, AZ, USA: ACM.
Ferreira, K. B., et al. (2013). The impact of system design parameters on application noise sensitivity. Cluster Computing, 16(1), 117–129.
Mondragon, O. H., et al. (2016). Understanding performance interference in next-generation HPC systems. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 384–395). IEEE.
Wright, N. J., et al. (2009). Measuring and understanding variation in benchmark performance. In DoD High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2009 (pp. 438–443). IEEE.
TOP500 website (2020). http://top500.org/.
Skinner, D., & Kramer, W. (2005). Understanding the causes of performance variability in HPC workloads. In Proceedings of the IEEE International Workload Characterization Symposium, 2005 (pp. 137–149). IEEE.
Hoefler, T., Schneider, T., & Lumsdaine, A. (2010). Characterizing the influence of system noise on large-scale applications by simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. SC’10 (pp. 1–11).
Gong, Y., He, B., & Li, D. (2014). Finding constant from change: Revisiting network performance aware optimizations on iaas clouds. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 982–993). IEEE.
Jones, T. R., Brenner, L. B., & Fier, J. M. (2003). Impacts of operating systems on the scalability of parallel applications. Technical Report UCRL-MI-202629, Lawrence Livermore National Laboratory.
Tallent, N. R., Adhianto, L., & Mellor-Crummey, J. M. (2010). Scalable identification of load imbalance in parallel executions using call path profiles. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1–11). IEEE Computer Society.
Wylie, B. J. N., Geimer, M., & Wolf, F. (2008). Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Scientific Programming, 16(2–3), 167–181.
Geimer, M., et al. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6), 702–719.
Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.
Zhai, J., et al. (2022). Leveraging code snippets to detect variations in the performance of HPC systems. IEEE Transactions on Parallel and Distributed Systems, 33(12), 3558–3574.
Lattner, C., & Adve, V. (2004). LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (p. 75). IEEE Computer Society.
MPI Documents. http://mpi-forum.org/docs/
Mucci, P., et al. (2004). Automating the large-scale collection and analysis of performance data on Linux clusters. In Proceedings of the 5th LCI International Conference on Linux Clusters: The HPC Revolution.
Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.
Pfeiffer, W., & Stamatakis, A. (2010). Hybrid MPI/Pthreads parallelization of the RAxML phylogenetics code. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW) (pp. 1–8). IEEE.
Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1), 155–177.
Karlin, I., Keasler, J., & Neely, J. R. (2013). LULESH 2.0 updates and changes. Technical Report, Lawrence Livermore National Laboratory (LLNL), Livermore, CA.
Weaver, V. M., Terpstra, D., & Moore, S. (2013). Non-determinism and overcount on modern hardware performance counter implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 215–224). IEEE.
Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.
Intel Trace Analyzer and Collector. https://software.intel.com/en-us/trace-analyzer
Tsafrir, D., et al. (2005). System noise, OS clock ticks, and fine-grained parallel applications. In Proceedings of the 19th Annual International Conference on Supercomputing. ICS’05 (pp. 303–312). New York, NY, USA: ACM. ISBN: 1-59593-167-8.
Jones, T. (2012). Linux kernel co-scheduling and bulk synchronous parallelism. International Journal of High Performance Computing Applications, 26, 1094342011433523.
Agarwal, S., Garg, R., & Vishnoi, N. K. (2005). The impact of noise on the scaling of collectives: A theoretical approach. In High Performance Computing–HiPC 2005 (pp. 280–289). Springer.
Beckman, P., et al. (2006). The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing (pp. 1–12). IEEE.
Phillips, J. C., et al. (2002). NAMD: Biomolecular simulation on thousands of processors. In Supercomputing, ACM/IEEE 2002 Conference (p. 36).
Lo, Y. J., et al. (2014). Roofline model toolkit: A practical tool for architectural and program analysis. In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (pp. 129–148). Springer.
Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), 65–76.
Calotoiu, A., et al. (2016). Fast multi-parameter performance modeling. In 2016 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 172–181). IEEE.
Yeom, J.-S., et al. (2016). Data-driven performance modeling of linear solvers for sparse matrices. In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (pp. 32–42). IEEE.
Lee, S., Meredith, J. S., & Vetter, J. S. (2015). Compass: A framework for automated performance modeling and prediction. In Proceedings of the 29th ACM on International Conference on Supercomputing (pp. 405–414). ACM.
Wu, X., & Mueller, F. (2013). Elastic and scalable tracing and accurate replay of non-deterministic events. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS’13 (pp. 59–68). ACM.
Tallent, N. R., et al. (2011). Scalable fine-grained call path tracing. In Proceedings of the International Conference on Supercomputing (pp. 63–74). ACM.
Mitra, S., et al. (2014). Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices, 49(6), 193–203. ACM.
Laguna, I., et al. (2015). Diagnosis of performance faults in large-scale MPI applications via probabilistic progress-dependence inference. IEEE Transactions on Parallel and Distributed Systems, 26(5), 1280–1289.
Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.
Dean, D. J., et al. (2014). Perfscope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing (pp. 1–13).
Sahoo, S. K. et al. (2013). Using likely invariants for automated software fault localization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 139–152).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Zhai, J., Jin, Y., Chen, W., Zheng, W. (2023). Lightweight Noise Detection. In: Performance Analysis of Parallel Applications for HPC. Springer, Singapore. https://doi.org/10.1007/978-981-99-4366-1_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4365-4
Online ISBN: 978-981-99-4366-1
eBook Packages: Computer Science (R0)