Estimating the Impact of External Interference on Application Performance

  • Aamer Shah
  • Matthias Müller
  • Felix Wolf (corresponding author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11014)

Abstract

The wall-clock execution time of applications on HPC clusters is commonly subject to run-to-run variation, often caused by external interference from concurrently running jobs. Because of the irregularity of this interference from the perspective of the affected job, performance analysts do not consider it an intrinsic part of application execution, which is why they wish to factor it out when measuring execution time. However, if the chance is high enough that at least one interference event strikes while the job is running, merely repeating runs several times and picking the fastest run does not guarantee a measurement free of external influence. In this paper, we present a novel approach to estimate the impact of sporadic and high-impact interference on bulk-synchronous MPI applications. An evaluation with several realistic benchmarks shows that the impact of interference can be estimated from a single run.

Notes

Acknowledgment

This work has been supported by the German Research Foundation (DFG) through the Program Performance Engineering for Scientific Software and the ExtraPeak project, by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IH16008D, and by the US Department of Energy under Grant No. DE-SC0015524. Additional funding was provided through the Hessian LOEWE initiative within the Software-Factory 4.0 project. Finally, we would like to express our gratitude to Jülich Supercomputing Centre and High Performance Computing Center Stuttgart for giving us access to their supercomputers JUQUEEN and Hazel Hen, respectively.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IT Center, RWTH Aachen University, Aachen, Germany
  2. Laboratory for Parallel Programming, TU Darmstadt, Darmstadt, Germany
