FINJ: A Fault Injection Tool for HPC Systems

  • Alessio Netti
  • Zeynep Kiziltan
  • Ozalp Babaoglu
  • Alina Sîrbu
  • Andrea Bartolini
  • Andrea Borghesi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)

Abstract

We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.

Keywords

Exascale systems, Resiliency, Fault detection, Monitoring, Benchmarking, Open-source

Notes

Acknowledgements

A. Netti has been supported by a research fellowship from the Oprecomp-Open Transprecision Computing project. A. Sîrbu has been partially funded by the EU project SoBigData Research Infrastructure—Big Data and Social Mining Ecosystem (grant agreement 654024).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Alessio Netti (1)
  • Zeynep Kiziltan (1)
  • Ozalp Babaoglu (1)
  • Alina Sîrbu (2)
  • Andrea Bartolini (3)
  • Andrea Borghesi (3)
  1. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
  2. Department of Computer Science, University of Pisa, Pisa, Italy
  3. Department of Electrical, Electronic and Information Engineering, University of Bologna, Bologna, Italy
