
HPC-SFI: System-Level Fault Injection for High Performance Computing Systems

  • Yanqi Wang
  • Qi Zhang
  • Yi Liu
  • Depei Qian
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11276)

Abstract

Resilience, or fault tolerance, has become a key challenge for large-scale parallel systems. To ensure the reliability of high performance computing (HPC) systems, various techniques have been proposed, such as hardware-level fault tolerance, checkpointing, replication, and algorithm-based fault tolerance. Many software systems also monitor and handle system failures, e.g. the management and job-scheduling systems of HPC installations. Evaluating the effectiveness of these systems requires a tool that can inject failures into an HPC system. This paper proposes HPC-SFI, a system-level fault injection tool for HPC systems. HPC-SFI can generate three kinds of system failures: in-node faults, failures in the interconnection network, and failures of the storage/parallel file system. In addition, HPC-SFI can inject system faults pseudo-randomly according to pre-defined parameters and probabilities. Preliminary experimental results demonstrate the effectiveness of the tool.
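To make the idea of pseudo-random injection driven by pre-defined parameters and probabilities concrete, the following is a minimal illustrative sketch, not the HPC-SFI implementation. It only plans a fault schedule and injects nothing. The names `FaultEvent` and `plan_faults`, the node names, the per-kind weights, and the exponentially distributed gaps between faults are all assumptions made for illustration; the abstract does not specify them.

```python
import random
from dataclasses import dataclass

# Hypothetical labels for the three failure kinds named in the abstract.
FAULT_KINDS = ("node", "network", "storage")


@dataclass(frozen=True)
class FaultEvent:
    time_s: float  # planned injection time, in seconds from the start
    kind: str      # one of FAULT_KINDS
    target: str    # hypothetical node name


def plan_faults(nodes, weights, mean_interval_s, horizon_s, seed):
    """Plan a reproducible pseudo-random fault schedule (no faults are injected).

    Assumption: gaps between faults are exponentially distributed with the
    given mean, and the fault kind is drawn according to `weights`.
    """
    rng = random.Random(seed)  # a fixed seed makes the schedule repeatable
    kind_weights = [weights[k] for k in FAULT_KINDS]
    events = []
    t = rng.expovariate(1.0 / mean_interval_s)
    while t < horizon_s:
        kind = rng.choices(FAULT_KINDS, weights=kind_weights)[0]
        events.append(FaultEvent(t, kind, rng.choice(nodes)))
        t += rng.expovariate(1.0 / mean_interval_s)
    return events


if __name__ == "__main__":
    plan = plan_faults(
        nodes=["n0", "n1", "n2", "n3"],
        weights={"node": 0.6, "network": 0.3, "storage": 0.1},
        mean_interval_s=60.0,
        horizon_s=3600.0,
        seed=42,
    )
    for event in plan[:5]:
        print(f"{event.time_s:8.1f}s  {event.kind:8s} {event.target}")
```

Seeding the generator keeps a run repeatable, so the same fault sequence can be replayed when comparing how different monitoring or scheduling systems respond.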

Notes

Acknowledgments

The research presented in this paper has been supported by the National Key R&D Program of China under Grant No. 2016YFB0200100 and the Natural Science Foundation of China under Grant No. 91530324.

Copyright information

© IFIP International Federation for Information Processing 2018

Authors and Affiliations

  1. Sino-German Joint Software Institute, Beihang University, Beijing, China
