FlipIt: An LLVM Based Fault Injector for HPC

  • Jon Calhoun
  • Luke Olson
  • Marc Snir
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8805)

Abstract

High performance computing (HPC) is increasingly subjected to faulty computations. The frequency of silent data corruptions (SDCs) in particular is expected to increase in emerging machines, requiring HPC applications to handle SDCs. In this paper, we propose a robust fault injector structured through an LLVM compiler pass that allows simulation of SDCs in various applications. Although fault injection locations are enumerated at compile time, their activation is purely at runtime and based on a user-provided fault distribution. The robustness of our fault injector lies in the ability to augment the runtime injection logic on a per-application basis. This allows tighter control over the spatial and temporal placement and the probability of injected faults. The usability, scalability, and robustness of our fault injector are demonstrated by injecting faults into an algebraic multigrid solver.
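
The split the abstract describes, with injection sites fixed ahead of time and activation decided at runtime from a user-supplied distribution, can be illustrated with a small sketch. This is not FlipIt's interface: the `FaultInjector` class, its `visit` hook, the site names, and the per-site probability table are all hypothetical, and Python stands in for the instrumented LLVM code purely for readability. A single bit flip in an IEEE-754 double is assumed as the corruption model.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


class FaultInjector:
    """Hypothetical runtime hook (an assumption, not the paper's API).

    Sites are known up front, standing in for locations enumerated at
    compile time. Whether a given visit is corrupted is decided at runtime
    from a user-provided probability for that site.
    """

    def __init__(self, site_probability, seed=0):
        self.site_probability = site_probability  # site id -> probability per visit
        self.rng = random.Random(seed)            # seeded for reproducible campaigns
        self.log = []                             # (site, bit) for each injected fault

    def visit(self, site, value):
        """Return `value` unchanged, or with one random bit flipped."""
        p = self.site_probability.get(site, 0.0)
        if self.rng.random() < p:
            bit = self.rng.randrange(64)
            self.log.append((site, bit))
            return flip_bit(value, bit)
        return value
```

In this sketch, per-application control amounts to changing the probability table or overriding `visit`, for example to inject only during a chosen phase of a solver or only on selected processes.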

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jon Calhoun (1)
  • Luke Olson (1)
  • Marc Snir (1)

  1. University of Illinois at Urbana-Champaign, Urbana, USA
