Abstract
High performance computing (HPC) is increasingly subjected to faulty computations. The frequency of silent data corruptions (SDCs) in particular is expected to increase in emerging machines requiring HPC applications to handle SDCs. In this paper we, propose a robust fault injector structured through an LLVM compiler pass that allows simulation of SDCs in various applications. Although fault injection locations are enumerated at compile time, their activation is purely at runtime and based on a user-provided fault distribution. The robustness of our fault injector is in the ability to augment the runtime injection logic on a per application basis. This allows tighter control on the spacial, temporal, and probability of injected faults. The usability, scalability, and robustness of our fault injection is demonstrated with injecting faults into an algebraic multigird solver.
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Download to read the full chapter text
Chapter PDF
References
Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1(1), 11–33 (2004)
Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., Matsuoka, S.: FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 32:1–32:32. ACM, New York (2011)
Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale resilience. Int. J. High Perform. Comput. Appl. 23(4), 374–388 (2009)
Carreira, J., Madeira, H., Silva, J.G.: Xception: a technique for the experimental evaluation of dependability in modern computers. IEEE Transactions on Software Engineering 24(2), 36–125 (1998)
Casas, M., de Supinski, B.R., Bronevetsky, G., Schulz, M.: Fault resilience of the algebraic multi-grid solver. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS 2012, pp. 91–100. ACM, New York (2012)
de Kruijf, M., Nomura, S., Sankaralingam, K.: Relax: An architectural framework for software recovery of hardware faults. In: Proceedings of the 37th International Symposium on Computer Architecture (ISCA) (2010)
Fiala, D., Mueller, F., Engelmann, C., Riesen, R., Ferreira, K., Brightwell, R.: Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 1–78. IEEE Computer Society Press, Los Alamitos (2012)
Han, S., Rosenberg, H.A., Shin, K.G.: Doctor: An integrated software fault injection environment (1995)
Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (BLCR) for linux clusters. Journal of Physics: Conference Series 46(1), 494 (2006)
Kogge, P.M., La Fratta, P., Vance, M.: [2010] facing the exascale energy wall. In: Proceedings of the 2010 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, IWIA 2010, pp. 51–58. IEEE Computer Society, Washington, DC (2010)
Lattner, C., Adve, V.: LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In: Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO2004), Palo Alto, California (March 2004)
Li, D., Vetter, J.S., Yu, W.: Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 57:1–57:11. IEEE Computer Society Press, Los Alamitos (2012)
Lu, C.-d., Reed, D.A.: Assessing fault sensitivity in MPI applications. In: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC 2004, p. 37. IEEE Computer Society, Washington, DC (2004)
Riesen, R., Ferreira, K., Da Silva, D., Lemarinier, P., Arnold, D., Bridges, P.G.: Alleviating scalability issues of checkpointing protocols. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 1–18. IEEE Computer Society Press, Los Alamitos (2012)
Sato, K., Gamblin, T., Moody, A., de Supinski, B.R., Mohror, K., Maruyama, N.: Design and modeling of non-blocking checkpoint system. In: Proceedings of the ATIP/A*CRC Workshop on Accelerator Technologies for High-Performance Computing: Does Asia Lead the Way?, ATIP 2012, pp. 39:1–39:2. A*STAR Computational Resource Centre, Singapore (2012)
Sharma, V.C., Haran, A., Rakamarić, Z., Gopalakrishnan, G.: Towards formal approaches to system resilience. In: Proceedings of the 19th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC (2013)
Sridharan, V., Liberty, D.: A study of DRAM failures in the field. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 76:1–76:11. IEEE Computer Society Press, Los Alamitos (2012)
Stott, D.T., Floering, B., Burke, D., Kalbarczyk, Z., Iyer, R.K.: NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In: Proceedings of the IEEE International Computer Performance and Dependability Symposium, pp. 91–100 (2000)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Calhoun, J., Olson, L., Snir, M. (2014). FlipIt: An LLVM Based Fault Injector for HPC. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_47
Download citation
DOI: https://doi.org/10.1007/978-3-319-14325-5_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer ScienceComputer Science (R0)