Abstract
Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error frequency rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that once tuned, LIBSDC is able to achieve SDC protection with 50% overhead of resources, less than the 100% needed for double modular redundancy.
Keywords
- Memory Access
- Fault Tolerance
- System Call
- Sandia National Laboratory
- Normalize Execution Time
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Download conference paper PDF
References
Chen, C.L., Hsiao, M.Y.: Error-correcting codes for semiconductor memory applications: A state-of-the-art review. IBM Journal of Research and Development 28(2), 124–134 (1984)
Ferreira, K.B., Riesen, R., Brighwell, R., Bridges, P., Arnold, D.: libhashckpt: Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp. 272–281. Springer, Heidelberg (2011)
Huang, K.H., Abraham, J.: Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33(6), 518–528 (1984)
Oh, N., Shirvani, P., McCluskey, E.J.: Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability 51(1), 63–75 (2002)
Oh, N., Shirvani, P., McCluskey, E.: Control-flow checking by software signatures. IEEE Transactions on Reliability 51(1), 111–122 (2002)
Rebaudengo, M., Reorda, M., Violante, M., Torchiano, M.: A source-to-source compiler for generating dependable software. In: Proceedings of First IEEE International Workshop on Source Code Analysis and Manipulation 2001, pp. 33–42 (2001)
Reis, G.A., Chang, J., Vachharajani, N., Rangan, R., August, D.I.: Swift: Software implemented fault tolerance. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO 2005, pp. 243–254. IEEE Computer Society, Washington, DC (2005), http://dx.doi.org/10.1109/CGO.2005.34
Sandia National Laboratory: Mantevo project home page (June 2011), https://software.sandia.gov/mantevo
Schroeder, B., Pinheiro, E., Weber, W.D.: Dram errors in the wild: a large-scale field study. In: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2009, pp. 193–204. ACM, New York (2009), http://doi.acm.org/10.1145/1555349.1555372
Shirvani, P., Saxena, N., McCluskey, E.: Software-implemented edac protection against seus. IEEE Transactions on Reliability 49(3), 273–284 (2000)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fiala, D., Ferreira, K.B., Mueller, F., Engelmann, C. (2012). A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_29
Download citation
DOI: https://doi.org/10.1007/978-3-642-29740-3_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer ScienceComputer Science (R0)
