A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC
Abstract
Proposed exascale systems will present considerable resiliency challenges. In particular, DRAM soft errors, or bit-flips, are expected to increase greatly because of the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for the expected soft-error rate, so additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with 50% resource overhead, less than the 100% needed for double modular redundancy.
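The abstract does not describe how page integrity verification is carried out, so the following is only a minimal sketch of the general idea, not LIBSDC's implementation. It assumes a 4096-byte page, a CRC32 checksum stored per page, and verification of every page touched by a read or write; the names `CheckedMemory` and `SilentDataCorruption` are hypothetical. The sketch covers detection only and makes no attempt at correction.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


class SilentDataCorruption(Exception):
    """Raised when a page no longer matches its stored checksum."""


class CheckedMemory:
    """Toy model of a buffer whose pages are verified when accessed."""

    def __init__(self, num_pages):
        self.data = bytearray(num_pages * PAGE_SIZE)
        # One stored checksum per page, computed over the initial contents.
        self.checksums = [self._crc(p) for p in range(num_pages)]

    def _crc(self, page):
        start = page * PAGE_SIZE
        return zlib.crc32(bytes(self.data[start:start + PAGE_SIZE]))

    def _pages(self, addr, length):
        # Indices of all pages overlapped by [addr, addr + length).
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        return range(first, last + 1)

    def verify(self, page):
        return self._crc(page) == self.checksums[page]

    def _check(self, addr, length):
        for page in self._pages(addr, length):
            if not self.verify(page):
                raise SilentDataCorruption(f"page {page} failed verification")

    def read(self, addr, length):
        self._check(addr, length)  # verify before handing data back
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        # Verify first so an earlier bit-flip is not hidden by the new checksum.
        self._check(addr, len(payload))
        self.data[addr:addr + len(payload)] = payload
        for page in self._pages(addr, len(payload)):
            self.checksums[page] = self._crc(page)
```

Flipping one bit directly in `mem.data`, bypassing `write`, stands in for a DRAM soft error: a later `read` of that page raises `SilentDataCorruption`, while untouched pages still verify. In this toy model the stored checksums are small relative to the data they cover, whereas double modular redundancy keeps a full second copy; this is an illustration of the trade-off only and says nothing about where the reported 50% overhead comes from.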
Keywords
Memory Access, Fault Tolerance, System Call, Sandia National Laboratory, Normalize Execution Time