Abstract
Redundant multithreading (RMT) is an effective reliability solution that provides thread-level replication; however, it imposes additional overheads in the form of performance loss and energy consumption. Partial-RMT is an alternative that replicates only part of an executing thread, reducing these overheads at the cost of full fault coverage. In this study, we propose a software-level RMT approach that offers lightweight replication of partial code regions within the same application process. The approach is particularly suitable for applications with varying code criticality: we determine the critical code regions by performing a fault injection campaign together with an execution time profile analysis. Using these results, the application programmer annotates the source code to indicate the specific code regions that should be executed redundantly, without re-implementing the application from scratch. Our lightweight software-level RMT tool improves the average silent data corruption (SDC) rate of 30 applications of the PolyBench benchmark suite by around 7.6\(\times\), with average performance and energy consumption overheads of 22% and 37%, respectively, compared to the original version of the program.
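To make the idea of annotating only the critical regions concrete, the following is a minimal sketch and not the tool described in this article. It assumes a hypothetical `redundant` decorator as the annotation, a hypothetical `MismatchError`, and two stand-in functions (`dot` as a critical region, `scale` as a non-critical one). Python is used only for brevity, and Python threads do not necessarily run in parallel; the sketch illustrates the replicate-and-compare structure, not the performance behavior.

```python
import copy
import threading


class MismatchError(RuntimeError):
    """Hypothetical error raised when the two replicas disagree."""


def redundant(func):
    """Hypothetical annotation: run `func` in two threads and compare results."""
    def wrapper(*args, **kwargs):
        results = [None, None]

        def replica(slot):
            # Each replica works on a private deep copy of the inputs so a
            # corrupted value in one replica cannot leak into the other.
            private_args = copy.deepcopy(args)
            private_kwargs = copy.deepcopy(kwargs)
            results[slot] = func(*private_args, **private_kwargs)

        threads = [threading.Thread(target=replica, args=(i,)) for i in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Output comparison: a difference signals a potential silent data
        # corruption in one of the replicas.
        if results[0] != results[1]:
            raise MismatchError(f"replicas of {func.__name__} disagree")
        return results[0]
    return wrapper


@redundant
def dot(x, y):
    # Stand-in for a region judged critical (assumption for illustration).
    return sum(a * b for a, b in zip(x, y))


def scale(x, alpha):
    # Non-critical region: executed once, without replication.
    return [alpha * v for v in x]
```

Only the annotated function pays the replication cost; the rest of the program runs unchanged, which is the trade-off between coverage and overhead that partial redundancy targets.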
Notes
We implemented it using [46] based on the original version of CG.
References
Cisco 12000 single event upset failures overview and work around summary (2003). http://www.cisco.com/en/US/ts/fn/200/fn25994.html. Cisco Systems
Toyota case: single bit flip that killed (2013). https://www.eetimes.com/toyota-case-single-bit-flip-that-killed/. EE Times
Marenostrum 4 (2017). https://www.bsc.es/marenostrum/marenostrum/technical-information
Crossroads benchmarks, micro-benchmarks, & ASC code suite (2019). https://www.lanl.gov/projects/crossroads/benchmarks-performance-analysis.php
Arslan S, Topcuoglu HR, Kandemir MT, Tosun O (2019) Scheduling opportunities for asymmetrically reliable caches. J Parall Distrib Comput 126:134–151
Bautista-Gomez L, Zyulkyarov F, Unsal O, McIntosh-Smith S (2016) Unprotected computing: a large-scale study of DRAM raw error rate on a supercomputer. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 645–655. IEEE
Bergman K, Borkar S, Campbell D, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hill K, Hiller J, et al. (2008) Exascale computing study: technology challenges in achieving exascale systems. Air Force Research Laboratory. https://people.eecs.berkeley.edu/~yelick/papers/Exascale_final_report.pdf
Biswas A, Soundararajan N, Mukherjee SS, Gurumurthi S (2009) Quantized AVF: A means of capturing vulnerability variations over small windows of time. In: IEEE workshop on silicon errors in logic - system effects
Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley R (2002) An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Softw 28(2):135–151. https://doi.org/10.1145/567806.567807
Chen Y, Chen P (2016) A software-based redundant execution programming model for transient fault detection and correction. In: 45th International Conference on Parallel Processing Workshops (ICPPW), pp. 66–71. https://doi.org/10.1109/ICPPW.2016.25
Döbel B, Härtig H, Engel M (2012) Operating system support for redundant multithreading. In: Proceedings of the Tenth ACM International Conference on Embedded Software, EMSOFT’12, pp. 83–92. https://doi.org/10.1145/2380356.2380375
Döbel B, Härtig H (2014) Can we put concurrency back into redundant multithreading? In: 2014 International Conference on Embedded Software (EMSOFT), pp. 1–10
Fang B, Pattabiraman K, Ripeanu M, Gurumurthi S (2014) Evaluating the error resilience of parallel programs. In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 720–725
Feng S, Gupta S, Ansari A, Mahlke S (2010) Shoestring: probabilistic soft error reliability on the cheap. In: Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp. 385–396. https://doi.org/10.1145/1736020.1736063
Gomaa M, Scarbrough C, Vijaykumar TN, Pomeranz I (2003) Transient-fault recovery for chip multiprocessors. In: 30th Annual International Symposium on Computer Architecture, 2003. Proceedings., pp. 98–109
Gomaa MA, Vijaykumar TN (2005) Opportunistic transient-fault detection. In: 32nd International Symposium on Computer Architecture (ISCA’05), pp. 172–183. https://doi.org/10.1109/ISCA.2005.38
Guo L, Li D, Laguna I, Schulz M (2018) FlipTracker: Understanding natural error resilience in HPC applications. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 94–107. https://doi.org/10.1109/SC.2018.00011
Hari SKS, Adve SV, Naeimi H (2012) Low-cost program-level detectors for reducing silent data corruptions. In: IEEE/IFIP international conference on dependable systems and networks (DSN 2012), pp. 1–12
Hazucha P, Karnik T, Maiz J, Walstra S, Bloechel B, Tschanz J, Dermer G, Hareland S, Armstrong P, Borkar S (2003) Neutron soft error rate measurements in a 90-nm CMOS process and scaling trends in SRAM from 0.25-μm to 90-nm generation. In: IEEE International Electron Devices Meeting 2003, pp. 21.5.1–21.5.4. https://doi.org/10.1109/IEDM.2003.1269336
Hukerikar S, Diniz PC, Lucas RF, Teranishi K (2014) Opportunistic application-level fault detection through adaptive redundant multithreading. In: International Conference on High Performance Computing & Simulation (HPCS), pp. 243–250
Hukerikar S, Lucas RF (2016) Rolex: resilience-oriented language extensions for extreme-scale systems. J Supercomput 72:4662–4695
Hukerikar S, Teranishi K, Diniz PC, Lucas RF (2014) An evaluation of lazy fault detection based on adaptive redundant multithreading. In: IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6
Hukerikar S, Teranishi K, Diniz PC, Lucas RF (2018) RedThreads: an interface for application-level fault detection/correction through adaptive redundant multithreading. Int J Parall Program 46(2):225–251. https://doi.org/10.1007/s10766-017-0492-3
Kahng AB (2013) The ITRS design technology and system drivers roadmap: process and status. In: 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6
Khudia DS, Mahlke S (2014) Harnessing soft computations for low-budget fault tolerance. In: 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 319–330
Kolodziej SP, Aznaveh M, Bullock M, David J, Davis TA, Henderson M, Hu Y, Sandstrom R (2019) The SuiteSparse matrix collection website interface. J Open Source Softw 4(35):1244
Koren I, Su SYH (1979) Reliability analysis of n-modular redundancy systems with intermittent and permanent faults. IEEE Trans Comput 28(7):514–520
Leveugle R, Calvez A, Maistri P, Vanhauwaert P (2009) Statistical fault injection: quantified error and confidence. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’09, pp. 502–506
Li T, Ambrose JA, Ragel R, Parameswaran S (2016) Processor design for soft errors: challenges and state of the art. ACM Comput Surv (CSUR) 49(3):1–44
Lyons D (2000) Sun screen. https://www.forbes.com/forbes/2000/1113/6613068a/#2e22cb75276c. Forbes Magazine
Mitropoulou K, Porpodas V, Jones TM (2016) COMET: Communication-optimised multi-threaded error-detection technique. In: International Conference on Compilers, Architectures, and Synthesis of Embedded Systems (CASES), pp. 1–10
Mukherjee S (2008) Architecture design for soft errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
Mukherjee SS, Kontz M, Reinhardt SK (2002) Detailed design and evaluation of redundant multi-threading alternatives. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 99–110
Olukotun K, Nayfeh BA, Hammond L, Wilson K, Chang K (1996) The case for a single-chip multiprocessor. In: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pp. 2–11. https://doi.org/10.1145/237090.237140
Omar H, Shi Q, Ahmad M, Dogan H, Khan O (2018) Declarative resilience: a holistic soft-error resilient multicore architecture that trades off program accuracy for efficiency. ACM Trans Embed Comput Syst 17(4). https://doi.org/10.1145/3210559
Oz I, Arslan S (2019) A survey on multithreading alternatives for soft error fault tolerance. ACM Comput Surv 52(2):1–38
Parashar A, Sivasubramaniam A, Gurumurthi S (2006) Slick: slice-based locality exploitation for efficient redundant multithreading. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp. 95–105. https://doi.org/10.1145/1168857.1168870
Pouchet LN (2016) PolyBench/C. https://web.cse.ohio-state.edu/~pouchet.2/software/polybench/
Pouyan F, Azarpeyvand A, Safari S, Fakhraie SM (2015) Reliability-aware simultaneous multithreaded architecture using online architectural vulnerability factor estimation. IET Comput Digital Tech 9(2):124–133. https://doi.org/10.1049/iet-cdt.2013.0162
Pouyan F, Azarpeyvand A, Safari S, Fakhraie SM (2016) Reliability aware throughput management of chip multi-processor architecture via thread migration. J Supercomput 72(4):1363–1380. https://doi.org/10.1007/s11227-016-1665-3
Prvulovic M, Zhang Z, Torrellas J (2002) ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 111–122
Reddy VK, Rotenberg E, Parthasarathy S (2006) Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp. 83–94. https://doi.org/10.1145/1168857.1168869
Reinhardt SK, Mukherjee SS (2000) Transient fault detection via simultaneous multithreading. In: Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201), pp. 25–36
Rosa Fd, Bandeira V, Reis R, Ost L (2018) Extensive evaluation of programming models and ISAs impact on multicore soft error reliability. In: Proceedings of the 55th Annual Design Automation Conference, pp. 1–6
Sangchoolie B, Pattabiraman K, Karlsson J (2017) One bit is (not) enough: an empirical study of the impact of single and multiple bit-flip errors. In: 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 97–108
Sao PK (2018) Scalable and resilient sparse linear solvers. Ph.D. thesis, Georgia Institute of Technology
Siddiqua T, Gurumurthi S (2009) Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors. In: IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 1–12. https://doi.org/10.1109/MASCOT.2009.5363142
Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B et al (2014) Addressing failures in exascale computing. Int J High Perform Comput Appl 28(2):129–173
So H, Didehban M, Ko Y, Shrivastava A, Lee K (2018) Expert: Effective and flexible error protection by redundant multithreading. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 533–538
So H, Didehban M, Shrivastava A, Lee K (2019) A software-level redundant multithreading for soft/hard error detection and recovery. In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1559–1562
Sorin DJ, Martin MMK, Hill MD, Wood DA (2002) Safetynet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 123–134
Sridharan V, DeBardeleben N, Blanchard S, Ferreira KB, Stearley J, Shalf J, Gurumurthi S (2015) Memory errors in modern systems: the good, the bad, and the ugly. ACM SIGARCH Comput Archit News 43(1):297–310
Subasi O, Yalcin G, Zyulkyarov F, Unsal O, Labarta J (2017) Designing and modelling selective replication for fault-tolerant HPC applications. In: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 452–457. https://doi.org/10.1109/CCGRID.2017.40
Sundaramoorthy K, Purser Z, Rotenberg E (2000) Slipstream processors: Improving both performance and fault tolerance. In: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pp. 257–268. https://doi.org/10.1145/378993.379247
Swift GM, Guertin SM (2000) In-flight observations of multiple-bit upset in DRAMs. IEEE Trans Nucl Sci 47(6):2386–2391. https://doi.org/10.1109/23.903781
Vijaykumar TN, Pomeranz I, Cheng K (2002) Transient-fault recovery using simultaneous multithreading. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 87–98
Von Neumann J (1956) Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Stud 34:43–98. https://doi.org/10.1515/9781400882618-003
Wang C, Kim H, Wu Y, Ying V (2007) Compiler-managed software-based redundant multi-threading for transient fault detection. In: International Symposium on Code Generation and Optimization (CGO), pp. 244–258. https://doi.org/10.1109/CGO.2007.7
Wei J, Thomas A, Li G, Pattabiraman K (2014) Quantifying the accuracy of high-level fault injection techniques for hardware faults. In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 375–382. https://doi.org/10.1109/DSN.2014.2
Wang Y, Huang Y, Vo K, Chung P, Kintala C (1995) Checkpointing and its applications. In: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers, pp. 22–31
Zhang Y, Lee JW, Johnson NP, August DI (2010) Daft: decoupled acyclic fault tolerance. In: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 87–97
Acknowledgements
This work was completed while the first author, Sanem Arslan, was a visiting researcher at Barcelona Supercomputing Center, Barcelona, Spain. Sanem Arslan received financial support from the Scientific and Technological Research Council of Turkey (TUBITAK) under the program BIDEB 2219 during this work.
Cite this article
Arslan, S., Unsal, O. Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading. J Supercomput 77, 14130–14160 (2021). https://doi.org/10.1007/s11227-021-03804-6
DOI: https://doi.org/10.1007/s11227-021-03804-6