Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading

Abstract

Redundant multithreading (RMT) is an effective reliability solution that provides thread-level replication; however, it imposes additional overheads in the form of performance loss and energy consumption. Partial-RMT is an alternative that replicates only part of an executing thread, reducing these overheads at the cost of full fault coverage. In this study, we propose a software-level RMT approach that offers lightweight replication of partial code regions within the same application process. The approach is particularly suitable for applications whose code regions vary in criticality. We determine the critical code regions by performing a fault injection campaign together with an execution time profile analysis. Using these results, the application programmer annotates the source code to mark the specific regions that should be executed redundantly, without re-implementing the application from scratch. Our lightweight software-level RMT tool improves the average silent data corruption (SDC) rate of 30 applications of the PolyBench benchmark suite by around 7.6× with average performance and energy consumption overheads of 22% and 37%, respectively, compared to the original version of the program.
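The idea of replicating only an annotated critical region can be illustrated with a minimal sketch. It is not the authors' tool or its API: `run_redundantly`, `SDCDetected`, `dot` and `noncritical_scale` are hypothetical names introduced here, and Python threads stand in for the replica threads only to show the control flow (run the marked region twice on private inputs, compare the outputs, run everything else once).

```python
from concurrent.futures import ThreadPoolExecutor
import copy


class SDCDetected(RuntimeError):
    """Raised when the two redundant executions disagree."""


def run_redundantly(region, *args):
    """Run `region` twice on private copies of its inputs and compare results.

    Hypothetical helper for illustration only; not the API of the authors' tool.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each replica gets its own deep copy of the inputs, so a corrupted
        # input in one replica cannot silently reach the other.
        leading = pool.submit(region, *copy.deepcopy(args))
        trailing = pool.submit(region, *copy.deepcopy(args))
        first, second = leading.result(), trailing.result()
    if first != second:
        # A mismatch means at least one replica produced a wrong value.
        raise SDCDetected("replica outputs differ")
    return first


def dot(x, y):
    """Region assumed to be critical: a dot product standing in for a kernel."""
    return sum(xi * yi for xi, yi in zip(x, y))


def noncritical_scale(v, s):
    """Region assumed to be non-critical: executed once, without replication."""
    return [s * vi for vi in v]


x = noncritical_scale([1.0, 2.0, 3.0], 2.0)        # executed once
result = run_redundantly(dot, x, [1.0, 1.0, 1.0])  # executed twice, then compared
```

Only the call wrapped in `run_redundantly` pays the replication cost, which is the trade-off the abstract describes: lower overhead than full RMT in exchange for leaving the unmarked regions unprotected.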

Notes

  1.

    https://github.com/jamtrott/csr-spmv

  2.

    https://github.com/jmabuin/matrix-market-suite

  3.

    We implemented it using [46] based on the original version of CG.

  4.

    https://github.com/sanemarslan/RMTLib

References

  1.

    Cisco 12000 single event upset failures overview and work around summary (2003). http://www.cisco.com/en/US/ts/fn/200/fn25994.html. Cisco Systems

  2.

    Toyota case: single bit flip that killed | EE Times. https://www.eetimes.com/toyota-case-single-bit-flip-that-killed/ (2013)

  3.

    Marenostrum 4 (2017). https://www.bsc.es/marenostrum/marenostrum/technical-information

  4.

    Crossroads benchmarks, micro-benchmarks, & asc code suite (2019). https://www.lanl.gov/projects/crossroads/benchmarks-performance-analysis.php

  5.

    Arslan S, Topcuoglu HR, Kandemir MT, Tosun O (2019) Scheduling opportunities for asymmetrically reliable caches. J Parall Distrib Comput 126:134–151

  6.

    Bautista-Gomez L, Zyulkyarov F, Unsal O, McIntosh-Smith S (2016) Unprotected computing: a large-scale study of DRAM raw error rate on a supercomputer. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 645–655. IEEE

  7.

    Bergman K, Borkar S, Campbell D, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hill K, Hiller J, et al. (2008) Exascale computing study: technology challenges in achieving exascale systems. Air Force Research Laboratory. https://people.eecs.berkeley.edu/~yelick/papers/Exascale_final_report.pdf

  8.

    Biswas A, Soundararajan N, Mukherjee SS, Gurumurthi S (2009) Quantized AVF: A means of capturing vulnerability variations over small windows of time. In: IEEE workshop on silicon errors in logic - system effects

  9.

    Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley R (2002) An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Softw 28(2):135–151. https://doi.org/10.1145/567806.567807

  10.

    Chen Y, Chen P (2016) A software-based redundant execution programming model for transient fault detection and correction. In: 45th International Conference on Parallel Processing Workshops (ICPPW), pp. 66–71. https://doi.org/10.1109/ICPPW.2016.25

  11.

    Döbel B, Härtig H, Engel M (2012) Operating system support for redundant multithreading. In: Proceedings of the Tenth ACM International Conference on Embedded Software, EMSOFT’12, pp. 83–92. https://doi.org/10.1145/2380356.2380375

  12.

    Döbel B, Härtig H (2014) Can we put concurrency back into redundant multithreading? In: 2014 International Conference on Embedded Software (EMSOFT), pp. 1–10

  13.

    Fang B, Pattabiraman K, Ripeanu M, Gurumurthi S (2014) Evaluating the error resilience of parallel programs. In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 720–725

  14.

    Feng S, Gupta S, Ansari A, Mahlke S (2010) Shoestring: probabilistic soft error reliability on the cheap. In: Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp. 385–396. https://doi.org/10.1145/1736020.1736063

  15.

    Gomaa M, Scarbrough C, Vijaykumar TN, Pomeranz I (2003) Transient-fault recovery for chip multiprocessors. In: 30th Annual International Symposium on Computer Architecture, 2003. Proceedings., pp. 98–109

  16.

    Gomaa MA, Vijaykumar TN (2005) Opportunistic transient-fault detection. In: 32nd International Symposium on Computer Architecture (ISCA’05), pp. 172–183. https://doi.org/10.1109/ISCA.2005.38

  17.

    Guo L, Li D, Laguna I, Schulz M (2018) FlipTracker: Understanding natural error resilience in HPC applications. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 94–107. https://doi.org/10.1109/SC.2018.00011

  18.

    Hari SKS, Adve SV, Naeimi H (2012) Low-cost program-level detectors for reducing silent data corruptions. In: IEEE/IFIP international conference on dependable systems and networks (DSN 2012), pp. 1–12

  19.

    Hazucha P, Karnik T, Maiz J, Walstra S, Bloechel B, Tschanz J, Dermer G, Hareland S, Armstrong P, Borkar S (2003) Neutron soft error rate measurements in a 90-nm CMOS process and scaling trends in SRAM from 0.25-μm to 90-nm generation. In: IEEE International Electron Devices Meeting 2003, pp. 21.5.1–21.5.4. https://doi.org/10.1109/IEDM.2003.1269336

  20.

    Hukerikar S, Diniz PC, Lucas RF, Teranishi K (2014) Opportunistic application-level fault detection through adaptive redundant multithreading. In: International Conference on High Performance Computing & Simulation (HPCS), pp. 243–250

  21.

    Hukerikar S, Lucas RF (2016) Rolex: resilience-oriented language extensions for extreme-scale systems. J Supercomput 72:4662–4695

  22.

    Hukerikar S, Teranishi K, Diniz PC, Lucas RF (2014) An evaluation of lazy fault detection based on adaptive redundant multithreading. In: IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6

  23.

    Hukerikar S, Teranishi K, Diniz PC, Lucas RF (2018) RedThreads: an interface for application-level fault detection/correction through adaptive redundant multithreading. Int J Parall Program 46(2):225–251. https://doi.org/10.1007/s10766-017-0492-3

  24.

    Kahng AB (2013) The ITRS design technology and system drivers roadmap: process and status. In: 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6

  25.

    Khudia DS, Mahlke S (2014) Harnessing soft computations for low-budget fault tolerance. In: 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 319–330

  26.

    Kolodziej SP, Aznaveh M, Bullock M, David J, Davis TA, Henderson M, Hu Y, Sandstrom R (2019) The SuiteSparse matrix collection website interface. J Open Source Softw 4(35):1244

  27.

    Koren I, Su SYH (1979) Reliability analysis of n-modular redundancy systems with intermittent and permanent faults. IEEE Trans Comput 28(7):514–520

  28.

    Leveugle R, Calvez A, Maistri P, Vanhauwaert P (2009) Statistical fault injection: quantified error and confidence. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’09, pp. 502–506

  29.

    Li T, Ambrose JA, Ragel R, Parameswaran S (2016) Processor design for soft errors: challenges and state of the art. ACM Comput Surv (CSUR) 49(3):1–44

  30.

    Lyons D (2000) Sun screen. https://www.forbes.com/forbes/2000/1113/6613068a/#2e22cb75276c. Forbes Magazine

  31.

    Mitropoulou K, Porpodas V, Jones TM (2016) COMET: Communication-optimised multi-threaded error-detection technique. In: International Conference on Compilers, Architectures, and Synthesis of Embedded Systems (CASES), pp. 1–10

  32.

    Mukherjee S (2008) Architecture design for soft errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA

  33.

    Mukherjee SS, Kontz M, Reinhardt SK (2002) Detailed design and evaluation of redundant multi-threading alternatives. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 99–110

  34.

    Olukotun K, Nayfeh BA, Hammond L, Wilson K, Chang K (1996) The case for a single-chip multiprocessor. In: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pp. 2–11. https://doi.org/10.1145/237090.237140

  35.

    Omar H, Shi Q, Ahmad M, Dogan H, Khan O (2018) Declarative resilience: a holistic soft-error resilient multicore architecture that trades off program accuracy for efficiency. ACM Trans Embed Comput Syst 17(4). https://doi.org/10.1145/3210559

  36.

    Oz I, Arslan S (2019) A survey on multithreading alternatives for soft error fault tolerance. ACM Comput Surv 52(2):1–38

  37.

    Parashar A, Sivasubramaniam A, Gurumurthi S (2006) Slick: slice-based locality exploitation for efficient redundant multithreading. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp. 95–105. https://doi.org/10.1145/1168857.1168870

  38.

    Pouchet LN (2016) PolyBench/C. https://web.cse.ohio-state.edu/~pouchet.2/software/polybench/

  39.

    Pouyan F, Azarpeyvand A, Safari S, Fakhraie SM (2015) Reliability-aware simultaneous multithreaded architecture using online architectural vulnerability factor estimation. IET Comput Digital Tech 9(2):124–133. https://doi.org/10.1049/iet-cdt.2013.0162

  40.

    Pouyan F, Azarpeyvand A, Safari S, Fakhraie SM (2016) Reliability aware throughput management of chip multi-processor architecture via thread migration. J Supercomput 72(4):1363–1380. https://doi.org/10.1007/s11227-016-1665-3

  41.

    Prvulovic M, Zhang Z, Torrellas J (2002) ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 111–122

  42.

    Reddy VK, Rotenberg E, Parthasarathy S (2006) Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp. 83–94. https://doi.org/10.1145/1168857.1168869

  43.

    Reinhardt SK, Mukherjee SS (2000) Transient fault detection via simultaneous multithreading. In: Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201), pp. 25–36

  44.

    Rosa Fd, Bandeira V, Reis R, Ost L (2018) Extensive evaluation of programming models and ISAs impact on multicore soft error reliability. In: Proceedings of the 55th Annual Design Automation Conference, pp. 1–6

  45.

    Sangchoolie B, Pattabiraman K, Karlsson J (2017) One bit is (not) enough: an empirical study of the impact of single and multiple bit-flip errors. In: 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 97–108

  46.

    Sao PK (2018) Scalable and resilient sparse linear solvers. Ph.D. thesis, Georgia Institute of Technology

  47.

    Siddiqua T, Gurumurthi S (2009) Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors. In: IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 1–12. https://doi.org/10.1109/MASCOT.2009.5363142

  48.

    Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B et al (2014) Addressing failures in exascale computing. Int J High Perform Comput Appl 28(2):129–173

  49.

    So H, Didehban M, Ko Y, Shrivastava A, Lee K (2018) Expert: Effective and flexible error protection by redundant multithreading. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 533–538

  50.

    So H, Didehban M, Shrivastava A, Lee K (2019) A software-level redundant multithreading for soft/hard error detection and recovery. In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1559–1562

  51.

    Sorin DJ, Martin MMK, Hill MD, Wood DA (2002) Safetynet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 123–134

  52.

    Sridharan V, DeBardeleben N, Blanchard S, Ferreira KB, Stearley J, Shalf J, Gurumurthi S (2015) Memory errors in modern systems: the good, the bad, and the ugly. ACM SIGARCH Comput Archit News 43(1):297–310

  53.

    Subasi O, Yalcin G, Zyulkyarov F, Unsal O, Labarta J (2017) Designing and modelling selective replication for fault-tolerant HPC applications. In: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 452–457. https://doi.org/10.1109/CCGRID.2017.40

  54.

    Sundaramoorthy K, Purser Z, Rotenberg E (2000) Slipstream processors: Improving both performance and fault tolerance. In: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pp. 257–268. https://doi.org/10.1145/378993.379247

  55.

    Swift GM, Guertin SM (2000) In-flight observations of multiple-bit upset in DRAMs. IEEE Trans Nucl Sci 47(6):2386–2391. https://doi.org/10.1109/23.903781

  56.

    Vijaykumar TN, Pomeranz I, Cheng K (2002) Transient-fault recovery using simultaneous multithreading. In: Proceedings 29th Annual International Symposium on Computer Architecture, pp. 87–98

  57.

    Von Neumann J (1956) Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Stud 34:43–98. https://doi.org/10.1515/9781400882618-003

  58.

    Wang C, Kim H, Wu Y, Ying V (2007) Compiler-managed software-based redundant multi-threading for transient fault detection. In: International Symposium on Code Generation and Optimization (CGO), pp. 244–258. https://doi.org/10.1109/CGO.2007.7

  59.

    Wei J, Thomas A, Li G, Pattabiraman K (2014) Quantifying the accuracy of high-level fault injection techniques for hardware faults. In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 375–382. https://doi.org/10.1109/DSN.2014.2

  60.

    Wang Y, Huang Y, Vo K, Chung P, Kintala C (1995) Checkpointing and its applications. In: Twenty-fifth International Symposium on fault-tolerant Computing. Digest of papers, pp. 22–31

  61.

    Zhang Y, Lee JW, Johnson NP, August DI (2010) Daft: decoupled acyclic fault tolerance. In: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 87–97

Acknowledgements

This work was completed while the first author, Sanem Arslan, was a visiting researcher at the Barcelona Supercomputing Center, Barcelona, Spain. Sanem Arslan received financial support from the Scientific and Technological Research Council of Turkey (TUBITAK) under the BIDEB 2219 program during this work.

Author information

Corresponding author

Correspondence to Sanem Arslan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Arslan, S., Unsal, O. Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading. J Supercomput (2021). https://doi.org/10.1007/s11227-021-03804-6

Keywords

  • Redundant multithreading
  • Fault tolerance
  • Soft error reliability
  • Software reliability