Quantifying the impact of data replication on error propagation

Cluster Computing

Abstract

Various technological developments in microprocessors make modern computing systems more vulnerable to soft errors than in the past, and fault tolerance techniques are consequently becoming increasingly important across application domains. Although fault tolerance methods can achieve high levels of reliability, they can also introduce significant performance, energy, and memory overheads; these overheads can be reduced by applying such techniques selectively rather than indiscriminately. Data replication prevents error propagation across hardware components and application data structures by replicating the application program’s data. When using data replication, many factors must be taken into account, including which data structures or elements to replicate, how many times to replicate a given data element, and which threads to protect (in a multithreaded application). These and similar factors define what can be termed a “replication space”. This study defines a replication space and systematically explores protection techniques of various strengths, quantifying their impact on memory consumption, performance, and error propagation. Our experimental analysis reveals that different protection levels yield different outcomes depending on application specifics. In particular, data replication limits error propagation to a certain extent in multithreaded applications whose threads do not communicate or share much data, whereas errors can propagate quickly across threads in applications whose threads are more tightly coupled. Additionally, our results indicate that in certain cases where error propagation is already low, the effect of data replication on error propagation can be negligible.
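
The idea of a replication space can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ReplicationConfig`, `Replicated`, and `make_variable` are hypothetical, the values are assumed to be integers, and majority voting on read is only one common way to use replicated copies. One point in the space is described by how many copies each data structure gets and which threads are protected.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ReplicationConfig:
    """One hypothetical point in a replication space."""
    degree: dict = field(default_factory=dict)            # data name -> number of copies
    protected_threads: set = field(default_factory=set)   # thread ids that get replication


class Replicated:
    """Keeps redundant copies of an integer and votes on every read."""

    def __init__(self, value, copies):
        if copies < 1:
            raise ValueError("at least one copy is required")
        self._copies = [value] * copies

    def write(self, value):
        self._copies = [value] * len(self._copies)

    def flip_bit(self, copy_index, bit):
        """Simulate a soft error: flip one bit in one copy."""
        self._copies[copy_index] ^= 1 << bit

    def read(self):
        """Return (value, status); status is 'ok', 'corrected', or 'detected'."""
        value, votes = Counter(self._copies).most_common(1)[0]
        if votes == len(self._copies):
            return value, "ok"          # all copies agree (or there is only one copy)
        if votes > len(self._copies) / 2:
            self.write(value)           # strict majority: repair the faulty copies
            return value, "corrected"
        return None, "detected"         # mismatch without a majority


def make_variable(config, name, thread_id, value):
    """Replicate `value` only if the owning thread is protected."""
    copies = config.degree.get(name, 1) if thread_id in config.protected_threads else 1
    return Replicated(value, copies)
```

Under these assumptions, a single bit flip in an unprotected variable (one copy) is returned silently and can propagate, two copies allow the mismatch to be detected, and three copies allow it to be corrected, at the cost of proportionally more memory and extra work on every read.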

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. A set of rules that modifies the code according to the data replication technique without affecting its functionality.

Funding

This research was supported by The Scientific and Technological Research Council of Turkey (TUBITAK) with a research grant (Project Number: 118E715).

Author information

Corresponding author

Correspondence to Haluk Rahmi Topcuoglu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ozturk, Z., Topcuoglu, H.R. & Kandemir, M.T. Quantifying the impact of data replication on error propagation. Cluster Comput 26, 1985–1999 (2023). https://doi.org/10.1007/s10586-022-03726-9

  • DOI: https://doi.org/10.1007/s10586-022-03726-9
