RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.; Lucas, Robert F.

doi:10.1007/s10766-017-0492-3

Published: 11 February 2017

Volume 46, pages 225–251 (2018)

International Journal of Parallel Programming

Saurabh Hukerikar¹ (ORCID: orcid.org/0000-0002-2612-2001), Keita Teranishi², Pedro C. Diniz¹ & Robert F. Lucas¹

Abstract

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.

In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application’s resilience coverage and the associated performance overhead of redundant computation.
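
The general idea of scoped, switchable thread-level redundancy can be illustrated with a minimal sketch. This is not the RedThreads syntax, compiler support, or runtime, none of which are shown in this abstract; the names `run_redundant`, `make_faulty`, and `MismatchError` are hypothetical and chosen only for illustration. The sketch assumes that results are hashable and that the replicated function has no side effects.

```python
import threading
from collections import Counter


class MismatchError(RuntimeError):
    """Raised when replicas disagree and no strict majority exists."""


def run_redundant(fn, args=(), replicas=2, enabled=True):
    """Run fn(*args) once, or in `replicas` threads with output comparison.

    Hypothetical helper: `enabled` stands in for a runtime decision to
    switch redundancy on or off for this scoped computation.
    """
    if not enabled or replicas < 2:
        return fn(*args)  # redundancy off: plain, unchecked execution

    results = [None] * replicas

    def worker(i):
        results[i] = fn(*args)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    value, votes = Counter(results).most_common(1)[0]
    if votes > replicas // 2 and (votes == replicas or replicas > 2):
        return value  # unanimous, or corrected by strict majority vote
    raise MismatchError(f"replicas disagree: {results}")


def make_faulty(fn, bad_call):
    """Wrap fn so that its `bad_call`-th invocation returns a corrupted value."""
    lock = threading.Lock()
    count = [0]

    def wrapped(*args):
        with lock:
            count[0] += 1
            n = count[0]
        out = fn(*args)
        return out + 1 if n == bad_call else out  # simulated fault on one call

    return wrapped
```

With two replicas a disagreement can only be detected, so the helper raises `MismatchError`; with three replicas a single corrupted result is outvoted. Passing `enabled=False` skips replication entirely, which removes the overhead but lets the simulated fault go unnoticed.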

Author information

Authors and Affiliations

Information Sciences Institute, University of Southern California, 4676 Admiralty Way Suite 1001, Marina del Rey, CA, 90292, USA
Saurabh Hukerikar, Pedro C. Diniz & Robert F. Lucas
Sandia National Laboratories, 7011 East Avenue, Livermore, CA, 94551, USA
Keita Teranishi

Corresponding author

Correspondence to Saurabh Hukerikar.

Additional information

The authors would like to acknowledge the support for this work provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research under Award Number DE-SC0006844. Partial support for this work was also provided by the US Army Research Office (Award W911NF-13-1-0219). Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

About this article

Cite this article

Hukerikar, S., Teranishi, K., Diniz, P.C. et al. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading. Int J Parallel Prog 46, 225–251 (2018). https://doi.org/10.1007/s10766-017-0492-3

Received: 07 June 2016
Accepted: 28 January 2017
Published: 11 February 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s10766-017-0492-3