DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 8664)

Abstract

Compiler-based error detection methodologies replicate the instructions of the program and insert checks wherever they are needed. The checks evaluate the correctness of the code and decide whether or not an error has occurred. The replicated instructions and the checks cause a large slowdown. In this work, we focus on reducing the error detection overhead and improving the system’s performance without degrading fault coverage. DRIFT achieves this by decoupling the execution of the code (original and replicated) from the checks.
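As a rough illustration of the replicate-and-check idea described above, the following sketch duplicates a small computation and compares the two results before the value is used. It is a source-level analogy only: real schemes operate on compiler-generated instructions, and the names `FaultDetected`, `check`, `scaled_sum_protected`, and the `inject_fault` flag are hypothetical and do not come from the paper.

```python
class FaultDetected(Exception):
    """Raised when the original and replicated values disagree."""


def check(original, replica):
    # Models a compare-and-jump: branch to an error handler on a mismatch.
    if original != replica:
        raise FaultDetected(f"mismatch: {original} != {replica}")


def scaled_sum_protected(a, b, scale, inject_fault=False):
    # Original instruction stream.
    t = a + b
    r = t * scale

    # Replicated instruction stream: same inputs, separate "registers".
    t_dup = a + b
    r_dup = t_dup * scale

    if inject_fault:
        # Hypothetical single-bit flip in the original result,
        # used here only to exercise the check.
        r ^= 1

    # Check before the value leaves the protected code.
    check(r, r_dup)
    return r
```

Every protected value costs one extra computation plus one compare-and-jump, which is where the slowdown mentioned in the abstract comes from.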

The checks are compare and jump instructions. The jumps sequentialize the code and prevent the compiler from performing aggressive instruction scheduling optimizations. We call this phenomenon basic-block fragmentation. DRIFT reduces the impact of basic-block fragmentation by breaking the synchronized execute-check-confirm-execute cycle. In this way, DRIFT generates scheduler-friendly code with more ILP. As a result, it reduces the performance overhead to 1.29× (on average) and outperforms the state of the art by up to 29.7% while retaining the same fault coverage. The evaluation was done on an Itanium 2 by running the Mediabench II and SPEC2000 benchmark suites.
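The difference between a synchronized and a decoupled check schedule can be pictured with a small simulation. This is a sketch under stated assumptions, not the scheme from the paper: it assumes that decoupling is modeled by accumulating mismatches without branching and testing them once per group, and it uses the number of compare-and-jump branches as a crude proxy for basic-block fragmentation. The function names, `group_size`, and `faulty_index` are hypothetical.

```python
def run_synchronized(inputs, faulty_index=None):
    """Execute, check, confirm, execute: one branch per result."""
    committed, branches = [], 0
    for i, x in enumerate(inputs):
        r, r_dup = x * x + 1, x * x + 1   # original and replica
        if i == faulty_index:
            r ^= 1                        # hypothetical bit flip
        branches += 1                     # compare-and-jump per value
        if r != r_dup:
            return committed, branches, True
        committed.append(r)
    return committed, branches, False


def run_decoupled(inputs, group_size=4, faulty_index=None):
    """Assumed model: run a group of computations, then check once."""
    committed, branches = [], 0
    for start in range(0, len(inputs), group_size):
        pending, mismatch = [], 0
        for i in range(start, min(start + group_size, len(inputs))):
            x = inputs[i]
            r, r_dup = x * x + 1, x * x + 1
            if i == faulty_index:
                r ^= 1
            mismatch |= r ^ r_dup         # compare without branching
            pending.append(r)
        branches += 1                     # single jump for the group
        if mismatch:
            return committed, branches, True
        committed.extend(pending)
    return committed, branches, False
```

With eight inputs and `group_size=4`, the synchronized variant executes eight check branches and the grouped variant two, while both still report an injected fault. Fewer branches leave longer straight-line regions for an instruction scheduler to work with, which is the effect the abstract attributes to breaking the execute-check-confirm-execute cycle.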

Keywords

  • Compiler error detection
  • Fault tolerance

This work was supported in part by the EC under grant ERA 249059 (FP7).



Author information

Corresponding author

Correspondence to Konstantina Mitropoulou.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mitropoulou, K., Porpodas, V., Cintra, M. (2014). DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_13

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5

  • eBook Packages: Computer Science, Computer Science (R0)