DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8664)


Compiler-based error detection methodologies replicate the instructions of the program and insert checks wherever they are needed. The checks evaluate code correctness and decide whether or not an error has occurred. The replicated instructions and the checks cause a large slowdown. In this work, we focus on reducing the error detection overhead and improving the system's performance without degrading fault coverage. DRIFT achieves this by decoupling the execution of the code (original and replicated) from the checks.
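The replicate-and-check idea described above can be pictured with a small sketch. This is an illustration only, not the paper's compiler pass: it works on Python values rather than machine instructions, and the names `checked_sum`, `FaultDetected`, and the fault-injection parameter `flip_at` are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when the original and replicated values disagree."""


def checked_sum(values, flip_at=None):
    """Running sum computed twice (original and replica), checked before each store.

    `flip_at` is a hypothetical fault-injection hook: at that index one bit of
    the original accumulator is flipped to mimic a transient error.
    """
    acc = 0        # original computation
    acc_dup = 0    # replicated computation
    out = []       # stands in for memory written by non-replicated stores
    for i, v in enumerate(values):
        acc = acc + v
        acc_dup = acc_dup + v        # replicated instruction
        if i == flip_at:
            acc ^= 1 << 3            # injected single-bit fault
        if acc != acc_dup:           # check: compare ...
            raise FaultDetected(i)   # ... and jump to the error handler
        out.append(acc)              # the store runs only after the check passes
    return out
```

In this sketch every store waits for its own check, which is the tightly synchronized pattern whose cost the abstract goes on to discuss.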

The checks are compare and jump instructions. The jumps sequentialize the code and prevent the compiler from performing aggressive instruction scheduling optimizations. We call this phenomenon basic-block fragmentation. DRIFT reduces the impact of basic-block fragmentation by breaking the synchronized execute-check-confirm-execute cycle. In this way, DRIFT generates scheduler-friendly code with more instruction-level parallelism (ILP). As a result, it reduces the performance overhead to 1.29× (on average) and outperforms the state of the art by up to 29.7% while retaining the same fault coverage. The evaluation was done on an Itanium 2 by running the Mediabench II and SPEC CPU2000 benchmark suites.
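Basic-block fragmentation can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not DRIFT's algorithm: it treats a program as a list of instruction names, ends a basic block at every jump, and models decoupling as deferring the checks of several instructions and emitting them as a group. The function names and the `group` parameter are hypothetical, and the model ignores constraints a real compiler must respect (for example, which instructions may safely run ahead of their checks).

```python
def split_basic_blocks(instrs):
    """Split an instruction list into basic blocks; each jump ends a block."""
    blocks, current = [], []
    for ins in instrs:
        current.append(ins)
        if ins.startswith("jmp"):
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def synchronized(n):
    """Execute-check-confirm-execute: each instruction is checked right away."""
    code = []
    for i in range(n):
        code += [f"op{i}", f"op{i}_dup", f"cmp{i}", f"jmp_err{i}"]
    return code


def decoupled(n, group):
    """Defer checks and emit them together after every `group` instructions."""
    code, pending = [], []
    for i in range(n):
        code += [f"op{i}", f"op{i}_dup"]
        pending.append(i)
        if len(pending) == group or i == n - 1:
            for j in pending:
                code += [f"cmp{j}", f"jmp_err{j}"]
            pending = []
    return code


def largest_block(instrs):
    """Size of the largest basic block, a rough proxy for scheduling freedom."""
    return max(len(b) for b in split_basic_blocks(instrs))
```

For eight instructions, `largest_block(synchronized(8))` is 4, while `largest_block(decoupled(8, 4))` is 10. Both streams contain the same 32 instructions and the same eight checks; only the placement of the checks differs, which gives a scheduler a larger straight-line region to work with.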


Keywords: Compiler error detection; Fault tolerance



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. School of Informatics, University of Edinburgh, Edinburgh, UK
  2. Intel Labs Braunschweig, Braunschweig, Germany
