International Journal of Parallel Programming, Volume 40, Issue 1, pp 118–140

DAFT: Decoupled Acyclic Fault Tolerance

  • Yun Zhang
  • Jae W. Lee
  • Nick P. Johnson
  • David I. August


Higher transistor counts, lower voltage levels, and reduced noise margins increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such faults, but software techniques are more appealing for their low cost and flexibility. Recent software proposals have not achieved widespread acceptance because they increase register pressure, double memory usage, or are too slow in the absence of hardware extensions. This paper presents DAFT, a fast, safe, and memory-efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Evaluation results demonstrate that speculation allows DAFT to improve the performance of software redundant multithreading by 2.17× with no degradation of fault coverage.
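
The general idea of redundant multithreading with detection off the critical path can be illustrated with a toy sketch. This is not the DAFT implementation, which is a compiler transformation in LLVM. The function `run_redundant`, its `fault_at` parameter, the `compute` kernel, and the one-way queue are all hypothetical names and simplifications chosen for illustration. A leading thread computes results and continues without waiting, while a trailing thread recomputes each value and compares it.

```python
import queue
import threading


def run_redundant(data, fault_at=None):
    """Toy redundant execution: a leading thread computes, a trailing thread checks.

    fault_at is a hypothetical parameter: the index at which a single bit
    is flipped in the leading thread's result to simulate a transient fault.
    Returns (results, detected), where detected lists the faulty indices.
    """
    channel = queue.Queue()  # one-way channel: leading -> trailing (assumption)
    results = []
    detected = []

    def compute(x):
        # Stand-in for the replicated computation.
        return x * x + 1

    def leading():
        for i, x in enumerate(data):
            y = compute(x)
            if i == fault_at:
                y ^= 1 << 3         # simulated single-bit transient fault
            results.append(y)       # speculate that y is correct and move on
            channel.put((i, x, y))  # forward the value that needs checking
        channel.put(None)           # end-of-stream marker

    def trailing():
        while True:
            item = channel.get()
            if item is None:
                break
            i, x, y = item
            if compute(x) != y:     # redundant computation and comparison
                detected.append(i)

    t_lead = threading.Thread(target=leading)
    t_trail = threading.Thread(target=trailing)
    t_lead.start()
    t_trail.start()
    t_lead.join()
    t_trail.join()
    return results, detected
```

In this sketch the leading thread never blocks on the checker, so a fault is reported after the fact rather than prevented. A real system would also have to stop faulty values from escaping through irreversible operations such as I/O, which this toy does not model.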


Keywords: Fault tolerance, Compiler, Speculation




Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Yun Zhang (1)
  • Jae W. Lee (1)
  • Nick P. Johnson (1)
  • David I. August (1)

  1. Department of Computer Science, Princeton University, Princeton, USA
