An Automated Performance-Aware Approach to Reliability Transformations

  • Jacob Lidman
  • Sally A. McKee
  • Daniel J. Quinlan
  • Chunhua Liao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8805)


Soft errors are expected to increase as feature sizes shrink and the number of cores increases. Redundant execution can be used to cope with such errors. This paper deals with the problem of automatically finding the number of redundant executions needed to achieve a preset reliability threshold. Our method uses geometric programming to calculate the minimal reliability for each instruction while still ensuring that the reliability of the program satisfies a given threshold. We use this result to approximate an upper bound on the number of redundant instructions. Using this bound, we perform a limit study to find the implications of different redundant execution schemes. In particular, we observe that the overhead of higher redundancy has serious implications for reliability. We therefore create a scheme in which we perform more executions only if needed. Applying the results from our optimization improves reliability by up to 58.25%. We show that it is possible to achieve up to 8% better performance than Triple Modular Redundancy (TMR). We also show cases where our approach is insufficient.
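To make the notion of "the number of redundant executions needed to achieve a preset reliability threshold" concrete, here is a minimal sketch of the textbook N-modular redundancy calculation. It is not the geometric-programming method of the paper. It assumes independent executions that each succeed with the same probability `r`, a perfect majority voter, and no reliability cost from the extra execution time. The function names `nmr_reliability` and `min_redundancy` and the `max_n` cutoff are hypothetical and chosen for illustration only.

```python
from math import comb


def nmr_reliability(r: float, n: int) -> float:
    """Probability that a strict majority of n independent executions is correct.

    Assumes each execution is correct with probability r and that the
    voter itself never fails (both are simplifying assumptions).
    """
    majority = n // 2 + 1
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(majority, n + 1)
    )


def min_redundancy(r: float, threshold: float, max_n: int = 99):
    """Smallest odd n whose majority-vote reliability meets the threshold.

    Returns None if no odd n up to max_n is sufficient.
    """
    for n in range(1, max_n + 1, 2):
        if nmr_reliability(r, n) >= threshold:
            return n
    return None


if __name__ == "__main__":
    # Per-execution reliability 0.9, with two different thresholds.
    print(min_redundancy(0.9, 0.97))
    print(min_redundancy(0.9, 0.99))
```

Under these assumptions, triple redundancy (n = 3) gives 3r²(1 − r) + r³. When `r` is at or below 0.5, adding executions does not help in this simple model, and `min_redundancy` returns `None` for any threshold above `r`.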


High Performance Computing Fault Tolerance N-Modular Redundancy Reliability Optimization 





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jacob Lidman (1)
  • Sally A. McKee (1)
  • Daniel J. Quinlan (2)
  • Chunhua Liao (2)
  1. Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
  2. Lawrence Livermore National Laboratory, Livermore, USA
