Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications

  • Eduardo Berrocal (email author)
  • Leonardo Bautista-Gomez
  • Sheng Di
  • Zhiling Lan
  • Franck Cappello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC has the potential to influence important scientific results, leading scientists to wrong conclusions. In previous work, our detector was able to detect SDC in HPC applications to a certain level by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. Accurate predictions allow us to detect corruptions when data values are far “enough” from them. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which our lightweight data-analytic detectors would perform poorly. Our results indicate that our new approach can protect the MPI applications analyzed with 49–53% less overhead than that of full duplication with similar detection recall.
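
The two ideas in the abstract, flagging values that fall too far from a prediction and replicating only the processes whose data is hard to predict, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' detector: the linear-extrapolation predictor, the fixed tolerance `theta`, the ranking by mean prediction error, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def predict_next(history: Sequence[float]) -> float:
    """Linear extrapolation from the last two values (assumed predictor)."""
    if len(history) < 2:
        return history[-1]
    return 2.0 * history[-1] - history[-2]


def is_suspicious(history: Sequence[float], observed: float, theta: float) -> bool:
    """Flag an observed value lying farther than theta from its prediction."""
    return abs(observed - predict_next(history)) > theta


def mean_prediction_error(series: Sequence[float]) -> float:
    """Average absolute one-step prediction error over a time series."""
    errors = [
        abs(series[i] - predict_next(series[:i]))
        for i in range(2, len(series))
    ]
    return sum(errors) / len(errors) if errors else 0.0


def choose_replicas(per_rank: Dict[int, List[float]], budget: int) -> List[int]:
    """Pick the `budget` ranks whose data is hardest to predict (assumed rule)."""
    ranked = sorted(
        per_rank,
        key=lambda rank: mean_prediction_error(per_rank[rank]),
        reverse=True,
    )
    return sorted(ranked[:budget])
```

In this sketch, ranks with smooth data would rely on the cheap prediction check alone, while the ranks returned by `choose_replicas` would be duplicated and compared directly. How the variability of each process is actually measured, and how often the selection is revisited, is not specified in the abstract.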


Keywords: Silent data corruption detection; Partial replication; Data analysis; HPC applications



This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program, under Contract DE-AC02-06CH11357, and by the ANR RESCUE and the INRIA-Illinois-ANL-BSC-JSC-Riken Joint Laboratory on Extreme Scale Computing. The work at the Illinois Institute of Technology is supported in part by U.S. National Science Foundation grants CNS-1320125 and CCF-1422009.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Eduardo Berrocal (1), email author
  • Leonardo Bautista-Gomez (2)
  • Sheng Di (2)
  • Zhiling Lan (1)
  • Franck Cappello (2)
  1. Illinois Institute of Technology, Chicago, USA
  2. Argonne National Laboratory, Lemont, USA
