Computing and Visualization in Science, Volume 18, Issue 2–3, pp 65–77

An error-resilient redundant subspace correction method



Due to the increasing complexity of supercomputers, hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. To improve the reliability of computing systems (that is, to increase the mean time to failure), considerable effort has been devoted to developing techniques to forecast, prevent, and recover from errors at different levels, including architecture, application, and algorithm. In this paper, we focus on algorithmic error-resilient iterative solvers and introduce a redundant subspace correction method. Using a general framework of redundant subspace corrections, we construct iterative methods that have the following properties: (1) they maintain convergence when an error occurs, assuming it is detectable; (2) they introduce low computational overhead when no error occurs; (3) they require only a small amount of point-to-point communication compared to traditional methods and maintain good load balance; (4) they improve the mean time to failure. Preliminary numerical experiments demonstrate the efficiency and effectiveness of the new subspace correction method. For simplicity, the main ideas of the proposed framework are demonstrated using Schwarz methods without a coarse space, which do not scale well in practice.
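To make property (1) concrete, the following is a minimal serial sketch in Python. It is not the authors' algorithm, whose details are not given in this abstract. It assumes a 1D Poisson matrix tridiag(-1, 2, -1), non-overlapping index blocks as subspaces, exact local solves, and no coarse space. The function names, the `backup` flag, and the failure model (one worker that is detectably down from a given sweep onward) are all hypothetical.

```python
# Toy serial model (an assumption-laden illustration, not the paper's algorithm).
# A = tridiag(-1, 2, -1); subspaces = non-overlapping index blocks; no coarse space.

def residual(u, f):
    """Return r = f - A u for A = tridiag(-1, 2, -1)."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right))
    return r


def solve_local(rhs):
    """Solve the local tridiag(-1, 2, -1) system exactly (Thomas algorithm)."""
    m = len(rhs)
    c = [0.0] * m  # eliminated super-diagonal
    d = [0.0] * m  # eliminated right-hand side
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def schwarz(f, blocks, iters, fail_block=None, fail_from=0, backup=True):
    """Additive Schwarz (block Jacobi) sweeps with one detectable worker failure.

    From sweep `fail_from` on, the primary worker of `fail_block` is down.
    With backup=True a hypothetical second worker holding a redundant copy of
    that subspace computes the correction; otherwise the correction is lost.
    """
    u = [0.0] * len(f)
    for k in range(iters):
        r = residual(u, f)
        new_u = list(u)
        for b, (lo, hi) in enumerate(blocks):
            primary_down = fail_block == b and k >= fail_from
            if primary_down and not backup:
                continue  # nobody can correct this subspace any more
            e = solve_local(r[lo:hi])  # done by the primary or by its backup
            for j in range(lo, hi):
                new_u[j] += e[j - lo]
        u = new_u
    return u


def max_error(u):
    """Max-norm error against the exact solution of A u = 1."""
    n = len(u)
    return max(abs(u[i] - (i + 1) * (n - i) / 2.0) for i in range(n))


n = 16
f = [1.0] * n
blocks = [(0, 4), (4, 8), (8, 12), (12, 16)]
err_clean = max_error(schwarz(f, blocks, 500))
err_backup = max_error(schwarz(f, blocks, 500, fail_block=1, fail_from=10))
err_lost = max_error(schwarz(f, blocks, 500, fail_block=1, fail_from=10, backup=False))
```

In this toy setting the run with a backup worker reproduces the failure-free iterate, while the run without one stalls with a large error on the abandoned block. A real implementation would also have to account for communication and load balance (properties (2) and (3)), which a serial simulation cannot show.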


Fault-tolerance · Error resilience · Subspace correction · Schwarz methods



Cui and Zhang are partially supported by the National Key Research and Development Program 2016YFB0201304, by China NSF Grants 91430215 and 91530323, and by the National Center for Mathematics and Interdisciplinary Sciences of the Chinese Academy of Sciences (NCMIS). Xu is partially supported by NSF DMS-0915153 and DOE DE-SC0006903.



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  2. Department of Mathematics, Pennsylvania State University, University Park, USA
