The Resiliency of Multilevel Methods on Next-Generation Computing Platforms: Probabilistic Model and Its Analysis

Part of the Advances in Mechanics and Mathematics book series (AMMA, volume 41)


The reduced reliability of next-generation exascale systems means that the resiliency properties of a numerical algorithm will become an important factor both in the choice of algorithm and in its analysis. The multigrid algorithm is the workhorse for the distributed solution of linear systems, but little is known about its resiliency properties and convergence behavior in a fault-prone environment. In the current work, we propose a probabilistic model in which the effect of faults is represented by random diagonal matrices. We summarize the results of the theoretical analysis of this model for the rate of convergence of fault-prone multigrid methods, which show that the standard multigrid method will not be resilient. Finally, we present a modification of the standard multigrid algorithm that will be resilient.
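As an informal illustration of what a fault model built from random diagonal matrices can look like, the sketch below applies a random diagonal matrix X to a vector v. The abstract does not specify the distribution of the diagonal entries; the choice here (each entry is independently 0 with probability q and 1 otherwise, so that a fault wipes out one component) is an assumption made for illustration only, and the names `sample_fault_diagonal`, `apply_faults`, and `q` are hypothetical.

```python
import random


def sample_fault_diagonal(n, q, rng):
    """Return the diagonal of a random n-by-n diagonal matrix X.

    Assumption (not taken from the abstract): entries are independent,
    equal to 0 with probability q (a fault) and 1 otherwise.
    """
    return [0.0 if rng.random() < q else 1.0 for _ in range(n)]


def apply_faults(v, q, rng):
    """Return X v, the fault-affected version of the vector v."""
    diagonal = sample_fault_diagonal(len(v), q, rng)
    return [x_i * v_i for x_i, v_i in zip(diagonal, v)]


# Usage: corrupt a small vector with a 25% per-component fault probability.
rng = random.Random(0)
v = [1.0, 2.0, 3.0, 4.0]
faulty_v = apply_faults(v, q=0.25, rng=rng)
print(faulty_v)
```

In a fault-prone iteration, an operation such as a smoothing step or a grid transfer would be followed by a multiplication of this kind, with a fresh random diagonal drawn each time; how the chapter actually places these matrices within the multigrid cycle is not stated in the abstract.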



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Division of Applied Mathematics, Brown University, Providence, USA
  2. Center for Computing Research, Sandia National Laboratories, Albuquerque, USA
