Multi-Fault Tolerance for Cartesian Data Distributions

International Journal of Parallel Programming

Abstract

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
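The parity idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a two-dimensional grid of equally sized blocks, one sum-based parity block per row and per column, and parity blocks that survive the failure. The function names (`block_sum`, `subtract`, `make_parities`, `recover`) are hypothetical and chosen for illustration. This is the generic checksum technique, not the article's matching-based placement algorithm.

```python
def block_sum(blocks):
    """Element-wise sum of equally sized blocks (each a list of numbers)."""
    return [sum(values) for values in zip(*blocks)]


def subtract(parity, survivors):
    """Remove the surviving blocks' contribution from a parity block."""
    result = list(parity)
    for block in survivors:
        result = [p - b for p, b in zip(result, block)]
    return result


def make_parities(grid):
    """Return (row_parities, col_parities) for a 2-D grid of blocks."""
    row_par = [block_sum(row) for row in grid]
    col_par = [block_sum(col) for col in zip(*grid)]
    return row_par, col_par


def recover(grid, row_par, col_par):
    """Rebuild blocks marked None in place; return True if none remain lost.

    Assumption: the parity blocks themselves survived the failure.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    progress = True
    while progress:
        progress = False
        # A row with exactly one lost block can be repaired from its parity.
        for r in range(n_rows):
            lost = [c for c in range(n_cols) if grid[r][c] is None]
            if len(lost) == 1:
                survivors = [b for b in grid[r] if b is not None]
                grid[r][lost[0]] = subtract(row_par[r], survivors)
                progress = True
        # Likewise for a column with exactly one lost block.
        for c in range(n_cols):
            lost = [r for r in range(n_rows) if grid[r][c] is None]
            if len(lost) == 1:
                survivors = [grid[r][c] for r in range(n_rows)
                             if grid[r][c] is not None]
                grid[lost[0]][c] = subtract(col_par[c], survivors)
                progress = True
    return all(block is not None for row in grid for block in row)


original = [[[1, 2], [3, 4]],
            [[5, 6], [7, 8]]]
row_par, col_par = make_parities(original)

# Two blocks in the same row fail together (a correlated failure).
damaged = [[None, None],
           [[5, 6], [7, 8]]]
ok = recover(damaged, row_par, col_par)
print(ok, damaged == original)  # prints: True True
```

In this sketch the row parity alone cannot repair two losses in one row, but each affected column has only one loss, so the column parities recover both blocks. If every block in a rectangle of rows and columns is lost at once, no row or column has a single loss and recovery fails. This suggests why it matters where blocks that may fail together are placed.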



Author information

Corresponding author

Correspondence to Sriram Krishnamoorthy.

About this article

Cite this article

Ali, N., Krishnamoorthy, S., Halappanavar, M. et al. Multi-Fault Tolerance for Cartesian Data Distributions. Int J Parallel Prog 41, 469–493 (2013). https://doi.org/10.1007/s10766-012-0218-5

  • DOI: https://doi.org/10.1007/s10766-012-0218-5
