Abstract
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
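To make the idea of storing parities along the dimensions of a matrix concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small grid of integer "blocks", uses plain sums as the parity function, and marks a lost block with `None`. The names `compute_parities` and `recover` are hypothetical and chosen only for illustration. It does not model generalized Cartesian distributions, correlated failures, or matching-based parity placement.

```python
def compute_parities(grid):
    """Return (row_parity, col_parity): block sums along each dimension."""
    row_parity = [sum(row) for row in grid]
    col_parity = [sum(col) for col in zip(*grid)]
    return row_parity, col_parity


def recover(grid, row_parity, col_parity):
    """Rebuild lost blocks (None) in place; return True if all were recovered."""
    n_rows, n_cols = len(grid), len(grid[0])
    progress = True
    while progress:
        progress = False
        # A row with exactly one lost block is repaired from its row parity.
        for i in range(n_rows):
            lost = [j for j in range(n_cols) if grid[i][j] is None]
            if len(lost) == 1:
                survivors = sum(v for v in grid[i] if v is not None)
                grid[i][lost[0]] = row_parity[i] - survivors
                progress = True
        # A column with exactly one lost block is repaired from its column parity.
        for j in range(n_cols):
            lost = [i for i in range(n_rows) if grid[i][j] is None]
            if len(lost) == 1:
                survivors = sum(
                    grid[k][j] for k in range(n_rows) if grid[k][j] is not None
                )
                grid[lost[0]][j] = col_parity[j] - survivors
                progress = True
    return all(v is not None for row in grid for v in row)


# Illustrative data (assumed, not from the article).
original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_parity, col_parity = compute_parities(original)

# Three simultaneous faults, two of them in the same row.
damaged = [row[:] for row in original]
damaged[0][0] = None
damaged[0][2] = None
damaged[1][1] = None
ok = recover(damaged, row_parity, col_parity)

# Four faults at the corners of a rectangle defeat this simple scheme.
square = [row[:] for row in original]
for i, j in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    square[i][j] = None
square_ok = recover(square, row_parity, col_parity)
```

In this sketch, two faults in the same row are still recoverable because each lost block is the only loss in its column. The rectangle pattern is not, since every affected row and column has two losses. This shows why the placement of data and parity blocks relative to expected failures matters.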
Cite this article
Ali, N., Krishnamoorthy, S., Halappanavar, M. et al. Multi-Fault Tolerance for Cartesian Data Distributions. Int J Parallel Prog 41, 469–493 (2013). https://doi.org/10.1007/s10766-012-0218-5
DOI: https://doi.org/10.1007/s10766-012-0218-5