Multi-Fault Tolerance for Cartesian Data Distributions

Ali, Nawab; Krishnamoorthy, Sriram; Halappanavar, Mahantesh; Daily, Jeff

doi:10.1007/s10766-012-0218-5

Published: 02 November 2012

Volume 41, pages 469–493, (2013)

International Journal of Parallel Programming

Abstract

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
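
As a rough illustration of the colocation idea described above, the following sketch places one parity block per row of a block grid on a processor that owns no data block in that row, using a simple augmenting-path bipartite matching. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `place_parities`, the `owner` grid layout, and the restriction to one dimension with one parity block per processor are all hypothetical simplifications introduced here.

```python
from typing import Dict, List, Optional, Set


def place_parities(owner: List[List[int]], num_procs: int) -> Optional[Dict[int, int]]:
    """Assign each row's parity block to a distinct processor that holds
    no data block of that row. Returns {row: processor}, or None if no
    such assignment exists.

    owner[r][c] is the (hypothetical) processor id holding data block (r, c).
    """
    # Candidate hosts for row r: processors that own none of row r's data blocks.
    candidates: List[List[int]] = []
    for row in owner:
        owners_in_row = set(row)
        candidates.append([p for p in range(num_procs) if p not in owners_in_row])

    host_of: Dict[int, int] = {}  # processor -> row whose parity it hosts

    def try_assign(r: int, seen: Set[int]) -> bool:
        # Augmenting-path search: take a free candidate, or evict a row
        # that can itself be moved to another candidate.
        for p in candidates[r]:
            if p in seen:
                continue
            seen.add(p)
            if p not in host_of or try_assign(host_of[p], seen):
                host_of[p] = r
                return True
        return False

    for r in range(len(owner)):
        if not try_assign(r, set()):
            return None  # some row has no admissible parity host
    return {r: p for p, r in host_of.items()}


if __name__ == "__main__":
    # 3x3 block grid distributed over 4 processors (illustrative layout).
    grid = [[0, 1, 2],
            [1, 2, 3],
            [2, 3, 0]]
    print(place_parities(grid, 4))
```

Because each processor hosts at most one parity block in this sketch, the extra memory is spread evenly, and losing any single processor removes at most one block (data or parity) from each row, assuming the data blocks of a row sit on distinct processors. Handling both dimensions, correlated failure groups, and multiple simultaneous faults would need additional constraints that this example does not model.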

Author information

Authors and Affiliations

Pacific Northwest National Laboratory, Richland, WA, 99352, USA
Nawab Ali, Sriram Krishnamoorthy, Mahantesh Halappanavar & Jeff Daily

Corresponding author

Correspondence to Sriram Krishnamoorthy.

About this article

Cite this article

Ali, N., Krishnamoorthy, S., Halappanavar, M. et al. Multi-Fault Tolerance for Cartesian Data Distributions. Int J Parallel Prog 41, 469–493 (2013). https://doi.org/10.1007/s10766-012-0218-5

Received: 01 September 2011
Accepted: 08 August 2012
Published: 02 November 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10766-012-0218-5
