Abstract
Failure free execution will become rare in the future exascale computers. Thus, fault tolerance is now an active field of research. In this paper, we study the impact of decomposing an application in much more parallelism that the physical parallelism on the rollback step of fault tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance workload after failure without the need of spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant factor of over-decomposition. With over-decomposition, restart execution on the remaining nodes after failures shows very good performance compared to classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42 %. We also consider a partial restart protocol to reduce the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases and thanks to over-decomposition, this partial restart time can represent only 54 % of the global restart time.
Keywords
- parallel computing
- checkpoint/rollback
- over-decomposition
- global restart
- partial restart
Download conference paper PDF
References
Badia, R.M., Herrero, J.R., Labarta, J., Pérez, J.M., Quintana-Ortí, E.S., Quintana-Ortí, G.: Parallelizing dense and banded linear algebra libraries using smpss. Concurr. Comput. : Pract. Exper. (2009)
Besseron, X., Gautier, T.: Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications. In: MCO 2008 (2008)
Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. Parallel and Distributed Computing (1996)
Bongo, L.A., Vinter, B., Anshus, O.J., Larsen, T., Bjorndalen, J.M.: Using overdecomposition to overlap communication latencies with computation and take advantage of smt processors. In: ICPP Workshops (2006)
Bouteiller, A., Hérault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V project: a multiprotocol automatic fault tolerant MPI. High Performance Computing Applications (2006)
Chakravorty, S., Kale, L.V.: A fault tolerant protocol for massively parallel systems. In: IPDPS (2004)
Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (1985)
Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (2002)
Galilée, F., Roch, J.L., Cavalheiro, G., Doreille, M.: Athapascan-1: On-line building data flow graph in a parallel language. In: PACT 1998 (1998)
Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent checkpoint/restart for mpi programs over infiniband. In: ICPP 2006 (2006)
Gautier, T., Besseron, X., Pigeon, L.: Kaapi: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: PASCO 2007 (2007)
Guermouche, A., Ropars, T., Brunet, E., Snir, M., Cappello, F.: Uncoordinated checkpointing without domino effect for send-deterministic message passing applications. In: IPDPS (2011)
Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS (2007)
Jafar, S., Krings, A.W., Gautier, T.: Flexible rollback recovery in dynamic heterogeneous grid computing. IEEE Transactions on Dependable and Secure Computing (2008)
Jafar, S., Pigeon, L., Gautier, T., Roch, J.L.: Self-adaptation of parallel applications in heterogeneous and dynamic architectures. In: ICTTA 2006 (2006)
Jose, J., Luo, M., Sur, S., Panda, D.K.: Unifying UPC and MPI Runtimes: Experience with MVAPICH. In: PGAS 2010 (2010)
Kale, L.V., Mendes, C., Meneses, E.: Adaptive runtime support for fault tolerance. Talk at Los Alamos Computer Science Symposium 2009 (2009)
Kale, L.V., Zheng, G.: Charm++ and AMPI: Adaptive runtime strategies via migratable objects. In: Advanced Computational Infrastructures for Parallel and Distributed Applications. Wiley-Interscience (2009)
Naik, V.K., Setia, S.K., Squillante, M.S.: Processor allocation in multiprogrammed distributed-memory parallel computer systems. Parallel Distributed Computing (1997)
Rabenseifner, R., Hager, G., Jost, G.: Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In: PDP 2009 (2009)
Song, F., YarKhan, A., Dongarra, J.: Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In: SC 2009 (2009)
Tamir, Y., Séquin, C.H.: Error recovery in multicomputers using global checkpoints. In: ICPP 1984 (1984)
Zheng, G., Shi, L., Kale, L.V.: FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. Cluster Computing (2004)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Besseron, X., Gautier, T. (2012). Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_36
Download citation
DOI: https://doi.org/10.1007/978-3-642-29740-3_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer ScienceComputer Science (R0)
