Recent Advances in Parallel Virtual Machine and Message Passing Interface
Volume 3666 of the series Lecture Notes in Computer Science, pp 67-75
Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
- Graham E. Fagg (Dept. of Computer Science, The University of Tennessee)
- Thara Angskun (Dept. of Computer Science, The University of Tennessee)
- George Bosilca (Dept. of Computer Science, The University of Tennessee)
- Jelena Pjesivac-Grbovic (Dept. of Computer Science, The University of Tennessee)
- Jack J. Dongarra (Dept. of Computer Science, The University of Tennessee)
Abstract
Fault Tolerant MPI (FT-MPI) [6] was designed to give applications different methods of handling process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability both on very large clusters and on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that are aimed at being both scalable and latency tolerant. Our conclusions show that the use of topology-aware collective communication together with distributed consensus algorithms produces the best results.
- Title
- Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
- Book Title
- Recent Advances in Parallel Virtual Machine and Message Passing Interface
- Book Subtitle
- 12th European PVM/MPI Users’ Group Meeting Sorrento, Italy, September 18-21, 2005. Proceedings
- Pages
- pp 67-75
- Copyright
- 2005
- DOI
- 10.1007/11557265_13
- Print ISBN
- 978-3-540-29009-4
- Online ISBN
- 978-3-540-31943-6
- Series Title
- Lecture Notes in Computer Science
- Series Volume
- 3666
- Series ISSN
- 0302-9743
- Publisher
- Springer Berlin Heidelberg
- Copyright Holder
- Springer-Verlag Berlin Heidelberg
- Editors
- Beniamino Di Martino (16)
- Dieter Kranzlmüller (17)
- Jack Dongarra (18)
- Editor Affiliations
- 16. Dipartimento di Ingegneria dell’Informazione, Second University of Naples - Italy
- 17. GUP, Institute of Graphics and Parallel Processing, Johannes Kepler University
- 18. Computer Science Department, University of Tennessee
- Authors
- Graham E. Fagg (19)
- Thara Angskun (19)
- George Bosilca (19)
- Jelena Pjesivac-Grbovic (19)
- Jack J. Dongarra (19)
- Author Affiliations
- 19. Dept. of Computer Science, The University of Tennessee, 1122 Volunteer Blvd., Suite 413, Knoxville, TN, 37996-3450, USA