FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study

  • David Dewolfs
  • Jan Broeckhove
  • Vaidy Sunderam
  • Graham E. Fagg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4192)


There is growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative-domain interoperability and “pluggability” to FT-MPI. The latter feature allows us, using proxies, to transparently replace one vulnerable module, the FT-MPI name service, with fault-tolerant replacements. We present an algorithm for improving the performance of operations over the proxies, and we evaluate it by comparing the original name service, OpenLDAP, and HDNS, a current Emory research project.


FT-MPI, H2O, metacomputing, fault-tolerance, heterogeneity





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David Dewolfs (1)
  • Jan Broeckhove (1)
  • Vaidy Sunderam (1)
  • Graham E. Fagg (1)
  1. Depts. of Math and Computer Science of the University of Antwerp (Antwerp, Belgium), Emory University (Atlanta, GA, USA), and the University of Tennessee (Knoxville, TN, USA)
