FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study

  • David Dewolfs
  • Jan Broeckhove
  • Vaidy Sunderam
  • Graham E. Fagg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4192)

Abstract

There is growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative-domain interoperability and “pluggability” to FT-MPI. The latter feature allows us, using proxies, to transparently replace one vulnerable module, its name service, with fault-tolerant replacements. We present an algorithm for improving the performance of operations over the proxies. We evaluate its performance by comparing the original name service, OpenLDAP, and HDNS, a current Emory research project.

Keywords

FT-MPI, H2O, metacomputing, fault-tolerance, heterogeneity

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David Dewolfs (1)
  • Jan Broeckhove (1)
  • Vaidy Sunderam (1)
  • Graham E. Fagg (1)
  1. Depts. of Math and Computer Science of the University of Antwerp (Antwerp, Belgium), Emory University (Atlanta, GA, USA), and the University of Tennessee (Knoxville, TN, USA)
