An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI

  • Angelo Duarte
  • Dolores Rexachs
  • Emilio Luque
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4192)


Independence from special elements, transparency and scalability are key features required of fault tolerance schemes for modern clusters of computers. To meet these requirements we developed the RADIC architecture (Redundant Array of Distributed Independent Checkpoints). RADIC is based on a fully distributed array of processes that collaborate to form a distributed fault tolerance controller. This controller works without special, central or stable elements. RADIC carries out its fault tolerance activities transparently to the user application, using a message-log rollback-recovery protocol. Using the RADIC concepts we implemented a prototype, RADICMPI, which provides some standard MPI directives and includes all the functionality of RADIC. We tested RADICMPI in a real environment by injecting failures into nodes of the cluster and monitoring the behavior of the application. Our tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism.
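As a rough illustration of the general idea behind a message-log rollback-recovery protocol, the following minimal Python sketch rolls a failed process back to its last checkpoint and replays the messages logged since then. It is not the RADIC or RADICMPI implementation: the `Backup` and `Process` classes, the `recover` function, and the integer "state" are hypothetical names chosen for this example, and the sketch assumes that the checkpoint and the log are kept somewhere that survives the failure of the process.

```python
class Backup:
    """Hypothetical holder of a process's checkpoint and message log.

    Assumption: this object lives somewhere that survives the failure
    of the process it protects (for example, on a different node).
    """

    def __init__(self):
        self.checkpoint = 0   # last saved process state
        self.log = []         # messages received since that checkpoint


class Process:
    """Toy process whose state is the running sum of received messages."""

    def __init__(self, backup, state=0):
        self.backup = backup
        self.state = state

    def receive(self, msg):
        # Log the message before delivering it, so that it can be
        # replayed if this process fails later.
        self.backup.log.append(msg)
        self.state += msg

    def take_checkpoint(self):
        # Save the current state; messages logged before the checkpoint
        # are no longer needed for recovery.
        self.backup.checkpoint = self.state
        self.backup.log.clear()


def recover(backup):
    """Roll back to the last checkpoint, then replay the logged messages."""
    restored = Process(backup, state=backup.checkpoint)
    for msg in list(backup.log):
        restored.state += msg  # replay without logging the message again
    return restored


# Usage: the process fails after receiving four messages, two of them
# after its last checkpoint.
backup = Backup()
proc = Process(backup)
proc.receive(1)
proc.receive(2)
proc.take_checkpoint()
proc.receive(3)
proc.receive(4)
state_before_failure = proc.state

recovered = recover(backup)
assert recovered.state == state_before_failure
```

In this sketch the sender and the application are unaware of the recovery, which reflects the transparency goal stated above; how a real system detects failures and where it stores checkpoints and logs is outside the scope of the example.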


Keywords: Fault Tolerance, Application Process, Parallel Application, Stable Element, Intelligent Management
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Angelo Duarte (1)
  • Dolores Rexachs (1)
  • Emilio Luque (1)

  1. Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Barcelona, Spain
