Proactive Fault Tolerance in MPI Applications Via Task Migration

  • Sayantan Chakravorty
  • Celso L. Mendes
  • Laxmikant V. Kalé
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4297)


Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.
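The evacuation idea in the abstract can be pictured with a small, self-contained simulation. This is only a sketch in plain Python, not the Charm++ or AMPI API: the `Processor` class, the `evacuate` function, and the greedy "send each task to the least-loaded healthy processor" policy are all assumptions made for illustration. The paper's runtime may place tasks differently.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical stand-in for a physical processor hosting migratable tasks."""
    pid: int
    tasks: list = field(default_factory=list)  # ids of tasks hosted here
    healthy: bool = True


def evacuate(processors, failing_pid):
    """Move every task off the processor that is predicted to fail.

    Each task goes to whichever healthy processor currently holds the
    fewest tasks (a simple greedy policy assumed for this sketch only).
    Returns a dict mapping task id -> destination processor id.
    """
    source = next(p for p in processors if p.pid == failing_pid)
    source.healthy = False
    targets = [p for p in processors if p.healthy]
    if not targets:
        raise RuntimeError("no healthy processor left to receive tasks")

    moves = {}
    while source.tasks:
        task = source.tasks.pop(0)
        # Ties on load are broken by the lower processor id.
        dest = min(targets, key=lambda p: (len(p.tasks), p.pid))
        dest.tasks.append(task)
        moves[task] = dest.pid
    return moves


if __name__ == "__main__":
    # Four processors, each hosting three virtual processors (tasks).
    procs = [Processor(pid=i, tasks=[f"vp{i}-{j}" for j in range(3)])
             for i in range(4)]
    print(evacuate(procs, failing_pid=2))
    print({p.pid: len(p.tasks) for p in procs})
```

In this toy model the warning arrives as a direct call to `evacuate`. In the setting the abstract describes, the trigger would instead come from hardware that gives early indication of faults, and the tasks would be the virtual processors managed by the runtime.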


Keywords: Message Passing Interface; Evacuation Time; Runtime System; Collective Operation; Processor Virtualization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Sayantan Chakravorty (1)
  • Celso L. Mendes (1)
  • Laxmikant V. Kalé (1)
  1. Department of Computer Science, University of Illinois at Urbana-Champaign
