RADIC Based Fault Tolerance System with Dynamic Resource Controller

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10862)

Abstract

The continuously growing requirements of High-Performance Computing increase the number of components and, with it, the probability of failure. Long-running parallel applications are directly affected by this phenomenon, since failures disrupt their execution. MPI, a well-known standard for parallel applications, follows fail-stop semantics, requiring application owners to restart the whole execution when hard failures occur, losing time and computation data. Fault Tolerance (FT) techniques address this issue by providing high availability to users' application executions, though at significant resource and time cost. In this paper, we present a Fault Tolerance Manager (FTM) framework based on the RADIC architecture, which provides FT protection to parallel applications implemented with MPI so that they complete successfully despite failures. The solution is implemented in the application layer, following uncoordinated and semi-coordinated rollback-recovery protocols. It uses a sender-based message logger to store the messages exchanged between application processes, and checkpoints only the process data required to restart them in case of failure. The solution uses the concepts of ULFM for failure detection and recovery. Furthermore, a dynamic resource controller is added to the proposal; it monitors the message logger buffers and takes action to maintain an acceptable level of protection. Experimental validation verifies the FTM functionality on two private cluster infrastructures.
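To make the interplay between a sender-based message log and a buffer-watching controller concrete, the following is a minimal Python sketch, not the FTM implementation. The abstract does not say which actions the controller takes, so everything here is an assumption for illustration: the names `SenderLog` and `control_step`, the 0.8 watermark, and the chosen action (nominating the destination that holds the most logged bytes, so that it could checkpoint and let the sender truncate its log).

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical in-memory sender-side log, keyed by destination rank."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = defaultdict(list)  # dest -> [(seq, payload), ...]
        self.next_seq = defaultdict(int)  # dest -> next sequence number

    def record(self, dest, payload):
        """Log a message before it is sent; return its sequence number."""
        seq = self.next_seq[dest]
        self.next_seq[dest] += 1
        self.entries[dest].append((seq, payload))
        self.used_bytes += len(payload)
        return seq

    def replay(self, dest, from_seq):
        """Payloads a restarted receiver needs again, in send order."""
        return [p for s, p in self.entries[dest] if s >= from_seq]

    def truncate(self, dest, checkpointed_seq):
        """Drop entries the receiver consumed before its latest checkpoint."""
        old = self.entries[dest]
        kept = [(s, p) for s, p in old if s >= checkpointed_seq]
        freed = sum(len(p) for _, p in old) - sum(len(p) for _, p in kept)
        self.entries[dest] = kept
        self.used_bytes -= freed
        return freed

    def usage(self):
        """Fraction of the log buffer currently occupied."""
        return self.used_bytes / self.capacity_bytes


def control_step(log, high_watermark=0.8):
    """Assumed controller policy: once usage reaches the watermark, return
    the destination holding the most logged bytes; otherwise return None."""
    if not log.entries or log.usage() < high_watermark:
        return None
    return max(log.entries,
               key=lambda d: sum(len(p) for _, p in log.entries[d]))


log = SenderLog(capacity_bytes=100)
log.record(1, b"a" * 30)
log.record(2, b"b" * 60)
victim = control_step(log)      # usage is 0.9, so rank 2 is nominated
freed = log.truncate(victim, 1) # rank 2 checkpointed after message 0
```

The sketch relies on a general property of sender-based logging: once a receiver has checkpointed, messages it consumed before that checkpoint are no longer needed for replay, so the sender can reclaim their buffer space.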

Keywords

High-Performance Computing, Fault Tolerance, Application layer FT, Sender-based message logging

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. CAOS - Computer Architecture and Operating Systems, Universidad Autónoma de Barcelona, Barcelona, Spain
