Addressing the Last Roadblock for Message Logging in HPC: Alleviating the Memory Requirement Using Dedicated Resources

  • Tatiana Martsinkevich (email author)
  • Thomas Ropars
  • Franck Cappello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9523)

Abstract

The global application checkpoint-restart schemes in use today will not be a suitable solution for HPC applications running at large scale: given the predicted fault rates, they will impose a high load on the I/O subsystem and lead to inefficient resource usage. Combining application checkpointing with message logging is appealing because it allows restarting only the processes that actually failed. One major issue with message logging protocols is the large amount of memory required to store the logs. In this work, we propose using additional dedicated resources to store the part of the logs that does not fit in the memory of a compute node. We show that, combined with a cluster-based hierarchical logging technique, only a few dedicated nodes are required to accommodate the memory requirement of message logging protocols. We additionally show that the proposed technique incurs a reasonable performance overhead.

Keywords

High-performance computing; Fault tolerance; Message logging; Hierarchical message-logging protocols; Dedicated resources

Notes

Acknowledgements

The experiments presented in this paper were carried out using the Grid’5000 experimental testbed, which is being developed under the INRIA ALADDIN development action with support from CNRS, RENATER, and several universities as well as other funding bodies (see https://www.grid5000.fr).

This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Tatiana Martsinkevich (1, email author)
  • Thomas Ropars (2)
  • Franck Cappello (3)

  1. Inria, University of Paris Sud, Orsay, France
  2. Inria, Bordeaux, France
  3. Argonne National Laboratory, Argonne, Lemont, USA