Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6305)

Abstract

With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Nevertheless, few applications are fault tolerant; most need a seamless recovery framework. Among the automatic fault-tolerant techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols is the performance penalty on communications during failure-free periods, which comes mostly from the payload copy introduced for each message. In this paper, we investigate different approaches for logging payload and compare their impact on network performance.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Dongarra, J.J. (2010). Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_20

  • DOI: https://doi.org/10.1007/978-3-642-15646-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15645-8

  • Online ISBN: 978-3-642-15646-5

  • eBook Packages: Computer Science, Computer Science (R0)
