European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 262–271

Reducing the Impact of Soft Errors on Fabric-Based Collective Communications

  • José Carlos Sancho
  • Ana Jokanovic
  • Jesus Labarta

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7156)

Abstract

Collective operations can have a large impact on the performance of scientific applications, especially at large scale. Fabric-based collectives have recently been proposed to address some of the scalability issues caused by OS jitter. However, soft errors are becoming the next factor that may significantly degrade the performance of collectives at scale. This paper evaluates two approaches to mitigating the negative effect of soft errors on Fabric-based collectives. Both approaches replicate the individual packets of the collective multiple times. The first replicates packets through independent output ports at every switch (spatial replication), whereas the second uses a single output port and sends multiple copies of each packet consecutively through it (temporal replication). Results on a 1,728-node cluster show that temporal replication achieves 50% better performance than spatial replication in the presence of random soft errors.
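
The intuition behind packet replication can be illustrated with a small sketch. It is not the model or simulator used in the paper: it assumes, purely for illustration, that each copy of a packet is corrupted independently with the same probability, and the names `delivery_probability` and `simulate_delivery` are hypothetical. Under that assumption the sketch applies equally to spatial and temporal replication, so it says nothing about the performance difference reported above, only about why sending several copies makes losing a collective packet less likely.

```python
import random


def delivery_probability(p_error: float, replicas: int) -> float:
    """Probability that at least one copy of a packet arrives uncorrupted.

    Assumption (not from the paper): each of the `replicas` copies is
    corrupted independently with probability `p_error`.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The packet is lost only if every copy is corrupted.
    return 1.0 - p_error ** replicas


def simulate_delivery(p_error: float, replicas: int,
                      trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    delivered = 0
    for _ in range(trials):
        # A copy survives when its random draw is not below p_error.
        if any(rng.random() >= p_error for _ in range(replicas)):
            delivered += 1
    return delivered / trials


if __name__ == "__main__":
    for r in (1, 2, 3):
        exact = delivery_probability(0.1, r)
        estimate = simulate_delivery(0.1, r, trials=20_000)
        print(f"replicas={r}: exact={exact:.4f}, simulated={estimate:.4f}")
```

In this simplified view the two schemes differ only in where the copies travel (several output ports versus one port used repeatedly), which is why a timing-aware evaluation such as the one in the paper is needed to compare them.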

Keywords

  • Output Port
  • Computing Node
  • Soft Error
  • Network Failure
  • Collective Communication

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Author information

Authors and Affiliations

  1. Barcelona Supercomputing Center, Barcelona, Spain

    José Carlos Sancho, Ana Jokanovic & Jesus Labarta

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Träff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sancho, J.C., Jokanovic, A., Labarta, J. (2012). Reducing the Impact of Soft Errors on Fabric-Based Collective Communications. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_30

  • DOI: https://doi.org/10.1007/978-3-642-29740-3_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer Science, Computer Science (R0)
