Abstract
Collective operations can have a large impact on the performance of scientific applications, especially at large scale. Fabric-based collectives have recently been proposed to address some of the scalability issues caused by OS jitter. However, soft errors are becoming the next factor that might significantly degrade the performance of collectives at scale. This paper evaluates two approaches to mitigating the negative effect of soft errors on Fabric-based collectives. Both approaches replicate the individual packets of the collective multiple times. One replicates packets through independent output ports at every switch (spatial replication), whereas the other uses a single output port and sends multiple copies of each packet consecutively through it (temporal replication). Results on a 1,728-node cluster show that temporal replication achieves 50% better performance than spatial replication in the presence of random soft errors.
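To make the distinction between the two replication schemes concrete, the following is a minimal illustrative sketch, not the paper's implementation or simulation model. The function names, the `(port, slot)` representation, and the assumption that each copy is lost independently with a fixed probability are all hypothetical choices made here for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: a transmission is (output_port, time_slot).
Transmission = Tuple[int, int]


def spatial_schedule(ports: List[int], replicas: int) -> List[Transmission]:
    """Spatial replication: one copy on each of `replicas` distinct
    output ports, all sent in the same time slot."""
    if replicas < 1 or replicas > len(ports):
        raise ValueError("replicas must be between 1 and the number of ports")
    return [(ports[i], 0) for i in range(replicas)]


def temporal_schedule(port: int, replicas: int) -> List[Transmission]:
    """Temporal replication: `replicas` copies sent back to back
    through a single output port, one per consecutive time slot."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [(port, slot) for slot in range(replicas)]


def delivery_probability(p_loss: float, replicas: int) -> float:
    """Probability that at least one copy arrives, ASSUMING each copy
    is lost independently with probability `p_loss` (a simplification
    introduced here, not a model taken from the paper)."""
    if not 0.0 <= p_loss <= 1.0:
        raise ValueError("p_loss must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - p_loss ** replicas


if __name__ == "__main__":
    print(spatial_schedule([0, 1, 2, 3], 3))
    print(temporal_schedule(0, 3))
    print(delivery_probability(0.1, 2))
```

Under this independence assumption both schemes yield the same delivery probability for a given number of copies. The sketch therefore captures only how the copies are placed (across ports versus across time slots) and says nothing about the performance difference reported in the abstract.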
Keywords
- Output Port
- Computing Node
- Soft Error
- Network Failure
- Collective Communication
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sancho, J.C., Jokanovic, A., Labarta, J. (2012). Reducing the Impact of Soft Errors on Fabric-Based Collective Communications. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_30
DOI: https://doi.org/10.1007/978-3-642-29740-3_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer Science, Computer Science (R0)
