Using Performance Tools to Support Experiments in HPC Resilience

  • Thomas Naughton
  • Swen Böhm
  • Christian Engelmann
  • Geoffroy Vallée
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8374)


The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in programming environments, specifically research on extending message passing libraries to support fault-tolerant computing. The community has also recognized that tools for resilience experimentation are greatly lacking. We argue, however, that there are several parallels between “performance tools” and “resilience tools”. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community.

In this paper, we describe the initial motivation for leveraging standard HPC performance analysis techniques to develop diagnostic tools that assist fault tolerance experiments for HPC applications. These diagnostic procedures help provide context about the state of the system when errors (failures) occur. We describe our initial work in leveraging an MPI performance trace tool to provide global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing application codes and develop new ones to support fault tolerance.


Fault Tolerance; Message Passing Interface; High Performance Comput; Trace Data; Resilience Community
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Thomas Naughton (1, 2)
  • Swen Böhm (1)
  • Christian Engelmann (1)
  • Geoffroy Vallée (1)

  1. Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, USA
  2. School of Systems Engineering, The University of Reading, Reading, UK
