Application-Specific Fault Tolerance via Data Access Characterization

  • Nawab Ali
  • Sriram Krishnamoorthy
  • Niranjan Govind
  • Karol Kowalski
  • Ponnuswamy Sadayappan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6853)


Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application’s execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.
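To make the idea of a data access signature concrete, the following is a minimal sketch and not the instrumentation described in the paper. It assumes accesses are get/put/accumulate-style operations on named arrays. The class `AccessTracker`, the module name `scf`, and the array names `density` and `fock` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class AccessTracker:
    """Hypothetical recorder of per-module accesses to named arrays."""

    def __init__(self):
        # counts[module][array] maps each operation kind to a call count.
        self.counts = defaultdict(
            lambda: defaultdict(lambda: {"get": 0, "put": 0, "acc": 0})
        )
        self.module = None

    def enter(self, module):
        """Mark the start of an application module (phase)."""
        self.module = module

    def record(self, array, op):
        """Count one access of kind `op` ("get", "put", or "acc")."""
        self.counts[self.module][array][op] += 1

    def signature(self, module):
        """Classify each array touched in `module` by how it was accessed."""
        sig = {}
        for array, c in self.counts[module].items():
            writes = c["put"] + c["acc"]
            if writes == 0:
                sig[array] = "read-only"
            elif c["get"] == 0:
                sig[array] = "write-only"
            else:
                sig[array] = "read-write"
        return sig


# Illustrative use with made-up module and array names.
tracker = AccessTracker()
tracker.enter("scf")
tracker.record("density", "get")
tracker.record("fock", "acc")
tracker.record("fock", "get")
print(tracker.signature("scf"))
```

Under these assumptions, a classification of this kind is one input for comparing fault tolerance approaches: an array that a module only reads could be protected once before the module starts, while an array that is updated throughout would need a different strategy.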


Fault tolerance, Data access characterization, NWChem





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Nawab Ali (1)
  • Sriram Krishnamoorthy (1)
  • Niranjan Govind (1)
  • Karol Kowalski (1)
  • Ponnuswamy Sadayappan (2)
  1. Pacific Northwest National Laboratory, Richland
  2. The Ohio State University, Columbus
