The Malthusian Catastrophe Is Upon Us! Are the Largest HPC Machines Ever Up?

  • Patricia Kovatch
  • Matthew Ezell
  • Ryan Braby
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)

Abstract

Thomas Malthus, an English political economist who lived from 1766 to 1834, predicted that the earth’s population would be limited by starvation, since population grows geometrically while the food supply grows only linearly. He wrote that “the power of population is indefinitely greater than the power in the earth to produce subsistence for man,” thus defining the Malthusian Catastrophe. There is a parallel between this prediction and the conventional wisdom regarding super-large machines: application problem size and machine complexity are growing geometrically, yet mitigation techniques are improving only linearly.
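
The growth comparison at the heart of this argument can be made concrete with a small sketch. The following Python snippet is illustrative only; the growth ratio, increments, and head start are arbitrary constants, not figures from the paper:

    # Illustrative sketch: a geometrically growing quantity eventually
    # overtakes a linearly growing one, regardless of the constants chosen.

    def crossover_step(ratio=1.05, start=1.0, base=100.0, increment=10.0):
        """First step t at which start * ratio**t exceeds base + increment * t."""
        t = 0
        geometric, linear = start, base
        while geometric <= linear:
            t += 1
            geometric *= ratio   # grows by a fixed ratio each step
            linear += increment  # grows by a fixed amount each step
        return t

    # Even modest 5% compounded growth overtakes a hundredfold
    # linear head start within a few hundred steps.
    print(crossover_step())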

To examine whether the largest machines are usable, the authors collected and examined component failure rates and Mean Time Between System Failures (MTBF) data from the world’s largest production machines, including Oak Ridge National Laboratory’s Jaguar and the University of Tennessee’s Kraken. The authors also collected MTBF data for a variety of Cray XT series machines from around the world, representing over 6 Petaflops of compute power. An analysis of the data is provided, along with plans for future work. High performance computing’s Malthusian Catastrophe has not happened yet, and advances in system resiliency should keep this problem at bay for many years to come.
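
The scaling pressure behind these measurements follows from a standard reliability identity: if node failures are independent and exponentially distributed, failure rates add across nodes, so system MTBF shrinks in proportion to node count. A back-of-the-envelope Python sketch follows, using hypothetical node counts and a hypothetical 25-year per-node MTBF, not measurements from Jaguar or Kraken:

    # Hypothetical illustration: system MTBF under independent,
    # exponentially distributed node failures. Rates add, so a
    # machine of n nodes fails roughly n times as often as one node.

    def system_mtbf_hours(node_mtbf_hours, node_count):
        node_rate = 1.0 / node_mtbf_hours     # failures per hour, per node
        system_rate = node_rate * node_count  # rate of any node failing
        return 1.0 / system_rate

    # A node that fails once every 25 years still yields roughly a
    # one-day system MTBF at about 9,000 nodes.
    for nodes in (1000, 9000, 18000):
        print(nodes, round(system_mtbf_hours(25 * 365 * 24, nodes), 1))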

Keywords

High performance computing · Resiliency · MTBF · Failures · Scalability

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Patricia Kovatch¹
  • Matthew Ezell¹
  • Ryan Braby¹

  1. National Institute for Computational Sciences, The University of Tennessee, Knoxville, USA
