Super-Scalable Algorithms for Computing on 100,000 Processors

  • Christian Engelmann
  • Al Geist
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3514)


In the next five years, the number of processors in high-end systems for scientific computing is expected to rise to tens and even hundreds of thousands. For example, the IBM BlueGene/L can have up to 128,000 processors, and delivery of the first system is scheduled for 2005. Existing deficiencies in the scalability and fault tolerance of scientific applications need to be addressed soon. If the number of processors grows by an order of magnitude while efficiency drops by an order of magnitude, the overall effective computing performance stays the same. Furthermore, the mean time to interrupt of high-end computer systems decreases with scale and complexity. In a 100,000-processor system, failures may occur every couple of minutes, and traditional checkpointing may no longer be feasible. In this paper, we summarize our recent research in super-scalable algorithms for computing on 100,000 processors. We introduce the algorithm properties of scale invariance and natural fault tolerance, and discuss how they can be applied to two different classes of algorithms. We also describe a super-scalable diskless checkpointing algorithm for problems that cannot be transformed into a super-scalable variant, or where other solutions are more efficient. Finally, a 100,000-processor simulator is presented as a platform for testing and experimentation.
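The diskless checkpointing mentioned in the abstract can be illustrated with a minimal sketch. The paper's actual super-scalable algorithm (described in the CLADE 2003 reference below) is more elaborate; the fragment here only shows the underlying XOR-parity principle, by which one lost in-memory checkpoint is rebuilt from its surviving peers without any disk I/O. All names and the four-process setup are hypothetical.

```python
# Illustrative sketch of XOR-based diskless checkpointing.
# Each process keeps a checkpoint of its state in memory; a designated
# peer stores only the XOR parity of all checkpoints, so any single
# lost checkpoint can be reconstructed without touching disk.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Equal-sized memory snapshots held by four processes.
checkpoints = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]

# The parity peer stores only the XOR of all checkpoints.
parity = xor_blocks(checkpoints)

# Process 2 fails; its checkpoint is recovered by XOR-ing the parity
# with the surviving checkpoints.
survivors = checkpoints[:2] + checkpoints[3:]
recovered = xor_blocks(survivors + [parity])
assert recovered == checkpoints[2]
```

Because XOR is its own inverse, the parity peer's storage cost is one checkpoint regardless of the number of processes, though a single parity block can only cover one simultaneous failure per group.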


Keywords: Fault Tolerance · Scale Invariance · Parallel Virtual Machine · High Performance Cluster Computing · Checkpointing Algorithm


References

  1. Adiga, N.R., et al.: An overview of the BlueGene/L supercomputer. In: Proceedings of SC; also IBM research report RC22570, W0209-033 (2002)
  2. Lawrence Livermore National Laboratory, Livermore, CA, USA: ASCI BlueGene/L Computing Platform at
  3. Geist, G.A., Engelmann, C.: Development of naturally fault tolerant algorithms for computing on 100,000 processors (2002) (to be published)
  4. Bosilca, G., Chen, Z., Dongarra, J., Langou, J.: Recovery patterns for iterative methods in a parallel unstable environment. Submitted to SIAM Journal on Scientific Computing (2005)
  5. Space Sciences Laboratory, University of California Berkeley, USA: SETI@home at
  6. Basney, J., Livny, M.: Deploying a high throughput computing cluster. In: Buyya, R. (ed.) High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall PTR, Englewood Cliffs (1999)
  7. Computer Sciences Department, University of Wisconsin, USA: Condor at
  8. Chazan, D., Miranker, W.: Chaotic relaxation. Linear Algebra and its Applications 2, 199–222 (1969)
  9. Baudet, G.M.: Asynchronous iterative methods for multiprocessors. Journal of the ACM 25, 226–244 (1978)
  10. Liu, G.R.: Mesh Free Methods: Moving beyond the Finite Element Method. CRC Press, Boca Raton (2002)
  11. von Eicken, T., Culler, D.E., Goldstein, S.C., Schauser, K.E.: Active Messages: A mechanism for integrated communication and computation. In: 19th International Symposium on Computer Architecture, Gold Coast, Australia, pp. 256–266 (1992)
  12. Geist, G.A., Beguelin, A., Dongarra, J.J., Jiang, W., Manchek, R., Sunderam, V.S.: PVM: Parallel Virtual Machine: A Users’ Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge (1994)
  13. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference. MIT Press, Cambridge (1996)
  14. Engelmann, C., Geist, G.A.: A diskless checkpointing algorithm for super-scale architectures applied to the Fast Fourier Transform. In: Proceedings of CLADE, pp. 47–52 (2003)
  15. University of Paris South, France: MPICH-V at
  16. Indiana University, Bloomington, IN, USA: LAM/MPI at
  17. Chen, Z., Fagg, G.E., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J.: Building fault survivable MPI programs with FT-MPI using diskless checkpointing. Submitted to PPoPP (2005)
  18. Zheng, G., Singla, A.K., Unger, J.M., Kale, L.V.: A parallel-object programming model for petaflops machines and Blue Gene/Cyclops. In: Proceedings of IPDPS (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Christian Engelmann (1)
  • Al Geist (1)

  1. Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, USA
