Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing

  • Mark James
  • Paul Springer
  • Hans Zima
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6272)


This paper describes an approach to providing software fault tolerance for future deep-space robotic NASA missions, which will require a high degree of autonomy supported by enhanced on-board computational capability. Such systems have become possible as a result of emerging many-core technology, which is expected to offer 1024-core chips by 2015. We discuss the challenges and opportunities of this new technology, focusing on introspection-based adaptive fault tolerance that is guided by a fault model and takes into account the specific requirements of applications. Introspection supports runtime monitoring of program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, derived from domain-specific knowledge, or obtained from the results of static or dynamic program analysis. This work is part of an ongoing project at the Jet Propulsion Laboratory in Pasadena, California.


Fault Tolerance; Dust Devil; Java Modeling Language; Sensor Actuator; Single Event Upset
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Mark James (1)
  • Paul Springer (1)
  • Hans Zima (1, 2)
  1. Jet Propulsion Laboratory, California Institute of Technology, Pasadena
  2. University of Vienna, Austria
