Grid Application Fault Diagnosis Using Wrapper Services and Machine Learning

  • Jürgen Hofer
  • Thomas Fahringer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4749)


With the increasing size and complexity of Grids, manual diagnosis of individual application faults becomes impractical and time-consuming. Quick and accurate identification of the root cause of failures is an important prerequisite for building reliable systems. We describe a pragmatic model-based technique for application-specific fault diagnosis based on indicators, symptoms and rules. Customized wrapper services then apply this knowledge to reason about the root causes of failures. In addition to user-provided diagnosis models, we show that, given a set of past classified fault events, it is possible to learn new models that are able to diagnose new faults. We investigated and compared algorithms for supervised classification learning and cluster analysis. Our approach was implemented as part of the Otho Toolkit, which 'service-enables' legacy applications through the synthesis of wrapper services.
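The indicator, symptom and rule idea can be illustrated with a minimal sketch. It is not the Otho Toolkit's actual model or API: the indicator names (`exit_code`, `stderr`, `output_file_exists`, `free_disk_mb`), the symptoms, the rules and the diagnoses below are all hypothetical, chosen only to show how a wrapper might map observed indicators to symptoms and then to a root cause.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Indicators: raw observations a wrapper could collect (hypothetical names).
Indicators = Dict[str, object]


@dataclass(frozen=True)
class Symptom:
    """A named boolean predicate over the indicators."""
    name: str
    holds: Callable[[Indicators], bool]


@dataclass(frozen=True)
class Rule:
    """Maps a set of required symptoms to a diagnosis (root cause)."""
    diagnosis: str
    required: frozenset


# Hypothetical symptoms, for illustration only.
SYMPTOMS: List[Symptom] = [
    Symptom("nonzero_exit", lambda i: i.get("exit_code", 0) != 0),
    Symptom("no_output_file", lambda i: not i.get("output_file_exists", False)),
    Symptom("license_error_logged",
            lambda i: "license" in str(i.get("stderr", "")).lower()),
    Symptom("disk_full", lambda i: i.get("free_disk_mb", 1) <= 0),
]

# Hypothetical rules; the first rule whose symptoms are all observed wins.
RULES: List[Rule] = [
    Rule("license server unavailable",
         frozenset({"nonzero_exit", "license_error_logged"})),
    Rule("scratch disk full", frozenset({"no_output_file", "disk_full"})),
]


def diagnose(indicators: Indicators,
             symptoms: List[Symptom] = SYMPTOMS,
             rules: List[Rule] = RULES) -> Optional[str]:
    """Return the diagnosis of the first matching rule, or None."""
    observed = {s.name for s in symptoms if s.holds(indicators)}
    for rule in rules:
        if rule.required <= observed:
            return rule.diagnosis
    return None
```

Under these assumptions, a learned model would take the place of the hand-written `RULES` list, with past classified fault events serving as training examples over the same symptom vector.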


Keywords: Fault Diagnosis; Grid Application; Bayesian Belief Network; Fault Event; Diagnosis Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jürgen Hofer (1)
  • Thomas Fahringer (1)
  1. Distributed and Parallel Systems Group, Institute of Computer Science, University of Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria
