Fast Extraction of Adaptive Change Point Based Patterns for Problem Resolution in Enterprise Systems

  • Manoj K. Agarwal
  • Narendran Sachindran
  • Manish Gupta
  • Vijay Mann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4269)

Abstract

Enterprise middleware systems typically consist of a large cluster of machines with stringent performance requirements. Hence, when a performance problem occurs in such environments, it is critical that the health monitoring software identifies the root cause with minimal delay. A technique commonly used for isolating root causes is rule definition, which involves specifying combinations of events that cause particular problems. However, such predefined rules (or problem signatures) tend to be inflexible, and crucially depend on domain experts for their definition. We present in this paper a method that automatically generates change point based problem signatures using administrator feedback, thereby removing the dependence on domain experts. The problem signatures generated by our method are flexible, in that they do not require exact matches for triggering, and adapt as more information becomes available. Unlike traditional data mining techniques, where one requires a large number of problem instances to extract meaningful patterns, our method requires few fault instances to learn problem signatures. We demonstrate the efficacy of our approach by learning problem signatures for five common problems that occur in enterprise systems and reliably recognizing these problems with a small number of learning instances.
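The idea of a signature that tolerates inexact matches and adapts with each confirmed fault can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the scoring rule, so the per-metric weighting, the `Signature` class, the `match` function, the `0.7` threshold, and the metric names below are all assumptions made for illustration. Change point detection itself is assumed to happen upstream, so the sketch starts from the set of metrics in which a change point was observed.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    """Hypothetical adaptive signature: per-metric counts of observed change points."""
    problem: str
    instances: int = 0
    counts: dict = field(default_factory=dict)

    def learn(self, changed_metrics):
        """Fold in one administrator-confirmed fault instance."""
        self.instances += 1
        for metric in changed_metrics:
            self.counts[metric] = self.counts.get(metric, 0) + 1

    def weight(self, metric):
        """Fraction of confirmed instances in which this metric changed."""
        if self.instances == 0:
            return 0.0
        return self.counts.get(metric, 0) / self.instances

    def score(self, changed_metrics):
        """Weighted overlap in [0, 1]; 1.0 means every known metric changed."""
        total = sum(self.weight(m) for m in self.counts)
        if total == 0:
            return 0.0
        hit = sum(self.weight(m) for m in self.counts if m in changed_metrics)
        return hit / total


def match(signatures, changed_metrics, threshold=0.7):
    """Return (problem, score) pairs at or above the threshold, best first."""
    observed = set(changed_metrics)
    scored = [(s.problem, s.score(observed)) for s in signatures]
    hits = [(p, sc) for p, sc in scored if sc >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Two confirmed instances of the same (hypothetical) problem.
sig = Signature("db_connection_pool_exhausted")
sig.learn({"db.wait_time", "app.response_time"})
sig.learn({"db.wait_time", "app.response_time", "app.thread_count"})

# A new incident missing one metric still triggers the signature.
partial = match([sig], {"db.wait_time", "app.response_time"})
unrelated = match([sig], {"app.thread_count"})
```

In this sketch a metric seen in every confirmed instance carries more weight than one seen only occasionally, so a signature can trigger without an exact match, and each call to `learn` shifts the weights as more feedback arrives.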

Keywords

fault localization patterns problem signatures change point detection adaptive learning 

Copyright information

© IFIP International Federation for Information Processing 2006

Authors and Affiliations

  • Manoj K. Agarwal (1)
  • Narendran Sachindran (1)
  • Manish Gupta (1)
  • Vijay Mann (1)
  1. IBM India Research Labs, New Delhi, India
