Event Log Mining Tool for Large Scale HPC Systems

  • Ana Gainaru
  • Franck Cappello
  • Stefan Trausan-Matu
  • Bill Kramer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6852)


Event log files are the most common source of information for characterizing events in large-scale systems. However, the large size of these files makes manual analysis of log messages difficult and error prone. For this reason, recent research has focused on algorithms that analyse these log files automatically. In this paper we present a novel methodology for extracting templates that describe event formats from large datasets, presenting an intuitive and user-friendly output to system administrators. Our algorithm keeps up with rapidly changing environments by adapting the clusters to the incoming stream of events. To test our tool, we chose 5 log files that have different formats and that challenge different aspects of the clustering task. The experiments show that our tool outperforms all other algorithms in all tested scenarios, achieving an average precision and recall of 0.9, increasing the number of correct groups by a factor of 1.5, and decreasing the number of false positives and negatives by an average factor of 4.
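To make the idea of an event template concrete, the following is a minimal sketch and not the algorithm presented in the paper. It assumes whitespace-tokenized messages, groups them by token count and first token (a deliberately naive grouping rule chosen for illustration), and replaces every position whose value varies within a group by a wildcard. The function name `extract_templates` and the sample messages are hypothetical.

```python
from collections import defaultdict

WILDCARD = "*"  # placeholder for a variable field (assumed notation)


def extract_templates(messages):
    """Return one template per group of similarly shaped messages.

    Illustrative assumption: two messages belong to the same group when
    they have the same number of tokens and the same first token.
    """
    groups = defaultdict(list)
    for message in messages:
        tokens = message.split()
        if not tokens:  # skip empty lines
            continue
        groups[(len(tokens), tokens[0])].append(tokens)

    templates = []
    for key in sorted(groups):
        rows = groups[key]
        # Walk the group column by column: a position that holds a single
        # distinct token is constant, anything else becomes a wildcard.
        template = [
            column[0] if len(set(column)) == 1 else WILDCARD
            for column in zip(*rows)
        ]
        templates.append(" ".join(template))
    return templates


sample = [
    "node 12 failed with error 5",
    "node 7 failed with error 9",
    "job 44 started",
]
templates = extract_templates(sample)
```

On the sample above, the two `node` messages collapse into a single template with wildcards in the node-id and error-code positions, while the lone `job` message is returned unchanged because nothing in its group varies. A real tool would need a more robust grouping criterion and, as the abstract notes, a way to update groups as new events stream in.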


Keywords: Cluster Goodness; Mining Tool; Data Mining Algorithm; Failure Prediction; Event Description
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ana Gainaru (1, 3)
  • Franck Cappello (1, 2)
  • Stefan Trausan-Matu (3)
  • Bill Kramer (1)

  1. University of Illinois at Urbana-Champaign, USA
  2. INRIA, France
  3. University Politehnica of Bucharest, Romania
