
Data flow analysis for anomaly detection and identification toward resiliency in extreme scale systems



The increased complexity and scale of high-performance computing and future extreme-scale systems have made resilience a key issue, since future systems are expected to experience various faults during critical operations. Current resiliency solutions, which rely mainly on checkpointing in hardware and applications, are also expected to become infeasible because of the unacceptable recovery time of checkpointing and restarting. In this paper, we present innovative concepts for anomaly detection and identification that analyze the durations of pattern transition sequences within an execution window. We use a three-dimensional array of features to capture spatial and temporal variability; when an abnormal behavior pattern indicating some kind of software or hardware failure is captured, an anomaly analysis system uses these features to immediately generate an alert and identify the source of the fault. The main contributions of this paper are the analysis methodology and the feature selection used to detect and identify anomalous behavior. An evaluation against asynchronously injected faults shows a detection rate above 99.9% with no false alarms across a wide range of scenarios, an identification accuracy of 100%, and short root-cause analysis time.
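To make the idea of a three-dimensional feature array concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): features are indexed by (node, metric, time window), a per-(node, metric) baseline is learned from healthy execution windows, and any cell of a new window that deviates strongly from its baseline is flagged as a candidate fault source. The array shapes, the z-score test, and the threshold are illustrative assumptions only.

```python
import numpy as np

# Hypothetical layout: axis 0 = node, axis 1 = metric, axis 2 = time window.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=10.0, scale=1.0, size=(4, 3, 50))  # healthy runs

# Per-(node, metric) baseline statistics over the time-window axis.
mean = baseline.mean(axis=2)
std = baseline.std(axis=2)

def detect(window, threshold=4.0):
    """Return (node, metric) pairs whose current window deviates strongly
    from the learned baseline; these localize the anomaly spatially."""
    z = np.abs((window - mean) / std)
    return np.argwhere(z > threshold)

normal = rng.normal(10.0, 1.0, size=(4, 3))   # a healthy window
faulty = normal.copy()
faulty[2, 1] += 8.0                           # inject a fault: node 2, metric 1

print(detect(normal))   # usually no flags
print(detect(faulty))   # should include [2, 1]
```

Because each flagged entry carries both a node index and a metric index, detection and source identification come out of the same lookup, which loosely mirrors how the paper couples detection with root-cause analysis.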





Author information

Correspondence to Byoung Uk Kim.



Cite this article

Kim, B.U. Data flow analysis for anomaly detection and identification toward resiliency in extreme scale systems. J Supercomput 61, 6–26 (2012). https://doi.org/10.1007/s11227-011-0653-x



Keywords

  • Anomaly
  • Resilience
  • Data analysis
  • Fault detection and identification