Anomaly detection is the process of uncovering anomalies, errors, bugs, and defects in software so that they can be eliminated, increasing the overall quality of a system. Finding anomalies is especially important in big data analytics: big data is "unstructured" by definition, so the process of structuring it continually involves anomaly detection activities.
Data engineering is a challenging process, and different stages of that process affect the outcome in a variety of ways. Manpower, system design, data formatting, the variety of data sources, the size of the software, and the project budget are among the variables that can alter the outcome of an engineering project. Nevertheless, software and data anomalies pose one of the most challenging obstacles to the success of any project. Anomalies have postponed space shuttle launches, caused problems for airplanes, and disrupted credit card and financial systems. Anomaly detection is commonly referred to as a science as well as an art; it is clearly an inexact process, as no two testing teams will produce exactly the same testing design or plan (Batarseh 2012).
The cost of failed software can be high indeed. For example, in 1996, a test flight of the European launch system Ariane 5 (flight 501) failed as a result of an anomaly. Upon launch, the rocket veered off its path and was destroyed by its self-destruct system to avoid further damage. The loss was later analyzed and traced to a simple floating-point conversion anomaly. Another famous example involves a wholesale pharmaceutical distribution company in Texas, FoxMeyer Drugs. The company developed a resource planning system that failed right after implementation because it had not been tested thoroughly. When FoxMeyer deployed the new system, most anomalies floated to the surface and caused a great deal of user frustration, putting the organization into bankruptcy in 1996. Moreover, three people died in 1986 when a radiation therapy system called Therac-25 erroneously subjected patients to lethal overdoses of radiation. More recently, in 2005, Toyota recalled 160,000 Prius automobiles from the market because of an anomaly in the car's software. These examples are just some of the many projects gone wrong (Batarseh and Gonzalez 2015); anomaly detection is therefore a critical and difficult issue to address.
Anomaly Detection Types
Redundancy – Having the same data in two or more places.
Ambivalence – Mixed data or unclear representation of knowledge.
Circularity – Closed loops in software; a function or a system leading to itself as a solution.
Deficiency – Insufficient representation of requirements.
Incompleteness – Lack of representation of the data or the user requirements.
Inconsistency – Any untrue representation of the expert’s knowledge.
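Several of these anomaly types can be checked mechanically in a dataset. The sketch below is a minimal illustration (all record and field names are hypothetical, not drawn from the cited studies) that flags redundancy as duplicate records and incompleteness as missing required fields:

```python
# Minimal sketch: mechanical checks for two anomaly types in tabular records.
# Field names ("id", "name", "pressure") are hypothetical illustrations.

def find_redundancy(records):
    """Return records that appear more than once (redundancy)."""
    seen, duplicates = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))  # canonical form for comparison
        if key in seen:
            duplicates.append(rec)
        seen.add(key)
    return duplicates

def find_incompleteness(records, required_fields):
    """Return records missing any required field (incompleteness)."""
    return [rec for rec in records
            if any(rec.get(f) is None for f in required_fields)]

records = [
    {"id": 1, "name": "pump", "pressure": 30.1},
    {"id": 2, "name": "valve", "pressure": None},  # incomplete record
    {"id": 1, "name": "pump", "pressure": 30.1},   # redundant record
]

print(find_redundancy(records))                               # one duplicate
print(find_incompleteness(records, ["id", "name", "pressure"]))  # one gap
```

In practice such checks run as part of the data-structuring pipeline, so each anomaly type from the list above maps to a concrete, automatable test.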
Anomaly Detection Approaches
Detection through analysis of heuristics – Logical validation with uncertainty, a field of artificial intelligence.
Detection through simulation – Result-oriented validation through building simulations of the system.
Face/field validation and verification – A preliminary, usage-oriented approach, used in combination with other types of detection.
Verification through software testing – A software engineering method, part of testing.
Verification through case testing – Result-oriented validation, achieved by running tests and observing the results.
Verification through graphical representations – Visual validation and error detection.
Decision trees and directed graphs – Visual validation: observing the trees and the structure of the system.
Simultaneous confidence intervals – Result-oriented validation, one of the commonplace artificial intelligence methods.
Result-oriented data analysis – Data collection and outlier detection; usage-oriented validation through statistical methods and data mining.
Visual interaction verification – Visual validation through user interfaces.
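The statistical, result-oriented approaches above (data collection and outlier detection) can be sketched with a simple z-score test. This is a minimal illustration, not the specific method used in the cited studies; the detection threshold (in standard deviations) is a common but arbitrary choice:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # no spread, so no value can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical sensor readings with one injected anomaly (55.0).
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 55.0]
print(zscore_outliers(readings, threshold=2.0))  # [55.0]
```

More robust variants (interquartile-range fences, median absolute deviation) follow the same pattern: estimate the data's typical spread, then flag values that fall outside it.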
However, based on a recent study by the National Institute of Standards and Technology (NIST), the data anomaly itself is not the main quandary; rather, it is the ability to identify the anomaly's location, which is listed as the most time-consuming activity of testing. In their study, NIST researchers surveyed a vast number of software and data projects and reached the following conclusion: "If the location of bugs can be made more precise, both the calendar time and resource requirements of testing can be reduced. Modern data and software products typically contain millions of lines of code. Precisely locating the source of bugs in that code can be very resource consuming." Based on that, it can be concluded that anomaly detection is an important area of research that is worth exploring (NIST 2002; Batarseh and Gonzalez 2015).
As in most engineering domains, software and data require extensive testing and evaluation. The main goal of testing is to eliminate anomalies, in a process referred to as anomaly detection.
It is not possible to perform sound data analysis if the data contains anomalies. Data scientists typically perform steps such as data cleaning, aggregation, and filtering, among many others, and all of these activities require anomaly detection in order to verify the data and produce valid outcomes. Detection also leads to better overall quality of a data system; it is therefore a necessary and unavoidable process. Anomalies occur for many reasons and in many parts of a system, and many practices lead to them (as listed in this entry); locating them, however, is an interesting engineering problem.
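A cleaning-and-aggregation pipeline of this kind can be sketched as follows. This is a minimal illustration under assumed conventions: the field name "temp" and the validity range are hypothetical stand-ins for any domain-specific rule, and the anomaly check runs before aggregation so that only verified records feed the analysis:

```python
def clean(records, field, valid_range):
    """Split records into verified ones and detected anomalies.

    A record is anomalous if `field` is missing (incompleteness)
    or outside `valid_range` (inconsistency with domain knowledge).
    """
    low, high = valid_range
    kept, anomalies = [], []
    for rec in records:
        v = rec.get(field)
        if v is None or not (low <= v <= high):
            anomalies.append(rec)
        else:
            kept.append(rec)
    return kept, anomalies

# Hypothetical temperature readings; 900.0 is physically implausible.
records = [{"temp": 21.5}, {"temp": None}, {"temp": 900.0}, {"temp": 19.8}]
kept, anomalies = clean(records, "temp", (-40.0, 60.0))
print(len(kept), len(anomalies))  # 2 2

# Aggregation is only meaningful on the verified records.
avg = sum(r["temp"] for r in kept) / len(kept)
print(round(avg, 2))  # 20.65
```

Keeping the detected anomalies (rather than silently dropping them) lets an engineer inspect where and why they occurred, which is exactly the localization problem highlighted in the NIST study.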
- Batarseh, F. (2012). Incremental lifecycle validation of knowledge-based systems through CommonKADS. Ph.D. dissertation, University of Central Florida (registered at the Library of Congress).
- Batarseh, F., & Gonzalez, A. (2015). Predicting failures in contextual software development through data analytics. Software Quality Journal, Springer.
- NIST. (2002). The economic impacts of inadequate infrastructure for software testing. Planning report published by the US Department of Commerce.