Below, Loda is compared to the relevant state of the art in three different scenarios. The first scenario simulates anomaly detection on stationary problems, where all detectors are first trained on a training set and then evaluated on a separate testing set. This scenario is included because it enables a comparison of Loda to anomaly detectors designed for batch training, whose accuracy should be superior due to more sophisticated training algorithms. The second scenario mimics streamed data, where anomaly detectors first classify a sample and then use it to update their model. The third scenario uses a dataset with millions of samples and features to demonstrate that Loda can efficiently handle big data. The section concludes with an experimental comparison of the detectors' time complexity in the classify-and-update scenario, and with demonstrations of robustness against missing variables and of the identification of the causes of anomalies.
Stationary data
Loda has been compared to the following anomaly detectors, chosen to represent different approaches to anomaly detection: the PCA-based anomaly detector (Shyu et al. 2003), 1-SVM (Schölkopf et al. 2001), FRAC (Noto et al. 2012), the \(\delta \)-version of the kth-nearest-neighbor detector (Harmeling et al. 2006) (KNN), which is equivalent to Sricharan and Hero (2011) with \(\gamma =1\) and \(s=d\) (in the notation of the referenced work) if the area under the ROC curve (AUC) is used for evaluation, LOF (Breunig et al. 2000), and IForest (Liu et al. 2008).
Benchmarking data were constructed by the methodology proposed in Emmott et al. (2013), which converts real-world datasets (Frank and Asuncion 2010) into anomaly detection problems. The methodology produces a set of normal and a set of anomalous samples, where samples from the two classes have different probability distributions, which is aligned with our definition of anomalous samples. The created problems are divided according to (i) the difficulty of finding an anomaly (easy, medium, hard, very hard), (ii) the scatter/clusteredness of the anomalies (high scatter, medium scatter, low scatter, low clusteredness, medium clusteredness, high clusteredness), and (iii) the rate of anomalies within the data \(\{0,0.005,0.01,0.05,0.1\}\). The training and testing sets were created such that they have the same properties (difficulty, scatter/clusteredness, and rate of anomalies), with clustered anomalies located around the same point. Note that the total number of unique combinations of problem properties is up to 120 for each dataset. The construction of individual problems is recapitulated in Appendix “Construction of datasets”.
The quality of detection is measured by the usual area under the ROC curve (AUC). Since up to 120 problems with different combinations of properties can be derived from every dataset (out of 36), it is impossible to present all \(7\times 4200\) AUC values. Tables with AUCs averaged over the difficulty of detecting anomalies are in the supplemental material and cover 15 pages. The results are therefore presented in an aggregated form using critical difference diagrams (Demšar 2006), which show the average rank of the detectors together with an interval in which the Bonferroni–Dunn correction of the Wilcoxon signed-rank test cannot reject the hypothesis that the detectors within the interval have the same performance as the best one. The average rank comparing AUC is calculated over all datasets on which all detectors provided an output; following Demšar (2006), lower rank is better. The AUCs used for ranking on each dataset are averages over all combinations of problem parameters with the parameter(s) of interest fixed. For example, to compare detectors on different rates of anomalies within the data, for each rate of anomalies the average AUC is calculated over all combinations of difficulty and scatter. The average AUCs of different detectors on the same dataset are ranked, and the average of these ranks over datasets is the final average rank of a detector for one rate of anomalies within the data.
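To make the aggregation concrete, the following minimal sketch (in Python, with randomly generated AUC values standing in for the real benchmark results) computes the average ranks that are plotted in the critical difference diagrams.

```python
import numpy as np
from scipy.stats import rankdata

# auc[d, s, c]: AUC of detector d on dataset s for parameter combination c;
# filled with random numbers here purely to make the sketch runnable.
rng = np.random.default_rng(0)
n_detectors, n_datasets, n_combinations = 7, 36, 24
auc = rng.uniform(0.5, 1.0, size=(n_detectors, n_datasets, n_combinations))

# 1. Average the AUC over all combinations sharing the fixed parameter of
#    interest (e.g. over all difficulties and scatters for one anomaly rate).
mean_auc = auc.mean(axis=2)                                # (n_detectors, n_datasets)

# 2. Rank the detectors on every dataset; rank 1 = highest AUC.
ranks = np.array([rankdata(-mean_auc[:, s]) for s in range(n_datasets)]).T

# 3. Average the ranks over the datasets -- the quantity shown in the
#    critical difference diagrams (lower is better).
print(ranks.mean(axis=1))
```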
To decrease the noise caused by statistical variations, experiments for every combination of problem properties were repeated 100 times. This means that every detector was evaluated on up to \(100\times 120\times 36=4.32\times 10^{5}\) problems. Benchmark datasets created as described in Appendix “Construction of datasets” were varied between repetitions by randomly selecting 50 % of normal samples (but at most 10,000) for the training set and putting the remaining data into the testing set. Similarly, anomalous samples were first selected randomly from all anomalous samples to reach the desired fraction of outliers within the data, and then randomly divided between the training and testing sets. The data in the training and testing sets have the same properties in terms of difficulty, clusteredness, and fraction of anomalies.
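The split used in each repetition can be sketched as follows; the function below is a simplified illustration only (the spatial clustering of anomalies and the exact anomaly counts are omitted), not the benchmark code used in the paper.

```python
import numpy as np

def make_split(normal, anomalous, anomaly_rate, rng):
    """One repetition of the benchmark split (simplified sketch)."""
    n_train = min(len(normal) // 2, 10_000)        # 50 % of normals, at most 10,000
    perm = rng.permutation(len(normal))
    train_normal = normal[perm[:n_train]]
    test_normal = normal[perm[n_train:]]

    # select anomalies to reach roughly the desired fraction of outliers,
    # then divide them randomly between the two sets
    n_anom = min(int(anomaly_rate / (1 - anomaly_rate) * len(normal)), len(anomalous))
    anom = anomalous[rng.choice(len(anomalous), size=n_anom, replace=False)]
    half = n_anom // 2
    train = np.vstack([train_normal, anom[:half]])
    test = np.vstack([test_normal, anom[half:]])
    return train, test
```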
Settings of hyper-parameters
Setting hyper-parameters in anomaly detection is generally a difficult, unsolved problem unless a validation ground truth is available for parameter tuning. Wrong parameters can cause an otherwise well-designed anomaly detector to fail miserably. Nevertheless, we do not aim to solve this problem here, as our intention is primarily to compare multiple detectors against each other. For this purpose we follow the hyper-parameter setting guidelines given by the authors of the respective methods. Note that our proposed method does not require manual hyper-parameter setting. The employed parameter settings are detailed below and in Table 2.
Table 2 Summary of detectors' hyper-parameters used in experiments on stationary data in Sect. 4.1.2
The number of nearest neighbors in our implementations of the LOF and KNN algorithms followed Emmott et al. (2013) and Breunig et al. (2000) and was set to \(\max \{10,0.03\times n\}\), where n is the number of samples in the training data. 1-SVM with a Gaussian kernel used \(\nu =0.05\) and a kernel width \(\gamma \) equal to the inverse of the median \(L_{2}\) distance between training samples. The SVM implementation was taken from the libSVM library (Chang and Lin 2011). Our implementation of the FRAC detector used ordinary linear least-squares estimators, which in this setting have no hyper-parameters. Our implementation of the PCA detector based on the principal component transformation used the top k components capturing more than 95 % of the variance. IForest [implementation taken from Jonathan et al. (2014)] used the parameters recommended in Liu et al. (2008): 100 trees, each constructed with 256 randomly selected samples.
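For illustration, the two data-dependent settings above can be computed as in the following sketch; the helper names are ours, and the exact parametrisation of the Gaussian kernel inside libSVM may differ from this plain formula.

```python
import numpy as np
from scipy.spatial.distance import pdist

def knn_neighbours(n_train):
    """Number of neighbours for KNN/LOF: max{10, 0.03 * n} (Emmott et al. 2013)."""
    return int(max(10, 0.03 * n_train))

def svm_gamma(X_train):
    """Gaussian-kernel width for 1-SVM: inverse of the median pairwise
    L2 distance between training samples."""
    return 1.0 / np.median(pdist(X_train, metric="euclidean"))
```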
The number of histograms in Loda, k, was determined similarly to the number of probes in feature selection in Somol et al. (2013). Denoting by \(f_{k}(x)\) Loda's output with k histograms, the reduction of variance after adding another histogram can be estimated as
$$\begin{aligned} \hat{\sigma }_{k}=\frac{1}{n}\sum _{i=1}^{n}\left| f_{k+1}(x_{i})-f_{k}(x_{i})\right| . \end{aligned}$$
Although \(\hat{\sigma }_{k}\rightarrow 0\) as \(k\rightarrow \infty \), its magnitude is problem-dependent, making it difficult to set a threshold on this quantity. Therefore \(\hat{\sigma }_{k}\) is normalized by \(\hat{\sigma }_{1}\), and k is determined as
$$\begin{aligned} k=\min \left\{ j\in {\mathbb {N}}:\frac{\hat{\sigma }_{j}}{\hat{\sigma }_{1}}<\tau \right\} , \end{aligned}$$
where \(\tau \) is the threshold. In all experiments presented in this paper, \(\tau \) was set to 0.01. Unless stated otherwise, Loda used equi-width histograms with the number of histogram bins, b, determined for each histogram separately by the penalized maximum likelihood method of Birgé and Rozenholc (2006), described briefly in Sect. 3.4. With b set, the width of bins in equi-width histograms was \(\frac{1}{b}(x_{\mathrm {max}}-x_{\mathrm {min}})\).
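The stopping rule for the number of histograms can be sketched as follows. Here `loda_score` is a placeholder for a function returning Loda's outputs \(f_k(x_i)\) computed from the first k projections/histograms, and the rule is read as "stop once the normalized reduction falls below \(\tau\)".

```python
import numpy as np

def select_number_of_histograms(loda_score, X, k_max=500, tau=0.01):
    """Grow the ensemble until adding one more histogram changes the output
    by less than a fraction tau of the first observed change (sketch).

    loda_score(X, k) is assumed to return the vector of outputs f_k(x_i)
    computed from the first k projections/histograms."""
    f_prev = loda_score(X, 1)
    sigma_1 = None
    for k in range(1, k_max):
        f_next = loda_score(X, k + 1)
        sigma_k = np.mean(np.abs(f_next - f_prev))     # \hat{\sigma}_k
        if sigma_1 is None:
            sigma_1 = sigma_k                          # normalisation constant \hat{\sigma}_1
        if sigma_k / sigma_1 < tau:                    # reduction became negligible
            return k
        f_prev = f_next
    return k_max
```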
The settings of all hyper-parameters are summarized in Table 2.
Experimental results
Figure 1a–e shows the average rank of detectors plotted against the rate of anomalies within the training set for different clusteredness of anomalies (AUCs were averaged only with respect to the difficulty of anomalies). Since highly scattered anomalies could not be produced from the used datasets, the corresponding plot is omitted. The number of datasets yielding problems with medium-scattered anomalies is also low, and no conclusions should be drawn from them as they would not be statistically sound.
For problems with a low number of low-scattered and low-clustered anomalies, the data-centered detectors (KNN and LOF) perform well. With an increasing number of anomalies in the data and with their increasing clusteredness, however, the performance of data-centered detectors appears to deteriorate (note LOF in particular). Additional investigation revealed that this effect can be mitigated by hyper-parameter tweaking, specifically by increasing the number of neighbors k. In practice this is a significant drawback, as there is no way to optimize k unless labeled validation data are available.
The model-centered detectors (Loda, IForest, as well as 1-SVM) appear to be comparatively more robust with respect to the increase in the number of anomalies or their clusteredness, even when used with fixed parameters (see Sect. 4.1.1).
In terms of statistical hypothesis testing the experiments do not show marked differences between the detectors in question. Note that in isolated cases FRAC and 1-SVM perform statistically worse than the respective best detector.
Figure 1f shows the detectors' time to train and classify samples with respect to the size of the problem, measured by the dimension multiplied by the number of samples. On small problems KNN, LOF, and 1-SVM are very fast, as their complexity is mainly determined by the number of samples, but Loda is not far behind. As the problems get bigger, Loda quickly starts to dominate all detectors, followed by PCA. Surprisingly, IForest, despite its low theoretical complexity, had the highest running times. Since its running time is almost independent of the problem size, this is probably due to a large multiplicative constant absorbed in the big-O notation.
Comparing the detectors by the performance/time-complexity ratio, we recommend KNN for small problems with scattered anomalies, while for bigger problems and problems with clustered anomalies we recommend the proposed Loda, as its performance is similar to IForest but it is much faster.
The investigation of problems on which Loda is markedly worse than KNN revealed that Loda performs poorly in cases where the support of the probability distribution of the nominal class is not convex and encapsulates the support of the probability distribution of the anomalous class. A typical example is the “banana” dataset in Fig. 2, showing isolines at false positive rates \(\{0.01,0.05,0.1,0.2,0.4\}\) of Loda and KNN. The isolines show that Loda has difficulty modeling the bay inside the arc. Although the adopted construction of problems from UCI datasets (see Appendix “Construction of datasets” for details) aimed to create such difficult datasets, Loda's performance was overall competitive with other, more complex detectors.
Breunig et al. (2000) criticized KNN, which performed the best on clean training data, for its inability to detect local outliers occurring in datasets with areas of very different densities. Figure 3 again shows isolines at false positive rates \(\{0.01,0.05,0.1,0.2,0.4\}\) of Loda and LOF (designed to detect local outliers) on data generated by two Gaussians with very different standard deviations (a typical example of such data). Loda's isolines around the small compact cluster (lower left) are very close to each other, whereas the isolines of the loose cluster (upper right) are further apart, which demonstrates that Loda is able to detect local outliers similarly to LOF. LOF's wiggly isolines also demonstrate over-fitting to the training data.
Figure 4 visualizes the average rank of Loda with equi-width, equi-depth, and on-line (Ben-Haim and Tom-Tov 2010) histograms for different rates of anomalies within the data. Although for almost all rates Friedman's statistical test accepted the hypothesis that all three versions of Loda perform the same, Loda with equi-width histograms is clearly better. This is important because equi-width histograms can be constructed efficiently from a stream of data once the bin width is determined, for example from sample data.
Streamed and non-stationary data
This section compares Loda with a floating-window histogram and with two alternating histograms (see Sect. 3.4) to Half-Space Trees (HS-Trees) (Tan et al. 2011) [implementation taken from Tan (2014)]. The on-line version of 1-SVM was omitted from the comparison due to the difficulties with hyper-parameter setting and its inferior performance when the algorithms were compared in the batch setting. The comparison is made on seven datasets: Shuttle, Covertype, and the HTTP and SMTP data from KDD Cup 99. The HTTP and SMTP data are used in two variants: (i) with all 41 features (called HTTP-full and SMTP-full) and (ii) with the 3 features (called HTTP-3 and SMTP-3) used in Tan et al. (2011).
In every repetition of an experiment, the testing data were created by replacing at most 1 % of randomly selected normal samples by randomly selected anomalous samples. The order of normal samples was not permuted, in order to preserve the continuity of the data. Thus, the variation between repetitions comes (i) from the positions where anomalous samples were inserted and (ii) from the selection of anomalous samples to be mixed in. To study the effect of clustered anomalies, anomalies were first selected to form clusters in space (as in the previous section) and then replaced sequences of normal data of the same size. By this process the data-streams contained anomalies clustered in time and space in clusters of sizes \(\{1,4,16\}\) (this quantity is hereafter called clusteredness). Experiments on each dataset were repeated 100 times for every value of clusteredness.
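A rough sketch of this stream construction is given below; the spatial clustering of the selected anomalies is omitted here, and anomaly clusters simply replace contiguous runs of normal samples (overlapping clusters are not handled).

```python
import numpy as np

def build_stream(normal, anomalous, clusteredness, max_rate=0.01, rng=None):
    """Replace at most max_rate of the normal stream by anomaly clusters of
    size `clusteredness`, keeping the original order of normal samples."""
    rng = rng or np.random.default_rng()
    stream = normal.copy()
    labels = np.zeros(len(stream), dtype=int)
    n_clusters = int(max_rate * len(stream)) // clusteredness
    for _ in range(n_clusters):
        start = rng.integers(0, len(stream) - clusteredness)     # run of normals to replace
        idx = rng.choice(len(anomalous), size=clusteredness, replace=False)
        stream[start:start + clusteredness] = anomalous[idx]     # anomalies clustered in time
        labels[start:start + clusteredness] = 1
    return stream, labels
```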
HS-Trees used the recommended configuration of 50 trees of depth 15 with a minimum of 20 samples per leaf node. Loda's configuration, namely the number of histograms and bins, was optimized on the first 256 samples in every repetition, so there were no parameters to be set manually. Both algorithms used a window of 256 samples, which is the length hard-coded in the implementation of HS-Trees provided by its authors.
Table 3 shows the AUCs and the average time to process samples for Loda and HS-Trees. We can see that both algorithms deliver nearly the same performance, while Loda is on average 7–8 times faster (the time complexity is treated in more detail in the next subsection). Figure 5 shows AUCs on segments of 120,000 consecutive samples of the datasets Covertype, HTTP, HTTP-3, and SMTP+HTTP with clusters of 16 anomalies (the omitted datasets did not contain enough samples). Since the AUCs of all three detectors remain stable, it can be concluded that all detectors cope equally well with the concept drift in the datasets.
Table 3 AUCs of Loda and HS-Trees (Tan et al. 2011) on streaming problems with various levels of clusteredness
Further investigation of the results reveals that Loda with two alternating histograms (a construction similar to that in HS-Trees) is more robust to clustered anomalies than Loda with a continuous histogram. Our explanation of this phenomenon is that clustered anomalies are, most of the time, classified by histograms built on clean data without anomalies, while polluted histograms mostly classify clean data, which is not influenced by anomalies in the training data. These results demonstrate that the right choice of histogram for non-stationary data should depend on the type of anomalies (clustered in time and space vs. scattered).
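For reference, the two-alternating-histograms scheme discussed above (described in Sect. 3.4) behaves roughly as in the following sketch for a single Loda projection; the histogram interface (`add`, `pdf`) is assumed here and is not taken from the paper.

```python
import numpy as np

class AlternatingHistograms:
    """Sketch of two alternating histograms: samples are scored by a histogram
    finished on the previous window while a fresh histogram is being built on
    the current one; the roles are swapped after every window."""

    def __init__(self, make_histogram, window=256):
        self.make_histogram = make_histogram    # factory returning an empty 1-d histogram
        self.window = window
        self.active = None                      # histogram used for scoring
        self.building = make_histogram()        # histogram currently being filled
        self.count = 0

    def update(self, z):
        """z: projection of one sample onto a single Loda direction."""
        self.building.add(z)
        self.count += 1
        if self.count == self.window:           # end of window: swap the roles
            self.active = self.building
            self.building = self.make_histogram()
            self.count = 0

    def score(self, z):
        h = self.active if self.active is not None else self.building
        return -np.log(h.pdf(z))                # anomaly score: negative log density
```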
Big data
Loda's ability to handle big data is demonstrated on the URL dataset (Ma et al. 2009) containing 2.4 million samples with 3.2 million features. Normal samples contain features extracted from random URLs, while anomalous samples have features extracted from links in spam e-mails collected during 120 days at a rate of 20,000 samples per day (see Ma et al. 2009 for dataset details). Due to the high number of “anomalous” samples and the cleanliness of normal samples, the dataset belongs to the category of one-class problems, where the goal is to separate the class of legitimate URLs from the others.
Since the data are presented in batches of 20,000 samples per day, all methods evaluated in Sect. 4.1 can in theory be applied by training them on normal samples from the previous day, i.e., detectors classifying samples from the lth day are trained on normal samples from the \((l-1)\)th day. The problem is that the computational complexity effectively prevents a direct application of FRAC, 1-SVM, PCA, LOF, and KNN, as these methods would not finish in reasonable time. The only methods that can be used without modification are Loda and IForest. Their settings were the same as in Sect. 4.1, with the difference that (i) Loda used 500 projections, with histogram bins optimized on data from day zero and then kept fixed, and (ii) the number of trees within IForest was increased to 1000. Rows labeled \(3.2\times 10^{6}\) in Table 4 show the AUCs, the average rank over 120 days, and the average time to classify new samples and retrain the detectors. We can see that Loda was not only significantly better (the AUC of IForest is close to that of a detector answering randomly), but also more than 15 times faster.
Table 4 Average rank (column rank), average AUC (column AUC), and average time to classify and update detectors on 20,000 samples (column time) on the full problem (\(d=3.2\times 10^6\)) and on the reduced problem (\(d=500\))
To reduce the computational complexity so that other algorithms can be used, the original data of dimension 3.2 million were projected onto a 500-dimensional space with a randomly generated matrix \(W\in {\mathbb {R}}^{3.2\times 10^{6}\times 500}\), \(W_{ij}\sim N(0,1)\). According to the Johnson–Lindenstrauss lemma, \(L_{2}\) distances between points should be approximately preserved, and therefore most detectors should work. Hyper-parameters of all detectors were set as in Sect. 4.1.1. AUCs of all detectors on every day are shown in Fig. 6, and their summaries again in Table 4. On this reduced dataset the best algorithm was KNN, followed by PCA and Loda. These results are consistent with those in Sect. 4.1.2, where the KNN detector was generally good when trained on clean data. The good performance of PCA can be explained by the projected data being close to a multivariate normal distribution, although Lilliefors' test rejected the hypothesis that the marginals follow the normal distribution at an 85 % rate. Notice that Loda's execution time was more than 100 times shorter than that of KNN, as its time to classify and update on samples from one day was 1.16 s on average, while that of KNN was 140 s.
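The random projection can be sketched as follows; the column-blocked generation of W is our own memory-saving choice (the full \(3.2\times 10^{6}\times 500\) Gaussian matrix would occupy roughly 12 GB in double precision), and in practice the projection would be applied to one daily batch of samples at a time.

```python
import numpy as np

def random_project(X, d_out=500, block=50, seed=0):
    """Project X (n x d, dense or scipy.sparse) onto d_out dimensions with a
    Gaussian random matrix W, W_ij ~ N(0, 1), generated and applied in column
    blocks so that the full d x d_out matrix never has to be stored."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = np.zeros((n, d_out))
    for j in range(0, d_out, block):
        cols = min(block, d_out - j)
        W_block = rng.standard_normal((d, cols))   # one slice of W at a time
        Z[:, j:j + cols] = X @ W_block
    return Z
```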
Time complexity
Loda was primarily designed to have low computational complexity in order to efficiently process large data and data-streams. This feature is demonstrated below, where the time to classify one sample with a simultaneous update of the detector is compared for PCA, KNN, LOF, HS-Trees, and Loda with a continuous histogram. All settings of the individual algorithms were kept the same as in the previous experiments. Loda's time includes the optimization of hyper-parameters on the first 256 samples of the stream as described in Sect. 4.1.1 (Loda used on average 140 projections, with histograms having 16 bins and a sliding window of length 256).
The comparison was made on 100,000 artificially generated samples of dimensions 100 and 10,000. As the interest here is in throughput, the details of data generation are skipped as not relevant for the experiment. The results are summarized in Fig. 7, showing the average time to classify and update one sample against the number of observed samples. An experiment was stopped if the processing of 1000 samples took more than half an hour, causing some graphs to be incomplete. The graph for the PCA detector is missing entirely from the left panel of Fig. 7, because processing 1000 samples took more than 15 h (109 s per sample). Both graphs demonstrate that Loda's efficiency is superior to all other detectors, as it is two orders of magnitude faster than the fastest algorithms from the prior art. This experiment highlights Loda's suitability for efficiently handling data-streams.
Robustness to missing variables
As mentioned in Sect. 3, Loda with sparse projections can classify and learn from samples with missing variables. This feature was evaluated under the “missing at random” scenario, where missing variables are independent of the class membership. In every repetition of an experiment, before the data were divided into training and testing sets, a fraction \(\{0,0.01,0.02,0.04,\ldots ,0.20\}\) of all variables in the data matrices was made missing (set to NaN). Consequently, missing variables were equally present in the sets on which Loda was trained and evaluated. The robustness was evaluated on datasets created in the same way as in Sect. 4.1, with clean training sets and 10 % of anomalies in the test sets. Loda's hyper-parameters were determined automatically as described in Sect. 4.1.1.
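The “missing at random” masking can be reproduced with a few lines; the function below is a sketch operating on a dense data matrix and is applied before the train/test split.

```python
import numpy as np

def make_missing(X, rate, rng=None):
    """Set a random fraction `rate` of all entries of X to NaN
    ('missing at random', independent of the class membership)."""
    rng = rng or np.random.default_rng()
    X = X.astype(float)                      # NaN requires a floating-point matrix
    X[rng.random(X.shape) < rate] = np.nan
    return X
```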
The effect of missing features on the detector's AUC is summarized in Fig. 8, showing the highest missing rate at which Loda's AUC was higher than 99 % of that on the data without missing variables (light grey bars). For a large set of problems the tolerance is high, and more than 10 % of missing variables are tolerated with negligible impact on the AUC. There are also exceptions, e.g., magic telescope, page-blocks, wine, etc., where the robustness is very small, but we have failed to find a single cause of this fragility. Interestingly, the robustness is independent of the dimension of the problem and of the success with which outliers are detected. The exception is datasets on which the detection of outliers is poor (AUC around 0.5), which is caused by the fact that missing variables cannot in general improve or worsen an already bad AUC. The same figure also shows the rate of missing variables at which Loda stops providing output (dark grey bars). We can see that this rate is usually much higher than the rate up to which Loda retains its AUC.
Identification of features responsible for an outlier
Explaining the causes of detected anomalies is a relatively new field, and there is not yet an established methodology to evaluate and compare the quality of algorithms; hence, the following is adopted here. It starts by calculating the average of Loda's feature scores according to (4) over all anomalous samples in the testing set. Once the scores of individual features are known, they are sorted in decreasing order, so that the features in which anomalous samples deviate the most come first. Then, Loda with dense projections utilizing only the first \(d'\) features is used and its AUC is recorded (\(d'\) was varied from 2 to \(\min (100,d)\)). If the feature relevance score (4) is meaningful, then Loda using only the first few features (low \(d'\)) should have the same or better AUC than Loda using all features. This is because all testing problems were originally classification problems, which means that anomalous samples were generated from the same probability distribution and hence similar features should be responsible for their anomalousness. Experiments were performed on clean training datasets, and all other experimental settings were kept the same as in Sect. 4.1.
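The protocol can be summarized by the following sketch; `feature_scores` (the averaged per-feature scores from Eq. (4)), `train_loda`, and its `decision_function` method are placeholders for Loda's internals, and the AUC computation via scikit-learn is our choice, not prescribed by the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def importance_curve(feature_scores, train_loda, X_train, X_test, y_test, d_max=100):
    """AUC of Loda with dense projections restricted to the d' highest-scoring
    features, for d' = 2 .. min(d_max, d).

    feature_scores : average of the per-feature scores (4) over anomalous
                     test samples.
    train_loda(X)  : assumed to train Loda on X and return an object with a
                     decision_function method producing anomaly scores."""
    order = np.argsort(-feature_scores)                  # most responsible features first
    aucs = {}
    for d_prime in range(2, min(d_max, X_train.shape[1]) + 1):
        cols = order[:d_prime]
        detector = train_loda(X_train[:, cols])          # retrain on the reduced feature set
        aucs[d_prime] = roc_auc_score(y_test, detector.decision_function(X_test[:, cols]))
    return aucs
```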
Figure 9 shows the average AUC of Loda with dense projections using the most important \(d_{\mathrm {min}}\) features, where \(d_{\mathrm {min}}\) is the smallest number of features such that the AUC obtained using only the first \(d_{\mathrm {min}}\) features is higher than 0.99 times the best AUC. On average, only 25 % of the features are needed to identify outliers with an AUC comparable to or better than that of Loda with all features. This result confirms that Loda is able to identify the features causing the anomaly.
It is interesting that for some problems (e.g., cardiotocography, glass, spect-heart, yeast) the improvement in the AUC is significant. This can be explained by the curse of dimensionality, as anomaly detectors are, by their nature, sensitive to the noise introduced by non-informative features.