A meta-level analysis of online anomaly detectors

Abstract

Real-time detection of anomalies in streaming data is receiving increasing attention as it allows us to raise alerts, predict faults, and detect intrusions or threats across industries. Yet, little attention has been given to compare the effectiveness and efficiency of anomaly detectors for streaming data (i.e., of online algorithms). In this paper, we present a qualitative, synthetic overview of major online detectors from different algorithmic families (i.e., distance, density, tree or projection based) and highlight their main ideas for constructing, updating and testing detection models. Then, we provide a thorough analysis of the results of a quantitative experimental evaluation of online detection algorithms along with their offline counterparts. The behavior of the detectors is correlated with the characteristics of different datasets (i.e., meta-features), thereby providing a meta-level analysis of their performance. Our study addresses several missing insights from the literature such as (a) how reliable are detectors against a random classifier and what dataset characteristics make them perform randomly; (b) to what extent online detectors approximate the performance of offline counterparts; (c) which sketch strategy and update primitives of detectors are best to detect anomalies visible only within a feature subspace of a dataset; (d) what are the trade-offs between the effectiveness and the efficiency of detectors belonging to different algorithmic families; (e) which specific characteristics of datasets yield an online algorithm to outperform all others.

Notes

  1. In this paper, we use the terms outlier, novelty and anomaly detection interchangeably.

  2. We call “sample” an element (i.e., observation, measurement) of a data stream.

  3. A hyper-parameter is a parameter whose value cannot be estimated from the data.

  4. The recently proposed distance-based detector NETS [73] consumes fewer resources than MCOD but does not report any improvement in terms of effectiveness.

  5. Nearest neighbors are distinguished according to maximum and average distance.

  6. \(2^{h+1} - 1\) is the number of nodes in a perfect binary tree.

  7. https://aws.amazon.com/kinesis/

  8. \(2 n - 1\) is the number of nodes of a full binary tree with n leaves.

  9. We consider the coordinates of each point as (F2, F1).

  10. In our implementation, we consider the simple forgetting mechanism rather than the time-decaying mechanism.

  11. The authors report that the value \(\gamma =0.01\) is optimal for most datasets.

  12. https://github.com/Waikato/moa

  13. https://infolab.usc.edu/Luan/Outlier/CountBasedWindow/DODDS/src/outlierdetection/

  14. https://github.com/kaist-dmlab/STARE

  15. https://github.com/tranvanluan2/cpod

  16. https://github.com/cmuxstream/cmuxstream-core

  17. Online version: http://agents.fel.cvut.cz/stegodata/tools/ Offline Version: https://github.com/yzhao062/pyod

  18. https://github.com/bedanta01/Subspace-Outlier-Detection

  19. https://scikit-learn.org/

  20. https://github.com/ngoix/OCRF

  21. After removing null values and categorical features not treated by our anomaly detectors.

  22. https://www.ipd.kit.edu/~muellere/HiCS/

  23. There are also some cases of unknown anomalies that may appear in any of those categories.

  24. The k value is automatically computed during training.

  25. When p is stored in a leaf of size larger than one, the value of \(h(p)\) is adjusted by \(c(size)\).

  26. In most experimental studies [15, 24, 25], detectors were executed using the default values of their hyper-parameters as “recommended by their authors.”

  27. RS does not require computing the gradient of the function to be optimized and hence can be used on functions that are not continuous or differentiable [31].

  28. https://en.wikipedia.org/wiki/Coefficient_of_variation

References

  1. Aggarwal, C.: An Introduction to Outlier Analysis, pp. 1–40 (2013)

  2. Aggarwal, C., Hinneburg, A., Keim, D.: On the surprising behavior of distance metrics in high dimensional spaces. In: ICDT (2001)

  3. Aggarwal, C., Sathe, S.: Theoretical foundations and algorithms for outlier ensembles. SIGKDD Explor. 17(1), 74 (2015)

  4. Aggarwal, C., Sathe, S.: Outlier Ensembles-An Introduction. Springer, Berlin (2017)

  5. Akidau, T., Bradshaw, R., Chambers, C., Chernyak, S., Fernández-Moctezuma, R., Lax, R., McVeety, S., Mills, D., Perry, F., Schmidt, E., Whittle, S.: The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. PVLDB 8(12), 68 (2015)

  6. Alcobaça, E., Siqueira, F., Rivolli, A., Garcia, L., Oliva, J., de Carvalho, A.: Mfe: towards reproducible meta-feature extraction. JMLR 21(111), 1–5 (2020)

  7. Bailis, P., Gan, E., Madden, S., Narayanan, D., Rong, K., Suri, S.: Macrobase: prioritizing attention in fast data. In: SIGMOD (2017)

  8. Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. JMLR 11, 994 (2010)

  9. Bergmeir, C., Benítez, M.: On the use of cross-validation for time series predictor evaluation. Inf. Sci. 7, 191 (2012)

  10. Birge, L., Rozenholc, Y.: How many bins should be put in a regular histogram. In: ESAIM: Probability and Statistics, pp. 24–45 (2006)

  11. Blázquez-García, A., Conde, A., Mori, U., Lozano, J.: A review on outlier/anomaly detection in time series data. ACM Comput. Surv. 54(3), 98 (2021)

  12. Braei, M., Wagner, S.: Anomaly detection in univariate time-series: a survey on the state-of-the-art. CoRR, arxiv:2004.00433 (2020)

  13. Branco, P., Torgo, L., Ribeiro, R.: A survey of predictive modelling under imbalanced distributions. CoRR, arxiv:1505.01658 (2015)

  14. Breunig, M., Kriegel, H., Ng, R., Sander, J.: Lof: identifying density-based local outliers. SIGMOD Rec. 29(2), 799 (2000)

  15. Campos, G., Zimek, A., Sander, J., Campello, R., Micenková, B., Schubert, E., Assent, I., Houle, M.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov. 30(4), 891–927 (2016)

  16. Cao, L., Yang, D., Wang, Q., Yu, Y., Wang, J., Rundensteiner, E.: Scalable distance-based outlier detection over high-volume data streams. In: ICDE (2014)

  17. Carbone, P., Fragkoulis, M., Kalavri, V., Katsifodimos, A.: Beyond analytics: the evolution of stream processing systems. In: SIGMOD (2020)

  18. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 96 (2009)

  19. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for discrete sequences: a survey. IEEE TKDE 24(5), 119 (2012)

  20. Choudhary, D., Arun Kejariwal, A., Orsini, F.: On the runtime-efficacy trade-off of anomaly detection techniques for real-time streaming data. CoRR, arxiv:1710.04735 (2017)

  21. Cook, A., Mısırlı, G., Fan, Z.: Anomaly detection for IoT time-series data: a survey. IEEE IoT J. 7(7), 88 (2020)

  22. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: (ICML’06), pp. 233–240 (2006)

  23. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. JMLR 7, 1–30 (2006)

  24. Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms. Pattern Recogn. 74(C), 406–421 (2018)

  25. Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 478 (2018)

  26. Dua, D., Graff, C.: Uci Machine Learning Repository. University of California, School of Information and Computer Sciences, Irvine (2017)

  27. Dudani, S.: The distance-weighted k-nearest-neighbor rule. IEEE Trans. SMC 6(4), 325–327 (1976)

  28. Emmott, A., Das, S., Dietterich, T., Fern, A., Wong, W.: A meta-analysis of the anomaly detection problem. CoRR, arxiv:1503.01158 (2015)

  29. Goix, N., Drougard, N., Brault, R., Chiapino, M.: One class splitting criteria for random forests. In: ACML (2017)

  30. Goldstein, M., Uchida, S.: A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS One 11(4), 887 (2016)

  31. Granichin, O.N., Volkovich, Z., Toledano-Kitai, D.: Randomized Algorithms in Automatic Control and Data Mining, vol. 67. Springer, Berlin (2015)

  32. Guha, S., Mishra, N., Roy, G., Schrijvers, O.: Robust random cut forest based anomaly detection on streams. In: ICML’16, pp. 2712–2721 (2016)

  33. Gupta, M., Gao, J., Aggarwal, C., Han, J.: Outlier detection for temporal data: a survey. IEEE TKDE 26(9), 83 (2014)

  34. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, Berlin (2009)

  35. Herbold, S.: Autorank: a python package for automated ranking of classifiers. J. Open Source Softw. 3, 2173 (2020)

  36. Hodge, V., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 69 (2004)

  37. Jacob, V., Song, F., Stiegler, A., Rad, B., Diao, Y., Tatbul, N.: Exathlon: a benchmark for explainable anomaly detection over time series. PVLDB 14(11), 58 (2021)

  38. Keller, F., Müller, E., Böhm, K.: Hics: high contrast subspaces for density-based outlier ranking. In: ICDE, pp. 1037–1048 (2012)

  39. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI (1995)

  40. Kontaki, M., Gounaris, A., Papadopoulos, A., Tsichlas, K., Manolopoulos, Y.: Continuous monitoring of distance-based outliers over data streams. In: ICDE (2011)

  41. Li, J., Maier, D., Tufte, K., Papadimos, V., Tucker, P.: Semantics and evaluation techniques for window aggregates in data streams. In: SIGMOD (2005)

  42. Lindner, G., Studer, R.: Ast: Support for algorithm selection with a CBR approach. In: Principles of Data Mining and Knowledge Discovery, pp. 418–423 (1999)

  43. Liu, F.T., Ting, K.M., Zhou, Z.: Isolation forest. In: ICDM, pp. 413–422 (2008)

  44. Lobo, J., Jiménez-Valverde, A., Real, R.: AUC: a misleading measure of the performance of predictive distribution models. Global Ecol. Biogeogr. 17(2), 9008 (2008)

  45. Manzoor, E., Lamba, H., Akoglu, L.: Xstream: outlier detection in feature-evolving data streams. In: KDD (2018)

  46. Na, G.S., Kim, D., Yu, H.: Dilof: effective and memory efficient local outlier detection in data streams. In: KDD, pp. 1993–2002 (2018)

  47. Orair, G., Teixeira, C., Meira, W., Wang, Y., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 788 (2010)

  48. Pang, G., Shen, C., Cao, L., Hengel, A.: Deep learning for anomaly detection: a review. ACM Comput. Surv. 54(2), 89 (2021)

  49. Pevný, T.: Loda: lightweight on-line detector of anomalies. Mach. Learn. 102(2), 116 (2016)

  50. Qin, X., Cao, L., Rundensteiner, E.A., Madden, S.: Scalable kernel density estimation-based local outlier detection over large data streams. In: EDBT (2019)

  51. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: SIGMOD, pp. 427–438 (2000)

  52. Rastrigin, L.A.: The convergence of the random search method in the extremal control of a many parameter system. Autom. Remote Control 4, 1337–1342 (1963)

  53. Rogers, J., Gunn, S.: Identifying feature relevance using a random forest. In: SLSFS, pp. 173–184, Bohinj, Slovenia (2005)

  54. Roy, S.N.: On a Heuristic method of test construction and its use in multivariate analysis. Ann. Math. Stat. 6, 220–238 (1953)

  55. Sadik, S., Gruenwald, L.: Research issues in outlier detection for data streams. SIGKDD Explor. Newsl. 15(1), 78 (2014)

  56. Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than ROC when evaluating binary classifiers on imbalanced datasets. PLoS One 10(3), 708 (2015)

  57. Sathe, S., Aggarwal, C.: Subspace histograms for outlier detection in linear time. KAIS 56(3), 68 (2018)

  58. Silva, J., Faria, E., Barros, R., Hruschka, E., de Carvalho, A., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 9114 (2013)

  59. Somol, P., Grim, J., Filip, J., Pudil, P.: On stopping rules in dependency-aware feature ranking. In: CIARP (2013)

  60. Tan, S.C., Ting, K.M., Liu, F.T.: Fast anomaly detection for streaming data. In: IJCAI (2011)

  61. Tatbul, N., Lee, T.J., Zdonik, S., Alam, M., Gottschlich, J.: Precision and recall for time series. In: NIPS (2018)

  62. Ting, K.M., Washio, T., Wells, J.R., Aryal, S.: Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach. Learn. 5, 55–91 (2017)

  63. Tran, L., Fan, L., Shahabi, C.: Distance-based outlier detection in data streams. PVLDB 9(12), 96 (2016)

  64. Tran, L., Mun, M., Shahabi, C.: Real-time distance-based outlier detection in data streams. PVLDB 14(2), 7006 (2020)

  65. van Stein, B., van Leeuwen, M., Bäck, T.: Local subspace-based outlier detection using global neighbourhoods. CoRR, arxiv:1611.00183 (2016)

  66. Vanschoren, J.: Meta-Learning, pp. 35–61 (2019)

  67. Vanschoren, J., van Rijn, J., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. SIGKDD Explor. 15(2), 96 (2013)

  68. Wang, H., Bah, J., Hammad, M.: Progress in outlier detection techniques: a survey. IEEE Access 7, 998 (2019)

  69. Wu, R., Keogh, E.: Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. In: IEEE TKDE (2021)

  70. Xia, S., Xiong, Z., Luo, Y., Xu, W., Zhang, G.: Effectiveness of the Euclidean distance in high dimensional spaces. Optik 4, 5614–5619 (2015)

  71. Yang, J., Rahardja, S., Fränti, P.: Outlier detection: how to threshold outlier scores? In: AIIPCC (2019)

  72. Yoon, S., Lee, J., Lee, B.: Ultrafast local outlier detection from a data stream with stationary region skipping. In: KDD (2020)

  73. Yoon, S., Lee, J., Lee, B.S.: Nets: extremely fast outlier detection from a data stream via set-based processing. PVLDB 12(11), 998 (2019)

  74. Zhang, E., Zhang, Y.I.: Average precision. In: Encyclopedia of Database Systems (2009)

  75. Zhao, Y., Rossi, A., Akoglu, L.: Automating outlier detection via meta-learning. CoRR, arxiv:2009.10606 (2020)

  76. Zimek, A., Filzmoser, P.: There and back again: outlier detection between statistical reasoning and data mining algorithms. Int. Rev. Data Min. Knowl. Discov. 8(6), 66 (2018)

  77. Zimek, A., Gaudet, M., Campello, R., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: KDD (2013)

  78. Zimek, A., Schubert, E., Kriegel, H.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 997 (2012)

Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) /ERC Grant Agreement n. 617393. The research work was also supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment grant” (Project Number: 1941).

Author information

Corresponding author

Correspondence to Vassilis Christophides.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Offline detectors

Based on the findings of previous benchmarking efforts [15, 24, 25], we considered representative offline detectors from three families, namely distance-based like the weighted \(\hbox {KNN}_W\) [51], density-based like LOF [14] and tree-based like IF [43] and OCRF [29].

1.1 Weighted K nearest neighbor (\(\hbox {KNN}_{W}\))

\(\hbox {KNN}_{W}\) is a score-based variation [27] of the distance-based K nearest neighbors (KNN) [51] classifier. An anomaly is a sample that lies at a substantially higher distance from the remaining samples (i.e., it has few neighbors at a close distance).

To assign a real-valued score to each sample, it computes a matrix of distances, called MD: each row represents a sample and each column the distance to another sample. Then, using MD, it computes the score of a sample p as the maximum distance dist over its K nearest neighbors (columns) as follows:

$$\begin{aligned} score(p) = \max _{q \in KNN(p)} dist(p, q) \end{aligned}$$
(A1)

A higher score indicates a higher degree of abnormality. The number of nearest neighbors K and the distance metric dist(\(\cdot \)) are the two hyper-parameters of \(\hbox {KNN}_W\). Its effectiveness depends on how carefully K is chosen w.r.t. the data characteristics. For a dataset of size N, \(\hbox {KNN}_{W}\) needs quadratic time \(O(N^2)\) to construct MD and \(O(NK)\) time to score all samples.
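To make the scoring of Eq. A1 concrete, the following minimal Python sketch (an illustration assuming NumPy and SciPy, not the implementation used in our experiments) builds the distance matrix MD and returns the \(\hbox {KNN}_{W}\) score of every sample:

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_w_scores(X, k=5, metric="euclidean"):
    """Score every sample by the distance to its k-th nearest neighbor,
    i.e., the maximum distance over its k nearest neighbors (Eq. A1)."""
    md = cdist(X, X, metric=metric)       # the N x N distance matrix MD, O(N^2)
    np.fill_diagonal(md, np.inf)          # a sample is not its own neighbor
    return np.sort(md, axis=1)[:, k - 1]  # k-th smallest distance per row

# Illustrative usage: the injected far-away point gets the largest score.
X = np.vstack([np.random.normal(0, 1, (200, 3)), [[8.0, 8.0, 8.0]]])
print(np.argmax(knn_w_scores(X, k=10)))   # prints 200, the injected anomaly
```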

1.2 Local outlier factor (LOF)

LOF [14] is a density-based detector which computes the local density deviation of a sample w.r.t. its neighbors. Essentially, LOF considers as an anomaly any sample that lies in a sparse area while its nearest neighbors lie in dense areas.

LOF first computes the reachability distance rd between a given sample p and another sample o according to Eq. A2, where dist(p, o) is the direct distance between the two samples and k-dist(o) is the distance of o to its k-th nearest neighbor.

$$\begin{aligned} rd(p, o) = max(dist(p, o), \text {k-}dist(o)) \end{aligned}$$
(A2)

Then, LOF computes the local reachability density of a sample p as the inverse average reachability distance of p from its k-nearest neighbors KNN(p):

$$\begin{aligned} lrd(p) = \frac{1}{\frac{1}{\vert KNN(p) \vert } \sum _{o \in KNN(p) } rd(p, o) }. \end{aligned}$$
(A3)

The final score of a sample p is computed by comparing its local reachability density to the average local reachability density of its neighbors:

$$\begin{aligned} LOF(p) = \frac{1}{\vert KNN(p) \vert } \sum _{o \in KNN(p) } \frac{ lrd(o) }{ lrd(p) } \end{aligned}$$
(A4)

The number of nearest neighbors K and the distance metric \( dist (\cdot )\) are the two hyper-parameters of LOF. The effectiveness of the algorithm strongly depends on the careful selection of K. LOF needs quadratic time \(O(N^2)\) to compute the distances between each pair of samples and linear time O(N) to score samples, where N is the dataset size.
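For reference, an offline LOF implementation is available in scikit-learn (Footnote 19); a minimal usage sketch follows, where the toy data and the value of n_neighbors are purely illustrative rather than the tuned settings of Table 14:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # dense inliers
               rng.uniform(-6, 6, size=(10, 2))])   # sparse candidate anomalies

# novelty=False scores the fitted data itself (the offline setting used here).
lof = LocalOutlierFactor(n_neighbors=20, metric="euclidean", novelty=False)
lof.fit(X)
scores = -lof.negative_outlier_factor_   # LOF(p): > 1 means lower density than neighbors
top_10 = np.argsort(scores)[-10:]        # indices of the 10 most anomalous samples
```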

1.3 Isolation forest (IF)

IF is a tree-based ensemble detector [43], which relies on the number of partitions needed to isolate a sample from the rest of the dataset. The fewer partitions needed to isolate a sample, the more likely it is to be an anomaly.

The algorithm uses a forest of random Trees built on sub-samples of the dataset drawn by bootstrapping of size Max Samples. For each sub-sample, it constructs an isolation tree by uniformly selecting features and their split values as internal nodes. Samples are stored in the leaves, and the actual height of the tree can be limited to a Max Height. The length of the path from the tree root to a leaf node measures how well a sample in that node is isolated from the others: short paths provide evidence of anomalies. The outlyingness score of a sample p is then computed by averaging its path length over all isolation trees in the forest as follows:

$$\begin{aligned} score(p, n) = 2^{- E(h(p))/c(n)} \end{aligned}$$
(A5)

where n is the Max Samples size, E(h(p)) is the average of the actual height h(p) (see Footnote 25) and c(n) is used for normalization:

$$\begin{aligned} c(n) = 2 (ln(n - 1) + 0.5772156649) - 2 (n - 1) / n \end{aligned}$$
(A6)

The closer the score is to 1, the more likely the sample is to be an anomaly. A score much smaller than 0.5 indicates an inlier.

The number of Trees t, the number of samples a tree may contain Max Samples m, and its Max Height h are the three hyper-parameters of IF. Max Height is typically set to \(\lceil log(n) \rceil \), while the variance of the scores is usually minimized with around 100 random Trees. By avoiding costly distance or density calculations, IF requires linear time O(tmh) to build a forest and linear time O(tnh) to score n samples.
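As a worked illustration of Eqs. A5–A6 (a sketch of the formulas above, not the forest construction itself), the normalization term c(n) and the resulting score can be computed as follows:

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Normalization term of Eq. A6 (average path length of an unsuccessful
    binary search tree lookup); defined for n > 1."""
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def if_score(avg_path_length, n):
    """Anomaly score of Eq. A5, given E(h(p)) over the forest and the
    sub-sample size n (Max Samples)."""
    return 2.0 ** (-avg_path_length / c(n))

# With Max Samples n = 256, a sample isolated after ~4 splits on average
# scores around 0.76, whereas one needing ~12 splits scores around 0.44.
print(if_score(4.0, 256), if_score(12.0, 256))
```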

1.4 One-class random forest (OCRF)

OCRF is a tree-based ensemble detector [29] that extends Random Forests. It relies on the adaptation of a two-class splitting criterion to the one-class setting, by assuming an adaptive distribution of anomalies in each node, i.e., assuming that the number of anomalies is equal to the number of normal samples in each node. After splitting, one child node captures the maximum number of samples in a minimal space, while the other captures the exact opposite. The closer the leaf containing a sample is to the root of the tree, the more likely this sample is to be an anomaly.

Similarly to IF, it builds a forest of Trees on sub-samples and features of the dataset drawn by bootstrapping of size Max Samples and Max Features, respectively. For each sub-sample, it constructs a tree by selecting the feature and split value that minimize the one-class Gini improvement proxy, given by the following formula:

$$\begin{aligned} I_G^{OC-ad}(t_L, t_R) = \frac{n_{t_L}\, \gamma n_t \lambda _L}{n_{t_L} + \gamma n_t \lambda _L} + \frac{n_{t_R}\, \gamma n_t \lambda _R}{n_{t_R} + \gamma n_t \lambda _R} \end{aligned}$$
(A7)

where \(n_t\), \(n_{t_L}\) and \(n_{t_R}\) are the numbers of samples in the node t and its children nodes (L, R), respectively, \(\lambda _L = Leb(X_{t_L})/ Leb(X_t)\) and \(\lambda _R = Leb(X_{t_R})/ Leb(X_t)\) are the volume proportions of the children nodes, and \(\gamma \) is a parameter which influences the optimal splits.

Table 9 The values of the varying and fixed hyper-parameters

The samples are stored in the leaves, and the actual height of the tree can be limited to a Max Height. The length of the path from the tree root to a leaf node measures how well a sample in that node is isolated from the others: short paths provide evidence of anomalies. The anomalousness score of a sample p is then computed by averaging its path length over all trees in the forest as follows:

$$\begin{aligned} \log _2 score(x) = - \left( \sum _{t\ \mathrm {leaves}}{\mathbbm {1}_{\{x \in t\}}\, \left( d_t + c(n_t)\right) }\right) / c(n) \end{aligned}$$
(A8)

where \(d_t\) is the depth of node t, and \(c(n) = 2H(n-1)-2(n-1)/n\), with H(i) being the i-th harmonic number.

The closer the score is to 1, the more likely the sample is to be an anomaly. A score much smaller than 0.5 indicates an inlier.

The number of Trees t, the number of samples a tree may contain Max Samples m, the number of features for those samples Max Features f and its Max Height h are the four hyper-parameters of OCRF. Similarly to IF, Max Height is typically set to \(\lceil log (n) \rceil \), while the variance of the scores is usually minimized with around 100 random Trees. OCRF requires linear time O(tmfh) to build a forest and linear time O(tnh) to score n samples.
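For illustration, the split selection of Eq. A7 amounts to evaluating a small function such as the following sketch over the candidate splits and keeping the minimizer (the argument names are ours; this is not the OCRF implementation of Footnote 20):

```python
def oc_gini_proxy(n_t, n_tL, n_tR, lam_L, lam_R, gamma=1.0):
    """One-class adaptive Gini improvement proxy of Eq. A7 for a candidate
    split of node t into children (L, R); the split minimizing this value
    is selected."""
    left = (n_tL * gamma * n_t * lam_L) / (n_tL + gamma * n_t * lam_L)
    right = (n_tR * gamma * n_t * lam_R) / (n_tR + gamma * n_t * lam_R)
    return left + right
```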

Appendix 2: Hyper-parameters

We are interested in assessing the effectiveness of a detector under optimal conditions. In this respect, we distinguish between hyper-parameters whose values can be set independently of the datasets and those whose values need to be estimated per dataset (see Footnote 26).

Table 9 presents the fixed values of various hyper-parameters we set in our experiments (e.g., training/testing splitting of datasets, window size and slide) as well as the hyper-parameters of detectors along with the range of values we tested per dataset using random search (RS) [52] (see Footnote 27). The employed ranges empirically ensure the existence of at least one optimal value per detector and dataset. To respect the isolation requirement [34, 39], the optimization phase should take place on an additional validation partition, built from samples of the training partition that remain unseen during testing. According to Tables 10–14, the optimal hyper-parameters of each detector per dataset differ in most cases from the default values originally proposed.
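As an illustration of this tuning protocol, the sketch below draws hyper-parameter values at random from given ranges and keeps the configuration maximizing AP on the validation partition. Here build_detector is a hypothetical factory returning an object with fit(X) and score_samples(X) (higher score = more anomalous); it stands in for whichever detector implementation is being tuned:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def random_search(build_detector, param_ranges, X_train, X_val, y_val,
                  n_trials=50, seed=0):
    """Random search (RS) over `param_ranges` ({name: list of candidate
    values}), validated on the held-out partition (X_val, y_val)."""
    rng = np.random.default_rng(seed)
    best_ap, best_params = -np.inf, None
    for _ in range(n_trials):
        params = {name: rng.choice(values)      # sample one value per hyper-parameter
                  for name, values in param_ranges.items()}
        detector = build_detector(**params)
        detector.fit(X_train)
        ap = average_precision_score(y_val, detector.score_samples(X_val))
        if ap > best_ap:
            best_ap, best_params = ap, params
    return best_params, best_ap
```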

Table 10 Optimal hyper-parameter values of each online detector per dataset after tuning
Table 11 Optimal hyper-parameter values of online detectors HST/F, RRCF and RS-Hash per Exathlon dataset after tuning (only if the AUC ROC score of the detectors is greater than 0.5)

We are additionally interested in assessing how sensitive detectors are to the tuning of their hyper-parameters. In this respect, we compute the average coefficient of variation (see Footnote 28) of the MAP scores of the models tuned with different hyper-parameters across all datasets. As we can see in Fig. 17, for most online detectors (besides MCOD, LEAP and RS-HASH) the average coefficient is below or close to 0.10. Recall that the tiny performance differences between models are mostly due to the non-deterministic nature of the detectors. On the other hand, the average coefficient of variation of offline detectors lies around 0.2 and above, revealing that, regardless of their anomalousness criteria, they are much more sensitive to hyper-parameter tuning. As we can see from Table 10, the hyper-parameters of distance-based detectors differ in every dataset, which reveals the need for good tuning according to the characteristics of each dataset (value range, average distance, etc.). An interesting observation is that MCOD and CPOD succeed in having a smaller average coefficient of variation than the offline detectors (while LEAP is also close). This is attributed to the fact that they exhibit a performance similar to the Random Classifier in most configurations where the two hyper-parameter values are not close to the optimal ones, which leads to a lower coefficient of variation than expected. On the contrary, tuning does not significantly improve the effectiveness of tree-based and projection-based online detectors.
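For clarity, the sensitivity measure reported in Fig. 17 can be computed along the lines of the following sketch, assuming the MAP scores of all hyper-parameter configurations are available per dataset (illustrative code, not the evaluation scripts of the paper):

```python
import numpy as np

def avg_coefficient_of_variation(map_scores_per_dataset):
    """map_scores_per_dataset: {dataset name: array of MAP scores obtained
    with the different hyper-parameter configurations of one detector}.
    Returns the coefficient of variation (std / mean) averaged over datasets."""
    cvs = [np.std(scores) / np.mean(scores)
           for scores in map_scores_per_dataset.values()]
    return float(np.mean(cvs))

# A detector whose MAP barely changes under tuning yields a small value.
print(avg_coefficient_of_variation({"Wilt":   np.array([0.30, 0.31, 0.29]),
                                    "Isolet": np.array([0.55, 0.60, 0.50])}))
```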

Regarding hyper-parameters with fixed values across all datasets, we are finally interested in investigating the impact of the window size on detectors' effectiveness. To this end, we tested our detectors on four representative datasets w.r.t. the number of samples/features: Isolet (many samples/features), Wilt (many samples/few features), InternetAds (few samples/many features) and Diabetes (few samples/features). Figure 16 depicts the median ratio of the MAP scores, using the window size (ws) = 128 employed in our experiments as baseline. In terms of the window type, we used sliding windows on RRCF, MCOD, LEAP, CPOD, STARE and RS-HASH and tumbling windows on the remaining detectors. According to our experiments, the former detectors perform better on sliding windows than on tumbling ones. Note that the window slide indicates the number of past points to forget from the models built. HST exhibits a slightly better performance with ws = 128, especially on the datasets with few samples. L-S has a major performance boost using ws = 128 on datasets with few samples, while its performance remains stable on the rest. RRCF performs better on datasets with few samples with a window size of 64, while HSTF has a performance boost on datasets with many features using ws = 256. RS-HASH performs better using ws = 128 in all cases except on datasets with many samples and few features. CPOD, LEAP and STARE exhibit better effectiveness using ws = 128 in all cases. X-S is stable across all window sizes.

Table 12 Offline detectors’ number of wins (using AP Scores) and average AP difference from the winner per dataset
Table 13 Offline detectors' number of wins (using AUC Scores) and average AUC ROC difference from the winner per dataset
Table 14 Optimal hyper-parameter values of each offline detector per dataset after tuning

To ensure a fair comparison of detectors in the remainder of our work, we avoid varying the window size per dataset. In this respect, we set the window size to 128, as it proves to be optimal for most pairs of datasets and detectors. Moreover, we did not choose a window size of 256 (or greater), because we need to ensure at least one full window for testing after the training phase; note that the dataset with the lowest number of samples in our benchmark is Ionosphere, with 351 samples.

Fig. 16 Median ratio of MAP scores on different window sizes (64, 256) with 128 as baseline, using Diabetes, Isolet, InternetAds and Wilt

Fig. 17 Average coefficient of variation, across all datasets, of the MAP scores of the models built with different hyper-parameters: online detectors appear to be less sensitive

Appendix 3: AUC/MAP of detectors

For the sake of completeness, we also investigate the performance ranking of offline detectors. Table 12 depicts the total number of wins of offline detectors using MAP. X-B and LOF each outperform the remaining detectors in 7 datasets. This makes LOF the best-performing proximity-based detector and X-B the best ensemble-based one. KNN performs best in 3 datasets, as do OCRF and IF. The worst-performing detector is L-B, which manages to lead in only 1 dataset. It is worth mentioning that KNN, along with LOF, has the lowest average difference from the leader, at 9.8%. There are no major changes in Table 13, which illustrates the number of wins using AUC ROC. LOF achieves more wins in AUC ROC, being the best detector and leading in 9 out of 24 datasets, while X-B gets 2 fewer wins than before. OCRF drops to just one win and also has the highest average difference from the leader. It is worth mentioning that OCRF does not manage to make any split in some datasets, due to the high range of the features, which leads to an infinite volume (\(> 10^{308}\)).

Table 15 (M)AP Scores of online and offline detectors per dataset
Table 16 AUC Scores of online and offline detectors per dataset

As in the previous section (Sect. 4.2), we use the nonparametric Friedman test [23] in order to determine whether there is any significant difference between the average ranks of the detectors. With a \(p value < 0.000\), we reject at the 5% significance level the null hypothesis that all detectors' performances are the same. Subsequently, we use the post hoc Nemenyi test in order to compare the detectors in pairs. There is a significant difference when the difference between the average ranks of two detectors is higher than a critical distance (CD) of 1.539 (for 6 detectors on 25 datasets at a significance level a = 0.05). Figure 18a illustrates the ranking of detectors w.r.t. MAP scores. KNN and X-B are both ranked first (rank tie), followed closely by LOF. KNN is ranked higher than LOF, despite having almost half of its wins, due to its low average difference from the leader. We observe that there is a statistically significant difference between the first three detectors (LOF, KNN, X-B) and both OCRF and L-B, which come last. As we can see in Fig. 18b, there are no drastic changes in the ranking when using AUC ROC. LOF is first, as expected from the previous results (9/24 wins), followed by KNN and X-B. There is a statistically significant difference between the two proximity-based detectors (LOF, KNN) and both L-B and OCRF, while IF and X-B have a statistically significant difference only with OCRF.
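The statistical protocol above can be reproduced along the following lines, assuming SciPy and the scikit-posthocs package are available (an assumption for illustration; the same tests could be run with, e.g., Autorank [35]). The score matrix is filled with random values only to show the expected input shape (25 datasets × 6 detectors):

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# scores[i, j] = AP (or AUC ROC) of detector j on dataset i;
# columns ordered, e.g., as LOF, KNN, X-B, IF, L-B, OCRF (illustrative values).
rng = np.random.default_rng(0)
scores = rng.uniform(0.2, 0.9, size=(25, 6))

# Friedman test on the per-dataset ranks of the 6 detectors.
stat, p_value = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman chi2 = {stat:.2f}, p = {p_value:.4f}")

# Post hoc Nemenyi test: matrix of pairwise p-values between detectors.
print(sp.posthoc_nemenyi_friedman(scores).round(3))
```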

Fig. 18 Offline detectors' ranking

Overall, we observe that the two proximity-based detectors (LOF, KNN), as well as X-B, outperform all other detectors, with L-B and OCRF exhibiting the worst performance. LOF performs better when using AUC ROC, while the rest of the detectors remain stable w.r.t. both metrics.

Appendix 4: Meta-features

Table 17 Statistically significant correlations between the ratio of X-S divided by the respective online detector and meta-features. We report only pairs that had a statistically significant correlation at a significance level of 0.05

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Ntroumpogiannis, A., Giannoulis, M., Myrtakis, N. et al. A meta-level analysis of online anomaly detectors. The VLDB Journal 32, 845–886 (2023). https://doi.org/10.1007/s00778-022-00773-x
