Abstract
Anomaly or outlier detection is considered one of the vital applications of data mining. Anomalies are data points that differ dramatically from the rest of the data. In this survey, we comprehensively present anomaly detection algorithms in an organized manner. We begin with the definition of an anomaly, then cover the essential elements of anomaly detection, such as the different types of anomalies, the application domains, and the evaluation measures. The anomaly detection algorithms, 52 in total, are grouped into seven categories based on their working mechanisms: statistics, density, distance, clustering, isolation, ensemble, and subspace. For each category, we provide the time complexity of each algorithm and the general advantages and disadvantages of the category. Finally, we compare all of the discussed anomaly detection algorithms in detail.
Code availability
Not applicable.
Data Availability
Not applicable.
Notes
Anomaly and outlier are widely used terms. In this work, we will use both terms interchangeably.
Anomaly detection and outlier detection are widely used terms. In this paper, we use both terms interchangeably.
The time complexity of these kinds of algorithms can be reduced to \(O(n\log (n))\) by using a good indexing structure, but such structures are not feasible in high-dimensional space. We therefore report time complexities without such an index throughout the paper.
Clustered anomalies are anomalies that form a small cluster of a few points outside the normal cluster.
Some algorithms choose subspaces based on a statistical test (e.g. HiCS, CMI), and some choose them randomly (e.g. Zero++).
Subspace-based anomaly detection algorithms must first search for a subspace, which takes additional time that depends on the search method. We report only the scoring time within a subspace.
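To illustrate the note on time complexity above, the following is a minimal sketch of a distance-based score computed without any index structure. It is not taken from the paper: the function name `knn_distance_scores`, the choice of the k-th nearest-neighbour distance as the score, and the toy data are all assumptions made for illustration. Because every point is compared with every other point, the number of distance computations grows quadratically with the number of points, which is the cost an index would reduce in low-dimensional data.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour.

    Brute force, no index: each point is compared with every other
    point, so the number of distance computations is O(n^2).
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to all other points (n - 1 computations).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical toy data: four points in a tight group and one isolated point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_distance_scores(data, k=2)

# Index of the point with the largest score, i.e. the most anomalous one.
most_anomalous = max(range(len(data)), key=scores.__getitem__)
```

In this toy example the isolated point receives the largest score. A larger score means the point is farther from its neighbours, so ranking by score gives an anomaly ranking.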
References
Tien JM (2017) Internet of things, real-time decision making, and artificial intelligence. Ann Data Sci 4(2):149–178
Ahmed M, Najmul Islam AKM (2020) Deep learning: hope or hype. Ann Data Sci 7(3):427–432
Chandola V, Banerjee A, Kumar V (2007) Outlier detection: a survey. ACM Comput Surv 14:15
Grubbs FE (1969) Procedures for detecting outlying observations in samples. Technometrics 11(1):1–21
Hawkins DM (1980) Identification of outliers, vol 11. Springer, Berlin
Barnett V, Lewis T (1984) Outliers in statistical data, 3rd edn. Wiley, New York
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00, Association for Computing Machinery, New York, NY, USA, pp 93–104
Jiang MF, Tseng SS, Su CM (2001) Two-phase clustering process for outliers detection. Pattern Recogn Lett 22(6):691–700
Hu T, Sung SY (2003) Detecting pattern-based outliers. Pattern Recogn Lett 24(16):3059–3068
Aryal S, Baniya AA, Santosh KC (2019) Improved histogram-based anomaly detector with the extended principal component features. arXiv preprint arXiv: 1909.12702
Ahmed M (2018) Collective anomaly detection techniques for network traffic analysis. Ann Data Sci 5(4):497–512
Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Aggarwal CC (2017) An introduction to outlier analysis. Springer, Cham, pp 1–34
Olson DL, Shi Y, Shi Y (2007) Introduction to business data mining, vol 10. McGraw-Hill/Irwin, New York
Nick C (2009) Precision at n. Springer, Boston, pp 2127–2128
Zhang E, Zhang Y (2009) Average precision. Springer, Boston, pp 192–193
Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11(4):1–31
Campos GO, Zimek A, Sander J, Campello RJGB, Micenková B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Disc 30(4):891–927
Shewhart WA (1930) Economic quality control of manufactured product. Bell Syst Tech J 9(2):364–389
Rosner B (1983) Percentage points for a generalized ESD many-outlier procedure. Technometrics 25(2):165–172
Liu J-P, Weng C-S (1991) Detection of outlying data in bioavailability/bioequivalence studies. Stat Med 10(9):1375–1389
Surace C, Worden K, Tomlinson G (1997) A novelty detection approach to diagnose damage in a cracked beam. In: Proceedings-SPIE the international society for optical engineering, Citeseer, pp 947–953
Surace C, Worden K et al (1998) A novelty detection method to diagnose damage in structures: an application to an offshore platform. In: The eighth international offshore and polar engineering conference, International Society of Offshore and Polar Engineers
Laurikkala J, Juhola M, Kentala E (2000) Informal identification of outliers in medical data. In: Fifth international workshop on intelligent data analysis in medicine and pharmacology, vol 1, pp 20–24
Ye N, Chen Q (2001) An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Qual Reliab Eng Int 17(2):105–112
Rousseeuw PJ, Leroy AM (2005) Robust regression and outlier detection, vol 589. Wiley, New York
Horn PS, Feng L, Li Y, Pesce AJ (2001) Effect of outliers and nonhealthy individuals on reference interval estimation. Clin Chem 47(12):2137–2142
Solberg HE, Lahti A (2005) Detection of outliers in reference distributions: performance of Horn’s algorithm. Clin Chem 51(12):2326–2332
Dovoedo YH, Chakraborti S (2015) Boxplot-based outlier detection for the location-scale family. Commun Stat Simul Comput 44(6):1492–1513
Gibbons RD (1994) Statistical methods for groundwater monitoring. Wiley, New York
Javitz HS, Valdes A (1991) The SRI IDES statistical anomaly detector. In: Proceedings of 1991 IEEE computer society symposium on research in security and privacy, pp 316–326
Gebski M, Wong RK (2007) An efficient histogram method for outlier detection. In: Ramamohanarao K, Krishna PR, Mohania M, Nantajeewarawat E (eds) Advances in databases: concepts, systems and applications. Springer, Berlin, pp 176–187
Jiang X-B, Li G-Y, Lian S (2011) Outlier detection algorithm based on variable-width histogram for wireless sensor network. J Comput Appl 31(3):694–697
Goldstein M, Dengel A (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In: KI-2012: poster and demo track, pp 59–63
Xie M, Hu J, Tian B (2012) Histogram-based online anomaly detection in hierarchical wireless sensor networks. In: 2012 IEEE 11th international conference on trust, security and privacy in computing and communications, pp 751–759
Latecki LJ, Lazarevic A, Pokrajac D (2007) Outlier detection with kernel density functions. In: Perner P (ed) Machine learning and data mining in pattern recognition. Springer, Berlin, pp 61–75
Oh JH, Gao J (2009) A kernel-based approach for detecting outliers of high-dimensional biological data. In: BMC bioinformatics, vol 10, Springer, p S7
Gao J, Hu W, Zhang Z, Zhang X, Wu O (2011) RKOF: robust kernel-based local outlier detection. In: Huang JZ, Cao L, Srivastava J (eds) Advances in knowledge discovery and data mining. Springer, Berlin, pp 270–283
Askari A, Yang F, Ghaoui LE (2018) Kernel-based outlier detection using the inverse christoffel function
Liu F, Yu Y, Song P, Fan Y, Tong X (2020) Scalable KDE-based top-n local outlier detection over large-scale data streams. Knowl Based Syst 204:106186
Siegel AF, Morgan CJ (1988) Statistics and data analysis: an introduction, 2nd edn. Wiley, New York
Zhang Y, Hamm NAS, Meratnia N, Stein A, van de Voort M, Havinga PJM (2012) Statistics-based outlier detection for wireless sensor networks. Int J Geogr Inf Sci 26(8):1373–1392
Zimek A, Filzmoser P (2018) There and back again: outlier detection between statistical reasoning and data mining algorithms. WIREs Data Min Knowl Discov 8(6):e1280
Tang J, Chen Z, Fu AWC, Cheung DW (2002) Enhancing effectiveness of outlier detections for low density patterns. In: Chen MS, Yu PS, Liu B (eds) Advances in knowledge discovery and data mining. Springer, Berlin, pp 535–548
Kriegel H-P, Kröger P, Schubert E, Zimek A (2009) LoOP: local outlier probabilities. In: Proceedings of the 18th ACM conference on information and knowledge management, CIKM ’09, Association for Computing Machinery, New York, NY, USA, pp 1649–1652
Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) LOCI: fast outlier detection using the local correlation integral. In: Proceedings 19th international conference on data engineering (Cat. No. 03CH37405), pp 315–326
Ren D, Wang B, Perrizo W (2004) RDF: a density-based outlier detection method using vertical data representation. In: Fourth IEEE international conference on data mining (ICDM’04), pp 503–506
Jin W, Tung AKH, Han J, Wang W (2006) Ranking outliers using symmetric neighborhood relationship. In: Proceedings of the 10th Pacific-Asia conference on advances in knowledge discovery and data mining, PAKDD’06, Springer, Berlin, pp 577–593
Fan H, Zaïane OR, Foss A, Wu J (2009) Resolution-based outlier factor: detecting the top-n most outlying data points in engineering data. Knowl Inf Syst 19(1):31–51
Goldstein M (2012) FastLOF: an expectation-maximization based local outlier detection algorithm. In: Proceedings of the 21st international conference on pattern recognition (ICPR2012), pp 2282–2285
Momtaz R, Mohssen N, Gowayyed MA (2013) DWOF: a robust density-based outlier detection approach. In: Sanches JM, Micó L, Cardoso JS (eds) Pattern recognition and image analysis. Springer, Berlin, pp 517–525
Schubert E, Zimek A, Kriegel H-P (2014) Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min Knowl Disc 28(1):190–237
Wells JR, Ting KM, Washio T (2014) LiNearN: a new approach to nearest neighbour density estimator. Pattern Recogn 47(8):2702–2720
Campello RJGB, Moulavi D, Zimek A, Sander J (2015) Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans Knowl Discov Data 10(1):1–51
Aryal S, Ting KM, Haffari G (2016) Revisiting attribute independence assumption in probabilistic unsupervised anomaly detection. In: Chau M, Wang GA, Chen H (eds) Intelligence and security informatics. Springer, Cham, pp 73–86
Abdi H, Williams LJ (2010) Principal component analysis. WIREs Comput Stat 2(4):433–459
Aggarwal CC (2017) Proximity-based outlier detection. Springer, Cham, pp 111–147
Domingues R, Filippone M, Michiardi P, Zouaoui J (2018) A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recogn 74:406–421
Knorr EM, Ng RT (1998) Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the 24th international conference on very large data bases, VLDB ’98, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 392–403
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. VLDB J 8(3):237–253
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. SIGMOD Rec 29(2):427–438
Ghoting A, Parthasarathy S, Otey ME (2008) Fast mining of distance-based outliers in high-dimensional datasets. Data Min Knowl Disc 16(3):349–364
Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08, Association for Computing Machinery, New York, pp 444–452
Wang B, Xiao G, Yu H, Yang X (2009) Distance-based outlier detection on uncertain data. In: 2009 Ninth IEEE international conference on computer and information technology, vol 1, pp 293–298
Zhang K, Hutter M, Jin H (2009) A new local distance-based outlier detection approach for scattered real-world data. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho T-B (eds) Advances in knowledge discovery and data mining. Springer, Berlin, pp 813–822
Sugiyama M, Borgwardt K (2013) Rapid distance-based outlier detection via sampling. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, vol 26, Curran Associates Inc, pp 467–475
Radovanović M, Nanopoulos A, Ivanović M (2015) Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans Knowl Data Eng 27(5):1369–1382
Wang H, Bah MJ, Hammad M (2019) Progress in outlier detection techniques: a survey. IEEE Access 7:107964–108000
Berchtold S, Keim DA, Kriegel H-P (1996) The X-tree: an index structure for high-dimensional data. In: Proceedings of the 22nd international conference on very large data bases, VLDB ’96, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 28–39
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. SIGMOD Rec 14(2):47–57
Sellis TK, Roussopoulos N, Faloutsos C (1987) The R+-tree: a dynamic index for multi-dimensional objects. In: Proceedings of the 13th international conference on very large data bases, VLDB ’87, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 507–518
Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517
Yu D, Sheikholeslami G, Zhang A (2002) FindOut: finding outliers in very large datasets. Knowl Inf Syst 4(4):387–412
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9–10):1641–1650
Jiang S, An Q (2008) Clustering-based outlier detection method. In: 2008 Fifth international conference on fuzzy systems and knowledge discovery, vol 2, pp 429–433
Liu FT, Ting KM, Zhou Z (2008) Isolation forest. In: 2008 Eighth IEEE international conference on data mining, pp 413–422
Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):1–39
Liu FT, Ting KM, Zhou ZH (2010) On detecting clustered anomalies using SCiForest. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 274–290
Tan SC, Ting KM, Liu TF (2011) Fast anomaly detection for streaming data. In: Proceedings of the twenty-second international joint conference on artificial intelligence, vol 2, IJCAI’11, AAAI Press, pp 1511–1516
Aryal S, Ting KM, Wells JR, Washio T (2014) Improving iForest with relative mass. In: Tseng VS, Ho TB, Zhou ZH, Chen ALP, Kao HY (eds) Advances in knowledge discovery and data mining. Springer, Cham, pp 510–521
Bandaragoda TR, Ting KM, Albrecht D, Liu FT, Wells JR (2014) Efficient anomaly detection by isolation using nearest neighbour ensemble. In: 2014 IEEE International conference on data mining workshop, pp 698–705
Bandaragoda TR, Ting KM, Albrecht D, Liu FT, Zhu Y, Wells JR (2018) Isolation-based anomaly detection using nearest-neighbor ensembles. Comput Intell 34(4):968–998
Pang G, Ting KM, Albrecht D (2015) LeSiNN: detecting anomalies by identifying least similar nearest neighbours. In: 2015 IEEE international conference on data mining workshop (ICDMW), pp 623–630
Zhang X, Dou W, He Q, Zhou R, Leckie C, Kotagiri R, Salcic Z (2017) LSHiForest: a generic framework for fast tree isolation based ensemble anomaly analysis. In: 2017 IEEE 33rd international conference on data engineering (ICDE), pp 983–994
Aryal S (2018) Anomaly detection technique robust to units and scales of measurement. In: Phung D, Tseng VS, Webb GI, Ho B, Ganji M, Rashidi L (eds) Advances in knowledge discovery and data mining. Springer, Cham, pp 589–601
Aryal S, Santosh KC, Dazeley R (2020) usfAD: a robust anomaly detector based on unsupervised stochastic forest. Int J Mach Learn Cybern 12:1–14
Ting KM, Zhou G-T, Liu FT, Tan JSC (2010) Mass estimation and its applications. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’10, Association for Computing Machinery, New York, NY, USA, pp 989–998
Fernando TL, Webb GI (2017) Simusf: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31(1):264–286
Ting KM, Washio T, Wells JR, Aryal S (2017) Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach Learn 106(1):55–91
Bandaragoda TR (2015) Isolation based anomaly detection: a re-examination. PhD thesis, Monash University
Pevný T (2016) Loda: lightweight on-line detector of anomalies. Mach Learn 102(2):275–304
Zhao Y, Hryniewicki MK (2018) DCSO: dynamic combination of detector scores for outlier ensembles. In: ACM SIGKDD ODD workshop, London, UK
Zhao Y, Nasrullah Z, Hryniewicki MK, Li Z (2019) LSCP: locally selective combination in parallel outlier ensembles. In: Proceedings of the 2019 SIAM international conference on data mining, SDM 2019, Calgary, Canada, pp 585–593
Aggarwal CC (2013) Outlier ensembles: position paper. SIGKDD Explor Newsl 14(2):49–58
Aggarwal CC (2017) Outlier ensembles. Springer, Cham, pp 185–218
Zimek A, Campello RJGB, Sander J (2014) Ensembles for unsupervised outlier detection: challenges and research questions a position paper. SIGKDD Explor Newsl 15(2):11–22
Kriegel H-P, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho TB (eds) Advances in knowledge discovery and data mining. Springer, Berlin, pp 831–838
Agrawal A (2009) Local subspace based outlier detection. In: Ranka S, Aluru S, Buyya R, Chung Y-C, Dua S, Grama A, Gupta SKS, Kumar R, Phoha VV (eds) Contemporary computing. Springer, Heidelberg, pp 149–157
Nguyen HV, Gopalkrishnan V, Assent I (2011) An unbiased distance-based outlier detection approach for high-dimensional data. In: Yu JX, Kim MH, Unland R (eds) Database systems for advanced applications. Springer, Berlin, pp 138–152
Kriegel H, Kröger P, Schubert E, Zimek A (2012) Outlier detection in arbitrarily oriented subspaces. In: 2012 IEEE 12th international conference on data mining, pp 379–388
Keller F, Müller E, Böhm K (2012) HiCS: high contrast subspaces for density-based outlier ranking. In: 2012 IEEE 28th international conference on data engineering, pp 1037–1048
Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: Proceedings of the 2013 SIAM international conference on data mining, SIAM, pp 198–206
Pang G, Ting KM, Albrecht D, Jin H (2016) Zero++: harnessing the power of zero appearances to detect anomalies in large-scale data sets. J Artif Intell Res 57:593–620
Aggarwal CC (2017) High-dimensional outlier detection: the subspace method. Springer, Cham, pp 149–184
Zimek A, Schubert E, Kriegel H-P (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Stat Anal Data Min ASA Data Sci J 5(5):363–387
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the manuscript.
Funding
No funding received.
Author information
Contributions
DS conducted the systematic literature review, examined various outlier detection techniques, and wrote the first draft of the manuscript. DS also made significant contributions to the design and structure of the review. AT reviewed the work and edited the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
Not applicable.
Ethical approval
This article does not contain any studies with human participants by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Samariya, D., Thakkar, A. A Comprehensive Survey of Anomaly Detection Algorithms. Ann. Data. Sci. 10, 829–850 (2023). https://doi.org/10.1007/s40745-021-00362-9