Abstract
This paper demonstrates that the performance of various outlier detection methods is sensitive to both the characteristics of the dataset and the data normalization scheme employed. To understand these dependencies, we formally prove that normalization affects the nearest-neighbor structure and the density of the dataset, and hence affects which observations could be considered outliers. We then perform an instance space analysis of combinations of normalization and detection methods. This analysis enables visualization of the strengths and weaknesses of these combinations, and gives insight into which combination might obtain the best performance for a given dataset.
Notes
Generally, normalization refers to scaling each attribute to \(\left[ 0,1\right] \), while standardization refers to scaling each attribute to \({\mathcal {N}}\left( 0,1\right) \). For the sake of simplicity, and without loss of generality, we use the term normalization to refer to both re-scalings in this paper.
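As an illustration of why the choice of re-scaling matters, the following minimal sketch applies both re-scalings to a small made-up dataset and reports the nearest neighbor of one point under each. The data, the helper names (`min_max`, `z_score`, `rescale`, `nearest_neighbor`) and the use of the population standard deviation are assumptions chosen for this example; they are not taken from the paper's experiments.

```python
import math
import statistics


def min_max(column):
    """Scale one attribute to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def z_score(column):
    """Scale one attribute to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


def rescale(points, scaler):
    """Apply a per-attribute scaler to a list of points."""
    columns = [scaler(list(col)) for col in zip(*points)]
    return [tuple(row) for row in zip(*columns)]


def nearest_neighbor(points, query_index):
    """Index of the point closest (Euclidean distance) to points[query_index]."""
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: math.dist(query, points[i]))


# Hypothetical toy data: the first attribute contains one extreme value (20.0),
# which inflates its range far more than its standard deviation.
points = [
    (0.0, 0.0),    # point 0: the query point
    (3.5, 0.0),    # point 1: differs from point 0 only in the first attribute
    (0.0, 2.0),    # point 2: differs from point 0 only in the second attribute
    (1.0, 10.0),
    (2.0, 10.0),
    (20.0, 10.0),
]

nn_min_max = nearest_neighbor(rescale(points, min_max), 0)
nn_z_score = nearest_neighbor(rescale(points, z_score), 0)
print("nearest neighbor of point 0 under min-max:", nn_min_max)
print("nearest neighbor of point 0 under z-score:", nn_z_score)
```

On this toy data the nearest neighbor of point 0 is point 1 under min-max scaling and point 2 under z-score scaling. Both re-scalings are linear per attribute, but they divide by different quantities (range versus standard deviation), so the relative weight of the two attributes in the distance changes.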
Acknowledgements
Funding was provided by the Australian Research Council through the Australian Laureate Fellowship FL140100012, and Linkage Project LP160101885. This research was supported in part by the Monash eResearch Centre and eSolutions-Research Support Services through the MonARCH HPC Cluster.
Additional information
Responsible editor: Srinivasan Parthasarathy.
Cite this article
Kandanaarachchi, S., Muñoz, M.A., Hyndman, R.J. et al. On normalization and algorithm selection for unsupervised outlier detection. Data Min Knowl Disc 34, 309–354 (2020). https://doi.org/10.1007/s10618-019-00661-z
DOI: https://doi.org/10.1007/s10618-019-00661-z