On normalization and algorithm selection for unsupervised outlier detection

Data Mining and Knowledge Discovery

Abstract

This paper demonstrates that the performance of outlier detection methods is sensitive to both the characteristics of the dataset and the data normalization scheme employed. To understand these dependencies, we formally prove that normalization affects the nearest neighbor structure and the density of the dataset, and hence which observations could be considered outliers. We then perform an instance space analysis of combinations of normalization and detection methods. This analysis enables visualization of the strengths and weaknesses of these combinations, and gives insights into which combination might obtain the best performance for a given dataset.
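The claim that normalization changes the nearest neighbor structure can be illustrated with a small sketch. The data, function names, and the choice of min-max and mean/standard-deviation scaling below are assumptions made for illustration and are not taken from the paper; the example only shows that the nearest neighbor of one point can differ between two common rescalings.

```python
import numpy as np

# Hypothetical toy data (not from the paper): six points, two attributes.
X = np.array([
    [0.0, 0.0],   # 0: query point A
    [1.8, 0.0],   # 1: point B, differs from A in attribute 1 only
    [0.0, 0.2],   # 2: point C, differs from A in attribute 2 only
    [0.0, 1.0],
    [0.0, 0.9],
    [10.0, 1.0],  # extreme value in attribute 1
])


def min_max_scale(X):
    """Rescale each attribute (column) to the interval [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)


def z_score_scale(X):
    """Rescale each attribute to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def nearest_neighbor(X, i):
    """Index of the point closest to X[i] in Euclidean distance, excluding i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # a point is not its own neighbor
    return int(np.argmin(d))


nn_minmax = nearest_neighbor(min_max_scale(X), 0)
nn_zscore = nearest_neighbor(z_score_scale(X), 0)
print("nearest neighbor of A, min-max:", nn_minmax)  # 1 (point B)
print("nearest neighbor of A, z-score:", nn_zscore)  # 2 (point C)
```

The extreme value in the first attribute inflates that attribute's range more than its standard deviation, so the two rescalings weight the attributes differently. Any detector built on nearest neighbor distances would therefore see a different neighborhood for point A under each scheme.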

Notes

  1. Generally, normalization refers to scaling each attribute to \([0,1]\), while standardization refers to scaling each attribute to \({\mathcal {N}}(0,1)\). For the sake of simplicity, and without loss of generality, we use the term normalization to refer to both re-scalings in this paper.
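For concreteness, the two re-scalings in this note are usually written as below. The notation is an assumption introduced here, not the paper's: \(x_{ij}\) is the value of attribute \(j\) for observation \(i\), and \(\bar{x}_{j}\) and \(s_{j}\) are the mean and standard deviation of attribute \(j\).

```latex
\[
x'_{ij} = \frac{x_{ij} - \min_{k} x_{kj}}{\max_{k} x_{kj} - \min_{k} x_{kj}} \in [0,1],
\qquad
z_{ij} = \frac{x_{ij} - \bar{x}_{j}}{s_{j}}
\]
```

The second transformation gives each attribute zero mean and unit variance; the result follows \({\mathcal {N}}(0,1)\) only if the attribute was normally distributed to begin with.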

Acknowledgements

Funding was provided by the Australian Research Council through the Australian Laureate Fellowship FL140100012, and Linkage Project LP160101885. This research was supported in part by the Monash eResearch Centre and eSolutions-Research Support Services through the MonARCH HPC Cluster.

Author information

Corresponding author

Correspondence to Sevvandi Kandanaarachchi.

Additional information

Responsible editor: Srinivasan Parthasarathy.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (R 10 KB)

Supplementary material 2 (R 3 KB)

About this article

Cite this article

Kandanaarachchi, S., Muñoz, M.A., Hyndman, R.J. et al. On normalization and algorithm selection for unsupervised outlier detection. Data Min Knowl Disc 34, 309–354 (2020). https://doi.org/10.1007/s10618-019-00661-z

  • DOI: https://doi.org/10.1007/s10618-019-00661-z
