On normalization and algorithm selection for unsupervised outlier detection

Abstract

This paper demonstrates that the performance of various outlier detection methods is sensitive to both the characteristics of the dataset, and the data normalization scheme employed. To understand these dependencies, we formally prove that normalization affects the nearest neighbor structure, and density of the dataset; hence, affecting which observations could be considered outliers. Then, we perform an instance space analysis of combinations of normalization and detection methods. Such analysis enables the visualization of the strengths and weaknesses of these combinations. Moreover, we gain insights into which method combination might obtain the best performance for a given dataset.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Notes

  1. 1.

    Generally normalization refers to scaling each attribute to \(\left[ 0,1\right] \) while standardization refers to scaling each attribute to \({\mathcal {N}}\left( 0,1\right) \). For the sake of simplicity, and without loss of generality, we use the term normalization to refer to both re-scalings in this paper.

References

  1. Achtert E, Kriegel H-P, Zimek A (2008) Elki: a software system for evaluation of subspace clustering algorithms. In: International conference on scientific and statistical database management. Springer, pp 580–585

  2. Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: European conference on principles of data mining and knowledge discovery. Springer, pp 15–27

  3. Barnett V, Lewis T (1974) Outliers in statistical data. Wiley, Hoboken

    Google Scholar 

  4. Bates D, Mächler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. J Stat Softw 67(1):1–48

    Article  Google Scholar 

  5. Beygelzimer A, Kakadet S, Langford J, Arya S, Mount D, Li S (2018) FNN: Fast nearest neighbor search algorithms and applications. R package version 1.1.2.2. https://CRAN.R-project.org/package=FNN

  6. Billor N, Hadi AS, Velleman PF (2000) Bacon: blocked adaptive computationally efficient outlier nominators. Comput Stat Data Anal 34(3):279–298

    Article  Google Scholar 

  7. Bischl B, Mersmann O, Trautmann H, Preuß M (2012) Algorithm selection based on exploratory landscape analysis and cost-sensitive learning. In: Proceedings of the 14th annual conference on genetic and evolutionary computation. ACM, pp 313–320

  8. Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2008) Metalearning: applications to data mining. Springer, Berlin

    Google Scholar 

  9. Breheny P, Burchett W (2017) Visualization of regression models using visreg. R J 9(2):56–71

    Article  Google Scholar 

  10. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: ACM sigmod record, vol 29. ACM, pp 93–104

  11. Campos GO, Zimek A, Sander J, Campello RJ, Micenková B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Discov 30(4):891–927

    MathSciNet  Article  Google Scholar 

  12. Craswell N (2009) Precision at n. Springer, Boston, pp 2127–2128

    Google Scholar 

  13. Csardi G, Nepusz T (2006) The igraph software package for complex network research. InterJ Complex Syst 1695(5):1–9

    Google Scholar 

  14. Culberson JC (1998) On the futility of blind search: an algorithmic view of “no free lunch”. Evol Comput 6(2):109–127

    Article  Google Scholar 

  15. Davis J, Goadrich M (2006) The relationship between Precision–Recall and ROC curves. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 233–240

  16. Duong T (2018) ks: Kernel smoothing. R package version 1.11.3. https://CRAN.R-project.org/package=ks

  17. Emmott A, Das S, Dietterich T, Fern A, Wong W-K (2015) A meta-analysis of the anomaly detection problem. ArXiv preprint arXiv:1503.01158

  18. Emmott AF, Das S, Dietterich T, Fern A, Wong W-K (2013) Systematic construction of anomaly detection benchmarks from real data. In: Proceedings of the ACM SIGKDD workshop on outlier detection and description. ACM, pp 16–21

  19. Goix N (2016) How to evaluate the quality of unsupervised anomaly detection algorithms? arXiv preprint arXiv:1607.01152

  20. Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11(4):e0152173

    Article  Google Scholar 

  21. Hahsler M, Piekenbrock M (2018) dbscan: Density based clustering of applications with noise (DBSCAN) and related algorithms. R package version 1.1-3. https://CRAN.R-project.org/package=dbscan

  22. Hautamaki V, Karkkainen I, Franti P (2004) Outlier detection using k-nearest neighbour graph. In: Proceedings of the 17th international conference on pattern recognition, ICPR 2004, vol 3. IEEE, pp 430–433

  23. Hawkins DM (1980) Identification of outliers, vol 11. Springer, Berlin

    Google Scholar 

  24. Ho Y-C, Pepyne DL (2002) Simple explanation of the no-free-lunch theorem and its implications. J Optim Theory Appl 115(3):549–570

    MathSciNet  Article  Google Scholar 

  25. Hothorn T, Bretz F, Westfall P (2008) Simultaneous inference in general parametric models. Biom J 50(3):346–363

    MathSciNet  Article  Google Scholar 

  26. Hubert M, Van der Veeken S (2008) Outlier detection for skewed data. J Chemom 22(3–4):235–246

    Article  Google Scholar 

  27. Igel C, Toussaint M (2005) A no-free-lunch theorem for non-uniform distributions of target functions. J Math Modell Algorithms 3(4):313–322

    MathSciNet  Article  Google Scholar 

  28. Jin W, Tung AK, Han J, Wang W (2006) Ranking outliers using symmetric neighborhood relationship. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 577–593

  29. Kandanaarachchi S (2018) Outselect: algorithm selection for unsupervised outlier detection. R package version 0.0.0.9000. https://github.com/sevvandi/outselect

  30. Kandanaarachchi S, Munoz MA, Smith-Miles K (2019) Instance space analysis for unsupervised outlier detection. In: Proceedings of the 1st workshop on evaluation and experimental design in data mining and machine learning co-located with siam international conference on data mining (SDM 2019), Calgary, Alberta, Canada, May 4th, 2019, pp 32–41. http://ceur-ws.org/Vol-2436/article_4.pdf

  31. Kandanaarachchi S, Muñoz MA, Smith-Miles K, Hyndman R (2019) Datasets for outlier detection. https://monash.figshare.com/articles/Datasets_12338_zip/7705127/4

  32. Kang Y, Hyndman R, Smith-Miles K (2017) Visualising forecasting algorithm performance using time series instance spaces. Int J Forecast 33(2):345–358

    Article  Google Scholar 

  33. Komsta L, Novomestky F (2015) Moments: moments, cumulants, skewness, kurtosis and related tests. R package version 0.14. https://CRAN.R-project.org/package=moments

  34. Kourentzes N (2019) tsutils: time series exploration, modelling and forecasting. R package version 0.9.0. https://CRAN.R-project.org/package=tsutils

  35. Kriegel H-P, Kröger P, Schubert E, Zimek A (2009) LoOP: local outlier probabilities. In: Proceedings of the 18th ACM conference on information and knowledge management. ACM, pp 1649–1652

  36. Kriegel H-P, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 444–452

  37. Latecki LJ, Lazarevic A, Pokrajac D (2007) Outlier detection with kernel density functions. In: International workshop on machine learning and data mining in pattern recognition. Springer, pp 61–75

  38. Leigh C, Alsibai O, Hyndman RJ, Kandanaarachchi S, King OC, McGree JM, Neelamraju C, Strauss J, Talagala PD, Turner RD et al (2019) A framework for automated anomaly detection in high frequency water-quality data from in situ sensors. Sci Total Environ 664:885–898

    Article  Google Scholar 

  39. Leyton-Brown K, Nudelman E, Andrew G, McFadden J, Shoham Y (2003) A portfolio approach to algorithm selection. In: 2003 International joint conference on artificial intelligence (IJCAI), vol 3. pp 1542–1543

  40. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22

    Google Scholar 

  41. Liu FT (2009) Isolationforest: Isolation forest. R package version 0.0-26/r4. https://R-Forge.R-project.org/projects/iforest/

  42. Liu FT, Ting KM, Zhou Z-H (2008) Isolation forest. In: 2008 Eighth IEEE international conference on data mining. IEEE, pp 413–422

  43. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2018) Cluster: cluster analysis basics and extensions. R package version 2.0.7-1

  44. Meschiari S (2015) latex2exp: Use LaTeX expressions in plots. R package version 0.4.0. https://CRAN.R-project.org/package=latex2exp

  45. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2018) e1071: Misc functions of the department of statistics, probability theory group (formerly: E1071), TU Wien. R package version 1.7-0. https://CRAN.R-project.org/package=e1071

  46. Meyer PE (2014) Infotheo: information-theoretic measures. R package version 1.2.0. https://CRAN.R-project.org/package=infotheo

  47. Muñoz MA (2019) Instance space analysis: a toolkit for the assessment of algorithmic power. https://github.com/andremun/InstanceSpace

  48. Muñoz MA, Villanova L, Baatar D, Smith-Miles K (2018) Instance spaces for machine learning classification. Mach Learn 107(1):109–147

    MathSciNet  Article  Google Scholar 

  49. Peng Y, Flach PA, Soares C, Brazdil P (2002) Improved dataset characterisation for meta-learning. In: International conference on discovery science. Springer, pp 141–152

  50. Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-learning by landmarking various learning algorithms. In: International conference on machine learning (ICML), pp 743–750

  51. Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: ACM sigmod record, vol 29. ACM, pp 427–438

  52. Rice J (1976) The algorithm selection problem. In: Advances in computers, vol 15. Elsevier, pp 65–118

  53. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform 12:77

    Article  Google Scholar 

  54. Rousseeuw PJ, Hubert M (2017) Anomaly detection by robust statistics. Wiley Interdiscip Rev Data Min Knowl Discov 8:e1236

    Article  Google Scholar 

  55. Ryan JA, Ulrich JM (2018) quantmod: Quantitative financial modelling framework. R package version 0.4-13. https://CRAN.R-project.org/package=quantmod

  56. Schubert E, Zimek A, Kriegel H-P (2014a) Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the 2014 SIAM international conference on data mining. SIAM, pp 542–550

  57. Schubert E, Zimek A, Kriegel H-P (2014b) ‘Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection’. Data Min Knowl Discov 28(1):190–237

    MathSciNet  Article  Google Scholar 

  58. Smith-Miles K (2019) MATILDA: melbourne algorithm test instance library with data analytics. https://matilda.unimelb.edu.au

  59. Smith-Miles KA (2009) Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput Surv (CSUR) 41(1):6

    Google Scholar 

  60. Smith-Miles K, Baatar D, Wreford B, Lewis R (2014) Towards objective measures of algorithm performance across instance space. Comput Oper Res 45:12–24

    MathSciNet  Article  Google Scholar 

  61. Smith-Miles K, Bowly S (2015) Generating new test instances by evolving in instance space. Comput Oper Res 63:102–113

    MathSciNet  Article  Google Scholar 

  62. Smith-Miles K, Tan TT (2012) Measuring algorithm footprints in instance space. In: 2012 IEEE congress on evolutionary computation. IEEE, pp 3446–3453

  63. Talagala PD, Hyndman RJ, Smith-Miles K, Kandanaarachchi S, Munoz MA (2019) Anomaly detection in streaming nonstationary temporal data. J Comput Graph Stat. https://doi.org/10.1080/10618600.2019.1617160

    Article  Google Scholar 

  64. Tang J, Chen Z, Fu AW-C, Cheung DW (2002) Enhancing effectiveness of outlier detections for low density patterns. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 535–548

  65. Wickham H (2007) Reshaping data with the reshape package. J Stat Softw 21(12):1–20

    Article  Google Scholar 

  66. Wickham H (2016) ggplot2: Elegant graphics for data analysis. Springer, New York. http://ggplot2.org

  67. Wilkinson L (2018) Visualizing big data outliers through distributed aggregation. IEEE Trans Vis Comput Graph 24(1):256–266

    Article  Google Scholar 

  68. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evolut Comput 1(1):67–82

    Article  Google Scholar 

  69. Wolpert DH, Macready WG et al (1995) No free lunch theorems for search. Technical report, SFI-TR-95-02-010, Santa Fe Institute

  70. Zhang E, Zhang Y (2009) Average precision. In: Encyclopedia of database systems. Springer, Berlin, pp 192–193

  71. Zhang K, Hutter M, Jin H (2009) A new local distance-based outlier detection approach for scattered real-world data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 813–822

  72. Zimek A, Schubert E, Kriegel H-P (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Stat Anal Data Min ASA Data Sci J 5(5):363–387

    MathSciNet  Article  Google Scholar 

Download references

Acknowledgements

Funding was provided by the Australian Research Council through the Australian Laureate Fellowship FL140100012, and Linkage Project LP160101885. This research was supported in part by the Monash eResearch Centre and eSolutions-Research Support Services through the MonARCH HPC Cluster.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Sevvandi Kandanaarachchi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Responsible editor: Srinivasan Parthasarathy.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (R 10 KB)

Supplementary material 2 (R 3 KB)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kandanaarachchi, S., Muñoz, M.A., Hyndman, R.J. et al. On normalization and algorithm selection for unsupervised outlier detection. Data Min Knowl Disc 34, 309–354 (2020). https://doi.org/10.1007/s10618-019-00661-z

Download citation

Keywords

  • Unsupervised outlier detection
  • Effect of normalization on outlier detection
  • Algorithm selection problem for outlier detection
  • Instance space analysis
  • Instance space analysis for outlier detection