Dimension-based subspace search for outlier detection

Regular Paper

Abstract

Scientific data are often high-dimensional. In such data, finding outliers is challenging because they are often hidden in subspaces, i.e., lower-dimensional projections of the data. In recent approaches to outlier mining, the actual detection of outliers is decoupled from the search for subspaces likely to contain outliers. However, finding a set of subspaces that contains most or even all outliers of a given data set remains an open problem. While previous proposals use per-subspace measures such as correlation to quantify the quality of subspaces, we explicitly take the relationship between subspaces into account and propose a dimension-based measure of that quality. Based on it, we formalize the notion of an optimal set of subspaces and propose the Greedy Maximum Deviation heuristic to approximate this set. Experiments on comprehensive benchmark data show that our concept is more effective in determining the relevant set of subspaces than approaches that use per-subspace measures.
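To make the decoupling of subspace search and outlier detection concrete, the following is a minimal sketch in Python. It is not the Greedy Maximum Deviation heuristic and does not implement the dimension-based quality measure, neither of which is specified in this abstract. It only assumes that some set of subspaces has already been chosen, scores every object in each subspace with a simple nearest-neighbour distance, and averages the scores. All names (`knn_score`, `project`, `ensemble_scores`) and the toy data are hypothetical and chosen for illustration.

```python
import math

def knn_score(points, k=1):
    """Outlier score per point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def project(data, subspace):
    """Restrict every object to the dimensions listed in `subspace`."""
    return [tuple(row[d] for d in subspace) for row in data]

def ensemble_scores(data, subspaces, k=1):
    """Score objects in each given subspace, then average the scores."""
    per_subspace = [knn_score(project(data, s), k) for s in subspaces]
    return [sum(col) / len(col) for col in zip(*per_subspace)]

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data (assumed): dimensions 0 and 1 are perfectly correlated,
# dimension 2 is unrelated. The last object breaks the correlation,
# yet each of its single coordinates is an ordinary value.
data = [(i / 10, i / 10, (i * 7 % 11) / 10) for i in range(11)]
data.append((0.2, 0.8, 0.5))
outlier = len(data) - 1

scores_1d = knn_score(project(data, (0,)))    # dimension 0 alone
scores_2d = knn_score(project(data, (0, 1)))  # the correlated pair
combined = ensemble_scores(data, [(0, 1), (2,)])

print(argmax(scores_1d) == outlier)  # hidden in the 1-D projection
print(argmax(scores_2d) == outlier)  # visible in the 2-D subspace
print(argmax(combined) == outlier)   # still top-ranked after averaging
```

In this toy setting the deviating object is indistinguishable from the others in dimension 0 alone, but receives the highest score in the subspace spanned by dimensions 0 and 1. Which subspaces to pass to such a detector is exactly the search problem the paper addresses.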

Keywords

Outlier mining, Subspace search, High-dimensional data

Notes

Acknowledgements

This work was supported by the German Research Foundation (DFG) as part of the Research Training Group GRK 2153: Energy Status Data – Informatics Methods for its Collection, Analysis and Exploitation.

Compliance with ethical standards

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Karlsruhe Institute of Technology, Karlsruhe, Germany
