Dimension-based subspace search for outlier detection

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

Scientific data are often high-dimensional. In such data, finding outliers is challenging because they are often hidden in subspaces, i.e., lower-dimensional projections of the data. Recent approaches to outlier mining decouple the actual detection of outliers from the search for subspaces likely to contain outliers. However, finding a set of subspaces that contains most or even all outliers of a given data set remains an open problem. While previous proposals use per-subspace measures such as correlation to quantify the quality of subspaces, we explicitly take the relationship between subspaces into account and propose a dimension-based measure of that quality. Based on it, we formalize the notion of an optimal set of subspaces and propose the Greedy Maximum Deviation heuristic to approximate this set. Experiments on comprehensive benchmark data show that our concept is more effective in determining the relevant set of subspaces than approaches that use per-subspace measures.
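
The decoupling described in the abstract can be illustrated with a minimal sketch: a set of subspaces is given, a detector is run in each projection, and the per-subspace scores are combined. This is not the Greedy Maximum Deviation heuristic and makes no claim about the authors' implementation. The subspace set is chosen by hand, a simple k-nearest-neighbour distance stands in for an arbitrary outlier detector, and the names knn_score, project and ensemble_scores are hypothetical.

```python
import math


def knn_score(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def project(data, subspace):
    """Keep only the dimensions listed in `subspace` (a tuple of indices)."""
    return [tuple(row[d] for d in subspace) for row in data]


def ensemble_scores(data, subspaces, k=2):
    """Run the detector in every subspace and average the scores per object."""
    totals = [0.0] * len(data)
    for subspace in subspaces:
        for i, score in enumerate(knn_score(project(data, subspace), k)):
            totals[i] += score
    return [t / len(subspaces) for t in totals]


# Toy data (an assumption for illustration): dimensions 0 and 1 are perfectly
# correlated, dimension 2 is unrelated. The last object lies within the value
# range of every single dimension but breaks the correlation in subspace (0, 1).
data = [(i / 10, i / 10, (i * 7 % 11) / 10) for i in range(11)]
data.append((0.2, 0.8, 0.5))

# Hand-picked subspace set; a subspace search method would produce this set.
subspaces = [(0, 1), (2,)]
scores = ensemble_scores(data, subspaces)
top = max(range(len(data)), key=scores.__getitem__)
print("highest-scoring object:", top, data[top])
```

In this toy setting the object that violates the correlation is only conspicuous in the projection onto dimensions 0 and 1, which is why the choice of the subspace set matters for the detection step.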

Notes

  1. https://www.ipd.kit.edu/trittenb/gmd/readme.

Acknowledgements

This work was supported by the German Research Foundation (DFG) as part of the Research Training Group GRK 2153: Energy Status Data – Informatics Methods for its Collection, Analysis and Exploitation.

Author information

Corresponding author

Correspondence to Holger Trittenbach.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Cite this article

Trittenbach, H., Böhm, K. Dimension-based subspace search for outlier detection. Int J Data Sci Anal 7, 87–101 (2019). https://doi.org/10.1007/s41060-018-0137-7

  • DOI: https://doi.org/10.1007/s41060-018-0137-7
