L1-Depth Revisited: A Robust Angle-Based Outlier Factor in High-Dimensional Space

  • Ninh Pham
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)

Abstract

Angle-based outlier detection (ABOD) has recently emerged as an effective method for detecting outliers in high dimensions. Instead of examining neighborhoods, as proximity-based concepts do, ABOD uses the broadness of a point's angle spectrum as its outlier factor. Although it is a parameter-free and robust measure in high-dimensional space, the exact ABOD solution has cubic cost \(O(n^3)\) in the data size n and hence cannot be used on large-scale data sets.
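To make the angle-spectrum intuition and its cost concrete, here is a minimal sketch in Python. It is a simplified, unweighted variant: it scores a point by the variance of the cosines of the angles it forms with every pair of other points. The original ABOD factor additionally weights each pair by distance, which is omitted here, and the helper name `angle_variance` is hypothetical.

```python
import math
from itertools import combinations


def angle_variance(points, idx):
    """Variance of cos(angle) seen from points[idx] over all pairs of other points.

    Simplified assumption: no distance weighting, unlike the original ABOD factor.
    A small value means a narrow angle spectrum, which suggests an outlier.
    """
    p = points[idx]
    directions = []
    for j, q in enumerate(points):
        if j == idx:
            continue
        diff = [qi - pi for qi, pi in zip(q, p)]
        norm = math.sqrt(sum(x * x for x in diff))
        if norm > 0.0:  # skip points that coincide with p
            directions.append([x / norm for x in diff])

    # One cosine per pair of other points: O(n^2) work for a single point,
    # hence O(n^3) when every point in the data set is scored.
    cosines = [sum(a * b for a, b in zip(u, v))
               for u, v in combinations(directions, 2)]
    if not cosines:
        return 0.0
    mean = sum(cosines) / len(cosines)
    return sum((c - mean) ** 2 for c in cosines) / len(cosines)


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
    for i in range(len(data)):
        print(data[i], round(angle_variance(data, i), 4))
```

In this toy data set the far-away point (10, 10) sees all other points within a narrow cone, so its score is the smallest.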

In this work we present a conceptual relationship between the ABOD intuition and the L1-depth concept in statistics, one of the earliest methods used for detecting outliers. Building on this relationship, we propose using L1-depth as a variant of angle-based outlier factors, since it requires only quadratic computational time, like proximity-based outlier factors. Empirically, L1-depth is competitive with (and often superior to) proximity-based and other proposed angle-based outlier factors in detecting high-dimensional outliers, in terms of both efficiency and accuracy.
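The quadratic cost can be illustrated with a short Python sketch. It assumes the common simplified form of L1-depth, one minus the norm of the average unit vector pointing from a point to all other points. The handling of coincident points and the helper name `l1_depth` are assumptions for illustration and may differ from the definition used in the paper.

```python
import math


def l1_depth(points, idx):
    """Simplified L1-depth of points[idx]: 1 - ||mean unit vector to the others||.

    Assumption: points that coincide with points[idx] are skipped.
    A depth near 0 suggests an outlier; a depth near 1 suggests a central point.
    """
    p = points[idx]
    total = [0.0] * len(p)
    count = 0
    for j, q in enumerate(points):
        if j == idx:
            continue
        diff = [qi - pi for qi, pi in zip(q, p)]
        norm = math.sqrt(sum(x * x for x in diff))
        if norm == 0.0:
            continue
        for axis, x in enumerate(diff):
            total[axis] += x / norm
        count += 1
    if count == 0:
        return 1.0  # no distinct point to compare against
    mean_norm = math.sqrt(sum(t * t for t in total)) / count
    return max(0.0, 1.0 - mean_norm)  # clamp tiny negative rounding errors


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
    for i in range(len(data)):
        print(data[i], round(l1_depth(data, i), 4))
```

Each call makes a single pass over the data, so scoring all n points takes a number of vector operations quadratic in n, compared with the pairwise loop needed when angles are examined explicitly.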

To avoid the quadratic computational time, we introduce a simple but efficient sampling method named SamDepth for estimating the L1-depth measure. We also present a theoretical analysis that guarantees the reliability of SamDepth. Empirical experiments on many real-world high-dimensional data sets demonstrate that SamDepth with \(\sqrt{n}\) samples often achieves very competitive accuracy and runs several orders of magnitude faster than proximity-based and ABOD competitors. Data related to this paper are available at: https://www.dropbox.com/s/nk7nqmwmdsatizs/Datasets.zip. Code related to this paper is available at: https://github.com/NinhPham/Outlier.
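The sampling idea can be sketched as follows. This is not the authors' implementation: it assumes uniform sampling without replacement of about \(\sqrt{n}\) reference points per query and the simplified depth formula (one minus the norm of the mean unit vector). The helper name `sampled_l1_depth` and its parameters are hypothetical.

```python
import math
import random


def sampled_l1_depth(points, idx, rng, sample_size=None):
    """Estimate the simplified L1-depth of points[idx] from a random sample.

    Assumptions: uniform sampling without replacement; sample_size defaults
    to floor(sqrt(n)); the query point and coincident points are skipped.
    """
    n = len(points)
    if sample_size is None:
        sample_size = max(1, math.isqrt(n))
    k = min(sample_size, n)

    p = points[idx]
    total = [0.0] * len(p)
    count = 0
    for j in rng.sample(range(n), k):
        if j == idx:
            continue
        diff = [qi - pi for qi, pi in zip(points[j], p)]
        norm = math.sqrt(sum(x * x for x in diff))
        if norm == 0.0:
            continue
        for axis, x in enumerate(diff):
            total[axis] += x / norm
        count += 1
    if count == 0:
        return 1.0  # nothing usable in the sample
    mean_norm = math.sqrt(sum(t * t for t in total)) / count
    return max(0.0, 1.0 - mean_norm)  # clamp tiny negative rounding errors


if __name__ == "__main__":
    rng = random.Random(0)
    cluster = [(rng.random(), rng.random()) for _ in range(400)]
    data = cluster + [(10.0, 10.0)]  # last point is a planted outlier
    print("outlier estimate:", sampled_l1_depth(data, 400, rng))
    print("cluster point estimate:", sampled_l1_depth(data, 0, rng))
```

Under these assumptions each estimate touches only about \(\sqrt{n}\) points, so scoring the whole data set needs on the order of \(n\sqrt{n}\) vector operations instead of \(n^2\).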

Notes

Acknowledgments

We would like to thank Rasmus Pagh for useful discussion and comments in the early stage of this work. We thank members of the DABAI project and anonymous reviewers for their constructive comments and suggestions.

Supplementary material

478880_1_En_7_MOESM1_ESM.pdf (160 kb)
Supplementary material 1 (pdf 159 KB)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
