Knowledge and Information Systems, Volume 56, Issue 3, pp 691–715

Subspace histograms for outlier detection in linear time

  • Saket Sathe
  • Charu C. Aggarwal
Regular Paper

Abstract

Outlier detection algorithms are often computationally intensive because of their need to score each point in the data. Even simple distance-based algorithms have quadratic complexity. High-dimensional outlier detection algorithms such as subspace methods are often even more computationally intensive because of their need to explore different subspaces of the data. In this paper, we propose an exceedingly simple subspace outlier detection algorithm, which can be implemented in a few lines of code, whose time complexity is linear in the size of the data set, and whose space requirement is constant. We show that this outlier detection algorithm is much faster than both conventional and high-dimensional algorithms and also provides more accurate results. The approach uses randomized hashing to score data points and has a neat subspace interpretation. We provide a visual representation of this interpretability in terms of outlier sensitivity histograms. Furthermore, the approach can be easily generalized to data streams, where it provides an efficient approach to discover outliers in real time. We present experimental results showing the effectiveness of the approach over other state-of-the-art methods.
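The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea it names: scoring points by how sparsely populated their cells are in randomly chosen, randomly gridded subspaces. The function name `subspace_histogram_scores`, its parameters, the width range, and the averaging of log counts are all assumptions made for illustration, not the authors' method.

```python
import math
import random
from collections import Counter


def subspace_histogram_scores(points, n_components=50, max_dims=2, seed=0):
    """Return one score per point; lower scores suggest outliers.

    Hypothetical sketch: each ensemble component hashes every point to a
    grid cell in a random subspace and records how many points share it.
    """
    rng = random.Random(seed)
    n, d = len(points), len(points[0])

    # Per-dimension minimum and range, used to rescale values into [0, 1].
    lows = [min(p[j] for p in points) for j in range(d)]
    spans = []
    for j in range(d):
        span = max(p[j] for p in points) - lows[j]
        spans.append(span if span > 0 else 1.0)

    totals = [0.0] * n
    for _ in range(n_components):
        # Random subspace, random cell width, random per-dimension shift.
        dims = rng.sample(range(d), rng.randint(1, min(max_dims, d)))
        width = rng.uniform(0.1, 0.5)
        shifts = {j: rng.uniform(0.0, width) for j in dims}

        # Hash each point to the integer coordinates of its grid cell.
        keys = [
            tuple(
                math.floor(((p[j] - lows[j]) / spans[j] + shifts[j]) / width)
                for j in dims
            )
            for p in points
        ]
        counts = Counter(keys)

        # Accumulate the log of each point's cell occupancy.
        for i, key in enumerate(keys):
            totals[i] += math.log(counts[key])

    return [t / n_components for t in totals]
```

Each component makes a single pass over the data, so the running time of this sketch grows linearly with the number of points for a fixed number of components. It stores cell counts in an ordinary dictionary, so it does not attempt to reproduce the constant-space property claimed in the abstract.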

Keywords

Subspace outlier detection; High-dimensional outlier detection; Outlier ensembles

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Yorktown Heights, USA
