Fast and Scalable Outlier Detection with Metric Access Methods

  • Altamir Gomes Bispo Junior
  • Robson Leonardo Ferreira Cordeiro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11537)

Abstract

It is well known that existing theoretical models for outlier detection make assumptions that may not reflect the true nature of outliers in every real application. With that in mind, this paper describes an empirical study of unsupervised outlier detection using 8 state-of-the-art algorithms and 8 datasets that cover a variety of high-impact real-world tasks, such as spotting cyberattacks, clinical pathologies and abnormalities in nature. We present a detailed account of the results, pointing out the strengths and weaknesses of each technique from the application specialist's point of view, a shift from the designer-based point of view that is commonly adopted. Interestingly, many of the techniques had infeasibly high runtime requirements or failed to spot what the specialists consider to be outliers in their own data. To tackle this issue, we propose MetricABOD: a novel angle-based outlier detection algorithm that makes the analysis up to thousands of times faster while being, on average, 26% more accurate than the most accurate related work. This improvement is essential to enable outlier detection in many real-world applications for which the existing methods lead to unexpected results or infeasible runtime requirements. Finally, we studied two real collections of text data to show that MetricABOD also works for adimensional, purely metric data.
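To make the phrase "angle-based outlier detection" concrete, the following is a minimal sketch of the general idea only, not of MetricABOD itself: a point far from the rest of the data sees all other points within a narrow range of directions, so the variance of its pairwise angles is small. The function names `angle_score` and `rank_outliers` are hypothetical, and the score is a simplified, unweighted form that assumes vector data and scales each cosine by the two distances involved.

```python
from itertools import combinations
from statistics import pvariance


def angle_score(p, others):
    """Variance of distance-scaled cosines seen from p over all pairs in others.

    Illustrative simplification: each term is <pa, pb> / (|pa|^2 * |pb|^2).
    A low variance suggests that p lies outside the bulk of the data.
    """
    values = []
    for a, b in combinations(others, 2):
        pa = [x - y for x, y in zip(a, p)]
        pb = [x - y for x, y in zip(b, p)]
        dot = sum(x * y for x, y in zip(pa, pb))
        na2 = sum(x * x for x in pa)
        nb2 = sum(x * x for x in pb)
        if na2 == 0 or nb2 == 0:
            continue  # skip exact duplicates of p
        values.append(dot / (na2 * nb2))
    return pvariance(values)


def rank_outliers(points):
    """Return point indices ordered from most to least outlying (lowest score first)."""
    scores = [
        angle_score(p, points[:i] + points[i + 1:])
        for i, p in enumerate(points)
    ]
    return sorted(range(len(points)), key=lambda i: scores[i])


# Small cluster near the origin plus one distant point (index 5).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
ranking = rank_outliers(points)
```

This naive version examines every pair of points for every candidate, so its cost grows cubically with the number of points, which illustrates the kind of runtime burden the abstract refers to.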

Keywords

  • Applied computational sciences
  • Complex data
  • Data Mining
  • Unsupervised outlier detection
  • Metric Access Methods

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Altamir Gomes Bispo Junior (1)
  • Robson Leonardo Ferreira Cordeiro (1)
  1. University of São Paulo, São Carlos, São Paulo, Brazil
