Good and Bad Neighborhood Approximations for Outlier Detection Ensembles

  • Evelyn Kirner
  • Erich Schubert
  • Arthur Zimek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10609)

Abstract

Outlier detection methods have used approximate neighborhoods in filter-refinement approaches. Outlier detection ensembles have used artificially obfuscated neighborhoods to achieve diverse ensemble members. Here we argue that outlier detection models could be based on approximate neighborhoods in the first place, thus gaining in both efficiency and effectiveness. It depends, however, on the type of approximation, as only some seem beneficial for the task of outlier detection, while no (large) benefit can be seen for others. In particular, we argue that space-filling curves are beneficial approximations, as they have a stronger tendency to underestimate the density in sparse regions than in dense regions. In comparison, LSH and NN-Descent do not have such a tendency and do not seem to be beneficial for the construction of outlier detection ensembles.
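
To make the idea of a space-filling-curve neighborhood approximation concrete, here is a minimal sketch, not the authors' implementation. It assumes a Z-order (Morton) curve as the space-filling curve, and it takes neighbor candidates from a fixed window around each point's position in curve order. The function names and the `bits` and `window` parameters are hypothetical and chosen for illustration only. Because candidates form a subset of the data, the approximate k-nearest-neighbor distance can never be smaller than the exact one, so this kind of approximation can only underestimate density. Whether the underestimation is stronger in sparse regions, as the abstract argues, is not something this sketch demonstrates.

```python
import math
import random


def morton_key(cell, bits):
    """Interleave the bits of integer grid coordinates into one Z-order key."""
    key = 0
    d = len(cell)
    for b in range(bits):
        for i, c in enumerate(cell):
            key |= ((c >> b) & 1) << (b * d + i)
    return key


def quantize(points, bits):
    """Map real-valued points onto an integer grid with 2**bits cells per axis."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    top = (1 << bits) - 1
    return [
        tuple(
            int((p[i] - lo[i]) / (hi[i] - lo[i]) * top) if hi[i] > lo[i] else 0
            for i in range(d)
        )
        for p in points
    ]


def approx_knn_dist(points, k, window, bits=10):
    """Approximate k-NN distance using only neighbors in curve order.

    Assumes window >= k and len(points) > window, so that every point
    has at least k candidates.
    """
    cells = quantize(points, bits)
    order = sorted(range(len(points)), key=lambda j: morton_key(cells[j], bits))
    result = [0.0] * len(points)
    for pos, j in enumerate(order):
        start, stop = max(0, pos - window), min(len(order), pos + window + 1)
        dists = sorted(
            math.dist(points[j], points[order[q]])
            for q in range(start, stop)
            if q != pos
        )
        result[j] = dists[k - 1]
    return result


def exact_knn_dist(points, k):
    """Exact k-NN distance by brute force, for comparison."""
    result = []
    for j, p in enumerate(points):
        dists = sorted(math.dist(p, q) for i, q in enumerate(points) if i != j)
        result.append(dists[k - 1])
    return result


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(200)]
    approx = approx_knn_dist(data, k=5, window=10)
    exact = exact_knn_dist(data, k=5)
    # Ratios >= 1 mean the approximation sees the point as lying in a
    # sparser region than it really does.
    ratios = [a / e for a, e in zip(approx, exact)]
    print("mean ratio approx/exact:", sum(ratios) / len(ratios))
```

In an ensemble setting, one would presumably repeat this with differently shifted or projected copies of the data so that each member sees a different curve order; that step is left out here.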

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Ludwig-Maximilians-Universität München, München, Germany
  2. Heidelberg University, Heidelberg, Germany
  3. University of Southern Denmark, Odense M, Denmark