Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection

A Remedy Against the Curse of Dimensionality?
  • Erich Schubert
  • Michael Gertz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10609)

Abstract

Analyzing high-dimensional data poses many challenges due to the “curse of dimensionality”. Not all high-dimensional data exhibit these characteristics, however, because many data sets have correlations, which led to the notion of intrinsic dimensionality. Intrinsic dimensionality describes the local behavior of data on a low-dimensional manifold within the higher-dimensional space.

We discuss this effect, and describe a surprisingly simple modification that allows us to reduce the local intrinsic dimensionality of individual points. While this is unlikely to “cure” all problems associated with high dimensionality, we show its theoretical impact on idealized distributions and how to practically incorporate it into new, more robust, algorithms. To demonstrate the effect of this adjustment, we introduce the novel Intrinsic Stochastic Outlier Score (ISOS), and we propose a modification of the popular t-Stochastic Neighbor Embedding (t-SNE) visualization technique for intrinsic dimensionality, intrinsic t-Stochastic Neighbor Embedding (it-SNE).
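The local intrinsic dimensionality the abstract refers to is commonly estimated from a point's nearest-neighbor distances. A minimal sketch of the standard Hill/maximum-likelihood estimator from the LID literature follows; the function name and the synthetic distances are illustrative and not taken from this paper, and the paper's own modification is not reproduced here.

```python
import math

def local_intrinsic_dimensionality(dists):
    """Hill/MLE estimate of local intrinsic dimensionality,
    given a point's distances to its k nearest neighbors."""
    r = sorted(d for d in dists if d > 0)
    if len(r) < 2:
        raise ValueError("need at least two positive neighbor distances")
    w = r[-1]  # distance to the k-th (farthest) neighbor
    # ID = -k / sum_{i=1..k} ln(r_i / r_k); the i = k term is zero
    s = sum(math.log(ri / w) for ri in r)
    return -len(r) / s

# Synthetic k-NN distances following the pattern of a 2-dimensional
# uniform neighborhood, r_i proportional to (i/k)^(1/2), with k = 100:
dists = [(i / 100) ** 0.5 for i in range(1, 101)]
print(local_intrinsic_dimensionality(dists))  # ≈ 2.07, converging to 2 as k grows
```

On distances drawn from a d-dimensional uniform neighborhood the estimate approaches d as k grows, which is what makes it usable as a per-point "local dimensionality" signal.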


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Heidelberg University, Heidelberg, Germany