The Hubness Phenomenon in High-Dimensional Spaces

  • Priya Mani
  • Marilyn Vazquez
  • Jessica Ruth Metcalf-Burton
  • Carlotta Domeniconi
  • Hillary Fairbanks
  • Gülce Bal
  • Elizabeth Beer
  • Sibel Tari
Chapter
Part of the Association for Women in Mathematics Series book series (AWMS, volume 17)

Abstract

High-dimensional data analysis is often negatively affected by the curse of dimensionality. In high-dimensional spaces, data become extremely sparse and distances between points become nearly indistinguishable. As a consequence, reliable density estimates and meaningful distance-based similarity measures cannot be obtained. This issue is particularly prevalent in clustering, which is commonly employed in exploratory data analysis. Another challenge for clustering high-dimensional data is that the data often lie in subspaces consisting of combinations of dimensions, with different subspaces being relevant for different clusters. The hubness phenomenon is a recently discovered aspect of high-dimensional spaces: in intrinsically high-dimensional data, the distribution of neighbor occurrences (how often each point appears among the nearest neighbors of other points) becomes skewed, with a few points, the hubs, having high occurrence counts. Hubness becomes more pronounced with increasing dimensionality. Hubs are also known to exhibit useful clustering properties and could be leveraged to mitigate the challenges of high-dimensional data analysis. In this chapter, we identify new geometric relationships between hubness, data density, and the data distance distribution, as well as between hubness, subspaces, and the intrinsic dimensionality of data. In addition, we formulate several potential research directions for leveraging hubness in clustering and subspace estimation.
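The notion of skewed neighbor occurrences described in the abstract can be illustrated with a minimal sketch. The code below is not taken from the chapter: the function names `k_occurrences` and `skewness`, the use of Euclidean distance, the i.i.d. Gaussian samples, and the parameter values (1000 points, k = 10, dimensions 3 and 100) are all assumptions chosen for illustration. It counts how often each point appears among the k nearest neighbors of the other points and summarizes the asymmetry of those counts with the standardized third moment.

```python
import numpy as np

def k_occurrences(X, k):
    """Count how often each row of X is among the k nearest neighbors of other rows.

    Assumes Euclidean distance and k < number of points (illustrative choice).
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Column indices of the k smallest distances in each row (order within the k is arbitrary)
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=n)

def skewness(counts):
    """Standardized third moment of the occurrence counts."""
    c = np.asarray(counts, dtype=float)
    std = c.std()
    if std == 0:
        return 0.0
    return float(np.mean((c - c.mean()) ** 3) / std ** 3)

if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed, hypothetical experiment
    for dim in (3, 100):
        X = rng.standard_normal((1000, dim))
        counts = k_occurrences(X, k=10)
        print(f"dim={dim}: skewness={skewness(counts):.2f}, max count={counts.max()}")
```

Every point contributes exactly k neighbor slots, so the mean occurrence count is always k; what changes with dimensionality is how unevenly those slots are distributed. Under the assumptions above, one would expect the higher-dimensional sample to show a larger skewness and a larger maximum count, in line with the behavior the abstract describes, though the exact values depend on the sample and on k.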

Copyright information

© The Author(s) and the Association for Women in Mathematics 2019

Authors and Affiliations

  • Priya Mani (1)
  • Marilyn Vazquez (1)
  • Jessica Ruth Metcalf-Burton (2)
  • Carlotta Domeniconi (1), corresponding author
  • Hillary Fairbanks (3)
  • Gülce Bal (4)
  • Elizabeth Beer (5)
  • Sibel Tari (4)

  1. George Mason University, Fairfax, USA
  2. National Security Agency, Fort Meade, USA
  3. University of Colorado Boulder, Boulder, USA
  4. Middle East Technical University, Cankaya, Turkey
  5. Center for Computing Sciences, Institute for Defense Analyses, Alexandria, USA
