The Hubness Phenomenon: Fact or Artifact?

  • Thomas Low
  • Christian Borgelt
  • Sebastian Stober
  • Andreas Nürnberger
Part of the Studies in Fuzziness and Soft Computing book series (STUDFUZZ, volume 285)

Abstract

The hubness phenomenon, as recently described, is the observation that as the dimensionality of a data set increases, the distribution of the number of times a data point occurs among the k nearest neighbors of other data points becomes increasingly skewed to the right. As a consequence, so-called hubs emerge, that is, data points that appear in the k-nearest-neighbor lists of other data points much more often than others do. In this paper we challenge the hypothesis that the hubness phenomenon is an effect of the dimensionality of the data set and provide evidence that it is rather a boundary effect or, more generally, an effect of a density gradient. As such, it may be seen as an artifact of the process used to generate the data with which the phenomenon is demonstrated. We report experiments showing that the hubness phenomenon need not occur in high-dimensional data and can be made to occur in low-dimensional data.
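The quantity the abstract refers to, the number of times each point occurs among the k nearest neighbors of other points, can be computed directly. The following is a minimal sketch and not the authors' experimental setup: the function names `k_occurrences` and `skewness`, the choice of uniformly distributed points in the unit hypercube, the sample size, and the use of the standardized third moment as the skewness measure are all assumptions made for illustration.

```python
import numpy as np


def k_occurrences(points, k):
    """Count how often each point appears among the k nearest
    neighbors (Euclidean distance) of the other points."""
    n = len(points)
    # Squared pairwise distances via |a|^2 + |b|^2 - 2 a.b
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Column indices of the k smallest distances in each row
    # (their order within the k is irrelevant for counting).
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=n)


def skewness(x):
    """Standardized third moment (assumed skewness measure)."""
    x = np.asarray(x, dtype=float)
    c = x - x.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5


# Illustrative run: uniform data in the unit hypercube (an assumption).
rng = np.random.default_rng(0)
for dim in (3, 50):
    pts = rng.uniform(size=(1000, dim))
    counts = k_occurrences(pts, k=5)
    print(dim, counts.max(), skewness(counts))
```

Since every point contributes exactly k neighbor entries, the counts always average to k; only the shape of their distribution changes. A bounded sampling region such as the hypercube used here has a boundary, so a sketch like this cannot by itself separate a dimensionality effect from the boundary or density-gradient effect that the abstract argues for.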

Copyright information

© Springer-Verlag GmbH Berlin Heidelberg 2013

Authors and Affiliations

  • Thomas Low (1)
  • Christian Borgelt (2)
  • Sebastian Stober (1)
  • Andreas Nürnberger (1)

  1. Data and Knowledge Engineering Group, Otto-von-Guericke-University of Magdeburg, Magdeburg, Germany
  2. European Centre for Soft Computing, Mieres, Spain
