Statistics and Computing, Volume 22, Issue 2, pp 375–385

Non-parametric detection of meaningless distances in high dimensional data


DOI: 10.1007/s11222-011-9229-0

Cite this article as:
Kabán, A. Stat Comput (2012) 22: 375. doi:10.1007/s11222-011-9229-0


Distance concentration is the phenomenon that, under certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high-dimensional data processing, analysis, retrieval, and indexing, all of which rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite-dimensional, and hence the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite-dimensional case and in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test and detect problematic cases more rigorously than is currently possible, and we demonstrate the working of this approach on both synthetic data and ten real-world data sets from different domains.
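
The phenomenon described above can be observed numerically. The following is a minimal sketch, not the test proposed in the article: it assumes synthetic i.i.d. uniform data and Euclidean distance, and the function names `relative_contrast` and `relative_variance` are illustrative choices. It estimates, from a sample, the contrast between the farthest and nearest point and the relative variance of the distances, both of which shrink as the dimensionality grows in this setting.

```python
import numpy as np


def relative_contrast(points, query):
    """(farthest - nearest) / nearest Euclidean distance from query to points."""
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()


def relative_variance(points, query):
    """Sample estimate of Var[d] / E[d]^2 for the distances d from query to points."""
    d = np.linalg.norm(points - query, axis=1)
    return d.var() / d.mean() ** 2


# Assumption: i.i.d. uniform coordinates, chosen only for illustration.
rng = np.random.default_rng(0)
results = {}
for dim in (2, 20, 200, 2000):
    points = rng.uniform(size=(500, dim))
    query = rng.uniform(size=dim)
    results[dim] = (
        relative_contrast(points, query),
        relative_variance(points, query),
    )
    print(dim, results[dim])
```

For data whose coordinates are strongly dependent or that lie near a low-dimensional structure, these quantities need not shrink in the same way, which is why a sample-based assessment is useful.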


High dimensional data; Curse of dimensionality; Distance concentration; Nearest neighbour; Chebyshev bound; Statistical test

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. School of Computer Science, The University of Birmingham, Edgbaston, UK
