Abstract
Distance concentration is the phenomenon that, under certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high-dimensional data processing, analysis, retrieval, and indexing, all of which rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite-dimensional, and hence the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite-dimensional case and in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test for and detect problematic cases more rigorously than is currently possible, and we demonstrate the working of this approach on both synthetic data and ten real-world data sets from different domains.
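As a rough illustration of the phenomenon itself, and not of the bounds derived in the article, the following sketch estimates two simple sample statistics of the distances from a query point: the relative contrast (d_max - d_min) / d_min and the relative variance Var(d) / E(d)^2. It assumes i.i.d. uniform coordinates and the Euclidean distance; the function name `distance_stats`, the sample size, and the chosen dimensions are illustrative assumptions rather than anything specified in the source.

```python
import math
import random
import statistics


def distance_stats(dim, n_points=200, seed=0):
    """Return (relative contrast, relative variance) of the Euclidean
    distances from one random query point to n_points random points.

    Assumption: coordinates are i.i.d. uniform on [0, 1]. This is a toy
    setting in which distances are known to concentrate.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))

    d_min, d_max = min(dists), max(dists)
    # Contrast between the farthest and the nearest neighbour.
    relative_contrast = (d_max - d_min) / d_min
    # Sample variance of the distances, scaled by the squared sample mean.
    relative_variance = statistics.pvariance(dists) / statistics.fmean(dists) ** 2
    return relative_contrast, relative_variance


for dim in (2, 20, 200, 2000):
    rc, rv = distance_stats(dim)
    print(f"dim={dim:5d}  relative contrast={rc:8.3f}  relative variance={rv:.5f}")
```

In this toy setting both statistics shrink as the dimension grows. Such point estimates say nothing about how reliable the conclusion is for a finite sample from an unknown distribution, which is the gap the tail bounds described in the abstract are meant to address.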
About this article
Cite this article
Kabán, A. Non-parametric detection of meaningless distances in high dimensional data. Stat Comput 22, 375–385 (2012). https://doi.org/10.1007/s11222-011-9229-0
DOI: https://doi.org/10.1007/s11222-011-9229-0