Non-parametric detection of meaningless distances in high dimensional data

Kabán, Ata

doi:10.1007/s11222-011-9229-0

Published: 06 April 2011

Volume 22, pages 375–385, (2012)

Statistics and Computing

Ata Kabán¹

Abstract

Distance concentration is the phenomenon that, under certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high dimensional data processing, analysis, retrieval, and indexing, which all rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite dimensional, and hence the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite dimensional case in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test and detect problematic cases more rigorously than is currently possible, and we demonstrate the working of this approach on both synthetic data and ten real-world data sets from different domains.
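The phenomenon described in the abstract can be observed numerically. The following is a minimal illustrative sketch, not the test proposed in the article: it assumes i.i.d. uniform data on the unit cube and the Euclidean distance, and the function names `sample_distances`, `relative_contrast` and `relative_variance` are hypothetical names chosen for this example. It draws a random query and a sample of points, then reports the relative contrast (d_max - d_min) / d_min and the relative variance Var(d) / E[d]^2 of the query-to-point distances for several dimensionalities.

```python
import math
import random
import statistics


def sample_distances(dim, n_points=200, seed=0):
    """Euclidean distances from one random query to n_points random points.

    Assumption for illustration: all coordinates are i.i.d. uniform on [0, 1].
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return dists


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min: small values mean nearest and farthest
    neighbours are almost equally far from the query."""
    dists = sample_distances(dim, n_points, seed)
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def relative_variance(dim, n_points=200, seed=0):
    """Sample estimate of Var(d) / E[d]^2 for the query-to-point distances."""
    dists = sample_distances(dim, n_points, seed)
    return statistics.pvariance(dists) / statistics.fmean(dists) ** 2


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, relative_contrast(dim), relative_variance(dim))
```

Under these assumptions both quantities shrink as the dimensionality grows, which is the sense in which nearest and farthest neighbours become hard to tell apart. The values are sample statistics only and carry none of the finite-dimensional guarantees that the article derives.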

Author information

Authors and Affiliations

School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
Ata Kabán

Corresponding author

Correspondence to Ata Kabán.

About this article

Cite this article

Kabán, A. Non-parametric detection of meaningless distances in high dimensional data. Stat Comput 22, 375–385 (2012). https://doi.org/10.1007/s11222-011-9229-0

Received: 30 June 2010
Accepted: 10 January 2011
Published: 06 April 2011
Issue Date: March 2012
DOI: https://doi.org/10.1007/s11222-011-9229-0
