On the Surprising Behavior of Distance Metrics in High Dimensional Space
In recent years, the effects of the curse of dimensionality have been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data become sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance, or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics used to measure the similarity between objects. We specifically examine the behavior of the commonly used Lk norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. This means, for example, that the Manhattan distance metric (L1 norm) is consistently preferable to the Euclidean distance metric (L2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the Lk norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results from both the theoretical and empirical perspectives. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
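The Lk distance discussed in the abstract, and its fractional extension, can be sketched as follows. This is a minimal illustration (not the paper's experimental setup): the function names, data sizes, and use of uniform random data are our own assumptions, chosen only to show how the relative contrast between nearest and farthest neighbor can be measured for different values of k.

```python
import numpy as np

def lk_distance(x, y, k):
    """L_k distance: (sum_i |x_i - y_i|^k)^(1/k).

    For k >= 1 this is the usual Minkowski norm (k=1 Manhattan, k=2
    Euclidean). For 0 < k < 1 it is the fractional distance metric
    discussed above; note that with the 1/k root it no longer satisfies
    the triangle inequality, so it is not a metric in the strict sense.
    """
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

# Hypothetical illustration: relative contrast (Dmax - Dmin) / Dmin
# between a random query point and uniform random data, for several k.
rng = np.random.default_rng(0)
dim, n = 100, 1000            # assumed dimensionality and sample size
data = rng.random((n, dim))
query = rng.random(dim)

for k in (0.5, 1, 2):
    d = np.array([lk_distance(query, p, k) for p in data])
    contrast = (d.max() - d.min()) / d.min()
    print(f"k={k}: relative contrast = {contrast:.3f}")
```

On data like this, smaller values of k tend to yield a larger relative contrast, which is the sense in which lower (and fractional) norms are "more meaningful" in high dimensions.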