On the Surprising Behavior of Distance Metrics in High Dimensional Space
In recent years, the effects of the curse of dimensionality have been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data become sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance, or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics used to measure the similarity between objects. We specifically examine the behavior of the commonly used Lk norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. This means, for example, that the Manhattan distance metric (L1 norm) is consistently preferable to the Euclidean distance metric (L2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the Lk norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results from both the theoretical and empirical perspectives. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
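The Lk distance discussed in the abstract, and its fractional extension, can be sketched as follows. This is a minimal illustration (not the paper's experimental setup): the function names, data sizes, and use of uniform random data are our own assumptions, chosen only to show how the relative contrast between nearest and farthest neighbor can be measured for different values of k.

```python
import numpy as np

def lk_distance(x, y, k):
    """L_k distance: (sum_i |x_i - y_i|^k)^(1/k).

    For k >= 1 this is the usual Minkowski norm (k=1 Manhattan, k=2
    Euclidean). For 0 < k < 1 it is the fractional distance metric
    discussed above; note that with the 1/k root it no longer satisfies
    the triangle inequality, so it is not a metric in the strict sense.
    """
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

# Hypothetical illustration: relative contrast (Dmax - Dmin) / Dmin
# between a random query point and uniform random data, for several k.
rng = np.random.default_rng(0)
dim, n = 100, 1000            # assumed dimensionality and sample size
data = rng.random((n, dim))
query = rng.random(dim)

for k in (0.5, 1, 2):
    d = np.array([lk_distance(query, p, k) for p in data])
    contrast = (d.max() - d.min()) / d.min()
    print(f"k={k}: relative contrast = {contrast:.3f}")
```

On data like this, smaller values of k tend to yield a larger relative contrast, which is the sense in which lower (and fractional) norms are "more meaningful" in high dimensions.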