When Is “Nearest Neighbor” Meaningful?

  • Kevin Beyer
  • Jonathan Goldstein
  • Raghu Ramakrishnan
  • Uri Shaft
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1540)

Abstract

We explore the effect of dimensionality on the “nearest neighbor” problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10–15 dimensions.

These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10–15) dimensionality!

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Agrawal, R., Faloutsos, C., Swami, A.: Efficient Similarity Search in Sequence Databases. In Proc. 4th Inter. Conf. on FODO (1993) 69–84Google Scholar
  2. 2.
    Altschul, S.F., Gish, W., Miller, W., Myers, E., Lipman, D.J.: Basic Local Alignment Search Tool. In Journal of Molecular Biology, Vol. 215 (1990) 403–410Google Scholar
  3. 3.
    Ang, Y.H., Li, Z., Ong, S.H.: Image retrieval based on multidimensional feature properties. In SPIE, Vol. 2420 (1995) 47–57CrossRefGoogle Scholar
  4. 4.
    Arya, S.: Nearest Neighbor Searching and Applications. Ph.D. thesis, Univ. of Maryland at College Park (1995)Google Scholar
  5. 5.
    Arya, S., Mount, D.M., Narayan, O.: Accounting for Boundary Effects in Nearest Neighbors Searching. In Proc. 11th ACM Symposium on Computational Geometry (1995) 336–344Google Scholar
  6. 6.
    Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.: An Optimal Algorithm for Nearest Neighbor Searching. In Proc. 5th ACM SIAM Symposium on Discrete Algorithms (1994) 573–582Google Scholar
  7. 7.
    Bellman, R.E.: Adaptive Control Processes. Princeton University Press (1961)Google Scholar
  8. 8.
    Belussi, A., Faloutsos, C.: Estimating the Selectivity of Spatial Queries Using the ‘Correlation’ Fractal Dimension. In Proc. VLDB (1995) 299–310Google Scholar
  9. 9.
    Bentley, J.L., Weide, B.W., Yao, A.C.: Optimal Expected-time Algorithms for Closest Point Problem”, In ACM Transactions on Mathematical Software, Vol. 6,No. 4 (1980) 563–580MATHCrossRefMathSciNetGoogle Scholar
  10. 10.
    Berchtold, S., Böhm, C., Braunmüller, B., Keim, D.A., Kriegel, H.-P.: Fast Parallel Similarity Search in Multimedia Databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data (1997) 1–12Google Scholar
  11. 11.
    Berchtold, S., Böhm, C.,, B., Keim, D.A., Kriegel H.-P.: A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space. In Proc. 16th ACM SIGACTSIGMOD-SIGART Symposium on PODS (1997) 78–86Google Scholar
  12. 12.
    Bern, M.: Approximate Closest Point Queries in High Dimensions. In Information Processing Letters, Vol. 45 (1993) 95–99MATHCrossRefMathSciNetGoogle Scholar
  13. 13.
    Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When Is Nearest Neighbors Meaningful? Technical Report No. TR1377, Computer Sciences Dept., Univ. of Wisconsin-Madison, June 1998Google Scholar
  14. 14.
    Bozkaya, T., Ozsoyoglu, M.: Distance-Based Indexing for High-Dimensional Metric Spaces. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on PODS (1997) 357–368Google Scholar
  15. 15.
    Faloutsos, C., et al: Efficient and Effective Querying by Image Content. In Journal of Intelligent Information Systems, Vol. 3,No. 3 (1994) 231–262CrossRefGoogle Scholar
  16. 16.
    Faloutsos, C., Gaede, V.: Analysis of n-Dimensional Quadtrees Using the Housdorff Fractal Dimension. In Proc. ACM SIGMOD Int. Conf. of the Management of Data (1996)Google Scholar
  17. 17.
    Faloutsos, C., Kamel, I.: Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symposium on PODS 1994 4–13Google Scholar
  18. 18.
    Fayyad, U.M., Smyth, P.: Automated Analysis and Exploration of Image Databases: Results, Progress and Challenges. In Journal of intelligent information systems, Vol. 4,No. 1 (1995) 7–25CrossRefGoogle Scholar
  19. 19.
    Katayama, N., Satoh, S.: The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on PODS (1997) 369–380Google Scholar
  20. 20.
    Lin, K.-I., Jagadish, H.V., Faloutsos, C.: The TV-Tree: An Index Structure for High-Dimensional Data. In VLDB Journal, Vol. 3,No. 4 (1994) 517–542CrossRefGoogle Scholar
  21. 21.
    Manjunath, B.S., Ma, W.Y.: Texture Features for Browsing and Retrieval of Image Data. In IEEE Trans. on Pattern Analysis and Machine Learning, Vol. 18,No. 8 (1996) 837–842CrossRefGoogle Scholar
  22. 22.
    Mehrotra, R., Gary, J.E.: Feature-Based Retrieval of Similar Shapes. In 9th Data Engineering Conference (1992) 108–115Google Scholar
  23. 23.
    Murase, H., Nayar, S.K.: Visual Learning and Recognition of 3D Objects from Appearance. In Int. J. of Computer Vision, Vol. 14,No. 1 (1995) 5–24CrossRefGoogle Scholar
  24. 24.
    Nene, S.A., Nayar, S.K.: A Simple Algorithm for Nearest Neighbor Search in High Dimensions. In IEEE Trans. on Pattern Analysis and Machine Learning, Vol. 18,No. 8 (1996) 989–1003Google Scholar
  25. 25.
    Pentland, A., Picard, R.W., Scalroff, S.: Photobook: Tools for Content Based Manipulation of Image Databases. In SPIE Vol. 2185 (1994) 34–47CrossRefGoogle Scholar
  26. 26.
    Scott, D.W.: Multivariate Density Estimation. Wiley Interscience, Chapter 2 (1992)Google Scholar
  27. 27.
    Shaft, U., Goldstein, J., Beyer, K.: Nearest Neighbors Query Performance for Unstable Distributions. Technical Report No. TR1388, Computer Sciences Dept., Univ. of Wisconsin-Madison, October 1998Google Scholar
  28. 28.
    Swain, M.J., Ballard D.H.: Color Indexing. In Inter. Journal of Computer Vision, Vol. 7,No. 1 (1991) 11–32CrossRefGoogle Scholar
  29. 29.
    Swets, D.L., Weng, J.: Using Discriminant Eigenfeatures for Image Retrieval. In IEEE Trans. on Pattern Analysis and Machine Learning, Vol. 18,No. 8 (1996) 831–836CrossRefGoogle Scholar
  30. 30.
    Taubin, G., Cooper, D.B.: Recognition and Positioning of Rigid Objects Using Algebraic Moment Invariants. In SPIE, Vol. 1570 (1991) 318–327Google Scholar
  31. 31.
    White, D.A., Jain, R.: Similarity Indexing with the SS-Tree. In ICDE (1996) 516–523Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Kevin Beyer
    • 1
  • Jonathan Goldstein
    • 1
  • Raghu Ramakrishnan
    • 1
  • Uri Shaft
    • 1
  1. 1.CS Dept.University of Wisconsin-MadisonMadison

Personalised recommendations