Approximate k-Closest-Pairs in Large High-Dimensional Data Sets

  • Fabrizio Angiulli
  • Clara Pizzuti

Abstract

An approximate algorithm that efficiently solves the k-Closest-Pairs problem on large high-dimensional data sets is presented. For a suitable choice of the input parameters, the algorithm runs in \(\mathcal{O}(d^{2}nk)\) time, where d is the dimensionality and n is the number of points of the input data set, and requires space linear in the input size. It performs at most d+1 iterations. At each iteration, a shifted version of the data set is scanned sequentially in the order induced on it by the Hilbert space-filling curve, and points whose contribution to the solution has already been analyzed are detected and eliminated. The pruning is lossless: the remaining points, together with the approximate solution found, can be used to compute the exact solution. If the data set is entirely pruned, the algorithm returns the exact solution. We prove that the pruning ability of the algorithm is related to the nearest-neighbor distance distribution of the data set, and we show that there exists a class of data sets for which the method, augmented with a final step that applies an exact method to the reduced data set, calculates the exact solution with the same time requirements.
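The scan-and-shift idea described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it substitutes the simpler Z-order (Morton) curve for the Hilbert curve, omits the pruning step entirely, and assumes points lie in the unit hypercube and that iteration j shifts every coordinate by j/(d+1). The names `morton_key`, `minkowski`, `approx_k_closest_pairs` and the `window` parameter are hypothetical, introduced only for illustration.

```python
import heapq


def morton_key(point, bits=10):
    """Z-order key: interleave the bits of the quantized coordinates.

    Assumes every coordinate lies in [0, 2), which holds for points of the
    unit hypercube after a shift smaller than 1.
    """
    top = (1 << bits) - 1
    cells = [min(int(x / 2.0 * (1 << bits)), top) for x in point]
    key = 0
    for b in range(bits - 1, -1, -1):      # most significant bit first
        for c in cells:
            key = (key << 1) | ((c >> b) & 1)
    return key


def minkowski(p, q, t=2):
    """L_t distance; t may be float('inf') for the maximum metric."""
    if t == float("inf"):
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** t for a, b in zip(p, q)) ** (1.0 / t)


def approx_k_closest_pairs(points, k, window=None, t=2):
    """Return up to k tuples (distance, i, j) found by d+1 shifted scans.

    Each scan sorts the shifted points along the curve and compares every
    point with its next `window` successors in that order (assumed rule).
    """
    n, d = len(points), len(points[0])
    window = k if window is None else window
    seen = {}                               # (i, j) -> distance, unbounded here
    for j in range(d + 1):
        shift = j / (d + 1)                 # assumed shift family
        order = sorted(
            range(n),
            key=lambda i: morton_key([x + shift for x in points[i]]),
        )
        for pos, a in enumerate(order):
            for b in order[pos + 1: pos + 1 + window]:
                pair = (min(a, b), max(a, b))
                if pair not in seen:
                    seen[pair] = minkowski(points[a], points[b], t)
    return heapq.nsmallest(k, ((dist, i, j) for (i, j), dist in seen.items()))
```

Calling `approx_k_closest_pairs(points, k)` on a list of coordinate tuples returns the best candidate pairs seen across all scans. Because neighbours along the curve are only likely, not guaranteed, to be neighbours in space, the result is approximate unless `window` is large enough to compare every pair.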

Although we are able to guarantee an \(\mathcal{O}(d^{1+{1}/{t}})\) approximation to the solution, where \(t \in \{1, 2, \ldots, \infty\}\) identifies the Minkowski (\(L_t\)) metric of interest, experimental results give the exact k closest pairs for all the large high-dimensional synthetic and real data sets considered and show that the pruning of the search space is effective. We present a thorough scaling analysis of the algorithm for in-memory and disk-resident data sets, showing that the algorithm scales well in both cases.
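For reference, the standard definition of the Minkowski distance and the value of the stated bound for the most common metrics can be written out as follows. The symbol \(\operatorname{dist}_t\) is notation chosen here, not taken from the article, and the abstract does not state the precise form in which the bound applies to the returned pairs.

```latex
\[
\operatorname{dist}_t(p,q) = \Bigl(\sum_{i=1}^{d} |p_i - q_i|^{t}\Bigr)^{1/t}
\quad (1 \le t < \infty),
\qquad
\operatorname{dist}_\infty(p,q) = \max_{1 \le i \le d} |p_i - q_i| .
\]
\[
\mathcal{O}\bigl(d^{\,1+1/t}\bigr):\qquad
t = 1 \;\Rightarrow\; \mathcal{O}(d^{2}), \qquad
t = 2 \;\Rightarrow\; \mathcal{O}(d^{3/2}), \qquad
t = \infty \;\Rightarrow\; \mathcal{O}(d).
\]
```

The exponent decreases as t grows, so the guarantee is tightest for the maximum metric and loosest for the Manhattan metric.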

Keywords

k-Closest-Pairs problem, Space Filling Curves, approximate algorithms

Copyright information

© Springer 2005

Authors and Affiliations

  1. ICAR-CNR, Università della Calabria, Rende (CS), Italy
