PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search

Abstract

Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexes so that candidate points can be retrieved quickly. However, these coarse-grained structures fail to offer accurate distance estimation for candidate points, so unnecessary points must be examined, which adds computational overhead and reduces query processing performance. In contrast, we propose a fast and accurate in-memory LSH framework, called PM-LSH, for computing c-ANN queries on large-scale, high-dimensional datasets. First, we adopt a simple yet effective PM-tree to index the data points. Second, we develop a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality. Third, we propose an efficient algorithm on top of the PM-tree to improve the performance of computing c-ANN queries. In addition, we extend PM-LSH to support closest pair (CP) search in high-dimensional spaces. Here, we again adopt the PM-tree to organize the points in a low-dimensional space, and we propose a branch and bound algorithm together with a radius pruning technique to improve the performance of computing c-approximate closest pair (c-ACP) queries. Extensive experiments with real-world data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and accuracy for both NN and CP search.
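The general idea of estimating distances in a projected low-dimensional space and then verifying candidates in the original space can be illustrated with a minimal Python sketch. This is not the PM-LSH implementation: it assumes Gaussian (2-stable) random projections, replaces the PM-tree with a plain scan over the projected points, and replaces the tunable confidence interval with a fixed candidate budget. All function names and the parameters `m` and `num_candidates` are hypothetical and chosen only for illustration.

```python
import math
import random


def make_projection(dim, m, rng):
    """Draw m random directions with i.i.d. N(0, 1) entries (assumed 2-stable hashing)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]


def project(point, directions):
    """Map a point to m dimensions by taking its dot product with each direction."""
    return [sum(a * x for a, x in zip(d, point)) for d in directions]


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def approx_nn(query, data, projected, directions, num_candidates):
    """Rank points by projected distance, then verify the top candidates exactly.

    A linear scan stands in for the index here (assumption); a tree index
    over the projected points would serve the same role more efficiently.
    """
    q_proj = project(query, directions)
    order = sorted(range(len(data)),
                   key=lambda i: euclidean(q_proj, projected[i]))
    best = min(order[:num_candidates],
               key=lambda i: euclidean(query, data[i]))
    return best, euclidean(query, data[best])


# Usage on synthetic data: 500 points in 64 dimensions, projected to 15.
rng = random.Random(0)
data = [[rng.uniform(0.0, 1.0) for _ in range(64)] for _ in range(500)]
directions = make_projection(64, 15, rng)
projected = [project(p, directions) for p in data]

query = data[42]  # a query that coincides with an indexed point
idx, dist = approx_nn(query, data, projected, directions, num_candidates=20)
```

With Gaussian directions, each projected coordinate of the difference between two points is normally distributed with variance equal to their squared original distance, so the projected distance is an unbiased indicator of the original one up to a known scale. That property is what makes a small candidate budget workable in this sketch; a larger `num_candidates` trades speed for accuracy.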

Acknowledgements

This research is supported in part by the NSFC (Grants No. 61902134, 62011530437), the Hubei Natural Science Foundation (Grant No. 2020CFB871), and the Fundamental Research Funds for the Central Universities (HUST: Grants No. 2019kfyXKJC021, 2019kfyXJJS091).

Author information

Corresponding author

Correspondence to Bolong Zheng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zheng, B., Zhao, X., Weng, L. et al. PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00680-7

Keywords

  • High-dimensional data
  • Approximate nearest neighbor
  • Closest pair
  • LSH