
Learning-based query optimization for multi-probe approximate nearest neighbor search

  • Regular Paper
  • Published in: The VLDB Journal

Abstract

Approximate nearest neighbor search (ANNS) is a fundamental problem that has attracted widespread attention for decades. Multi-probe ANNS is one of the most important classes of ANNS methods, playing a crucial role in disk-based, GPU-based, and distributed scenarios. State-of-the-art multi-probe ANNS approaches typically operate with a fixed configuration: for example, each query is dispatched to a fixed number of partitions, ANNS algorithms run locally in each partition, and the partial results are then merged into the final result set. Our observations show that such fixed configurations typically lead to a non-optimal accuracy–efficiency trade-off. To further optimize multi-probe ANNS, we propose to generate an efficient configuration for each query individually. By formalizing the per-query optimization as a 0–1 knapsack problem and its variants, we identify that the kNN distribution (the proportion of a query's k nearest neighbors placed in each partition) is essential to the optimization. We then develop LEQAT (LEarned Query-Aware OpTimizer), which leverages the kNN distribution to seek an optimal configuration for each query. LEQAT comes with (i) a machine learning model that learns and estimates kNN distributions from historical or sample queries and (ii) efficient query optimization algorithms that determine the partitions to probe and the number of neighbors to search in each partition. We apply LEQAT to three state-of-the-art ANNS methods, IVF, HNSW, and SSG, under clustering-based partitioning and evaluate the overall performance on several real-world datasets. The results show that LEQAT consistently reduces latency by up to 58% and improves throughput by up to 3.9 times.
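
To make the knapsack view concrete, the sketch below shows one way a per-query configuration could be chosen once a kNN distribution has been estimated: each partition is an item whose value is its estimated share of the query's k nearest neighbors and whose weight is its probe cost, and a greedy value-per-cost heuristic selects partitions within a cost budget. The function name, cost model, and greedy heuristic are illustrative assumptions, not LEQAT's actual optimization algorithms as specified in the paper.

# Minimal sketch (assumed names and cost model): greedy 0-1 knapsack
# heuristic for per-query partition selection.

def select_partitions(knn_distribution, probe_costs, budget):
    """Pick partitions to probe for one query.

    knn_distribution: estimated fraction of the query's k nearest
        neighbors in each partition (sums to ~1.0).
    probe_costs: estimated cost of probing each partition.
    budget: total probing cost allowed for this query.
    Returns the indices of the selected partitions.
    """
    # Rank partitions by estimated kNN mass per unit of probe cost.
    order = sorted(range(len(knn_distribution)),
                   key=lambda i: knn_distribution[i] / probe_costs[i],
                   reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        # Take a partition only if it still fits in the budget.
        if spent + probe_costs[i] <= budget:
            chosen.append(i)
            spent += probe_costs[i]
    return chosen


if __name__ == "__main__":
    # Toy example: the model predicts most neighbors lie in partitions 0 and 2.
    dist = [0.45, 0.05, 0.40, 0.10]
    costs = [1.0, 1.0, 1.2, 0.8]
    print(select_partitions(dist, costs, budget=2.5))  # prints [0, 2]

In practice the selected partitions would each be probed with a locally chosen number of search neighbors; the exact procedure is the subject of the paper's optimization algorithms.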

Notes

  1. https://github.com/FALCONN-LIB/FALCONN

  2. https://github.com/nmslib/hnswlib

  3. https://github.com/ZJULearning/SSG

Acknowledgements

This work was supported by the NSFC (61922054, 61872235, 61832017, 61729202, 61832013), the National Key Research and Development Program of China (2020YFB1710200), the Science and Technology Commission of Shanghai Municipality (STCSM) AI Project 19511120300, and the Hangzhou Qianjiang Distinguished Expert Program.

Author information

Corresponding author

Correspondence to Bin Yao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, P., Yao, B., Gao, C. et al. Learning-based query optimization for multi-probe approximate nearest neighbor search. The VLDB Journal 32, 623–645 (2023). https://doi.org/10.1007/s00778-022-00762-0
