Abstract
Approximate nearest neighbor search (ANNS) is a fundamental problem that has attracted widespread attention for decades. Multi-probe ANNS is one of the most important classes of ANNS methods, playing crucial roles in disk-based, GPU-based, and distributed scenarios. The state-of-the-art multi-probe ANNS approaches typically perform in a fixed-configuration manner. For example, each query is dispatched to a fixed number of partitions to run ANNS algorithms locally, and the results will be merged to obtain the final result set. Our observation shows that such fixed configurations typically lead to a non-optimal accuracy–efficiency trade-off. To further optimize multi-probe ANNS, we propose to generate efficient configurations for each query individually. By formalizing the per-query optimization as a 0–1 knapsack problem and its variants, we identify that the kNN distribution (the proportion of k nearest neighbors of a query placed in each partition) is essential to the optimization. Then we develop LEQAT (LEarned Query-Aware OpTimizer), which leverages kNN distribution to seek optimal configurations for each query. LEQAT comes with (i) a machine learning model to learn and estimate kNN distributions based on historical or sample queries and (ii) efficient query optimization algorithms to determine the partitions to probe and the number of searching neighbors in each partition. We apply LEQAT to three state-of-the-art ANNS methods IVF, HNSW, and SSG under clustering-based partitioning, evaluating the overall performance on several real-world datasets. The results show that LEQAT consistently reduces the latency by up to 58% and improves the throughput by up to 3.9 times.
Similar content being viewed by others
References
Adeniyi, D.A., Wei, Z., Yongquan, Y.: Automated web usage data mining and recommendation system using \(k\)-nearest neighbor (\(k\)NN) classification method. Appl. Comput. Inform. 12, 90–108 (2016)
Anagnostopoulos, C., Triantafillou, P.: Learning set cardinality in distance nearest neighbours. In: ICDM, pp. 691–696. IEEE (2015)
Andoni, A., Indyk P., Laarhoven T., Razenshteyn I., Schmidt, L.: Practical and optimal LSH for angular distance. arXiv preprint arXiv:1509.02897 (2015)
Babenko, A., Lempitsky, V.: The inverted multi-index. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1247–1260 (2014)
Babenko, A., Lempitsky, V.: Efficient indexing of billion-scale datasets of deep descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2055–2063 (2016)
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975)
Bijalwan, V., Kumar, V., Kumari, P., Pascual, J.: \(k\)NN based machine learning approach for text and document mining. Int. J. Database Theory Appl. 7, 61–70 (2014)
Chauvin, Y., Rumelhart, D.E.: Backpropagation: Theory, Architectures, and Applications. Psychology Press, London (1995)
Chen, L., Özsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. In: SIGMOD, pp. 491–502 (2005)
Chen, Q., Wang, H., Li, M., Ren, G., Li, S., Zhu, J., Li, J., Liu, C., Zhang, L., Wang, J.: SPTAG: a library for fast approximate nearest neighbor search (2018)
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on Computational Geometry, pp. 253–262 (2004)
Dong, Y., Indyk, P., Razenshteyn, I., Wagner, T.: Learning space partitions for nearest neighbor search. In: ICLR (2020)
Dudziński, K., Walukiewicz, S.: Exact methods for the knapsack problem and its generalizations. Eur. J. Oper. Res. 28(1), 3–21 (1987)
Fu, C., Wang, C., Cai, D.: Satellite system graph: towards the efficiency up-boundary of graph-based approximate nearest neighbor search. CoRR, arXiv:1907.06146 (2019)
Fu, C., Xiang, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with the navigating spreading-out graph. VLDB 12, 461–474 (2019)
Gardner, M.W., Dorling, S.: Artificial neural networks (the multilayer perceptron)-a review of applications in the atmospheric sciences. Atmos. Environ. 32(14–15), 2627–2636 (1998)
Ge, T., He, K., Ke, Q., Sun, J.: Optimized product quantization for approximate nearest neighbor search. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953 (2013)
Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. VLDB 99, 518–529 (1999)
Gray, R.: Vector quantization. IEEE Assp Mag. 1, 4–29 (1984)
Guo, G., Wang, H., Bell, D., Bi, Y., Greer, K.: \(k\)NN model-based approach in classification. In: OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, pp. 986–996 (2003)
Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: ACM (1984)
Hartigan, J.A., Wong, M.A.: Algorithm as 136: a \(k\)-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28, 100–108 (1979)
Hilbe, J.M.: Logistic Regression Models. Chapman and Hall/CRC, Boca Raton (2009)
Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. VLDB 9, 1–12 (2015)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998)
Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2010)
Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 117–128 (2011)
Jégou, H., Tavenard, R., Douze, M., Amsaleg, L.: Searching in one billion vectors: re-rank with source coding. In: ICASSP, pp. 861–864 (2011)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Trans. Big Data 7(3), 535–547 (2019)
Kalantidis, Y., Avrithis, Y.: Locally optimized product quantization for approximate nearest neighbor search. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2321–2328 (2014)
Kellerer, H., Pferschy, U., Pisinger, D.: The multiple-choice knapsack problem. In: Knapsack Problems, pp. 317–347. Springer, Berlin, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24777-7_11
Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: SIGMOD, pp. 489–504 (2018)
Li, C., Zhang, M., Andersen, D.G., He, Y.: Improving approximate nearest neighbor search through learned adaptive early termination. In: SIGMOD, pp. 2539–2554 (2020)
Li, P., Lu, H., Zheng, Q., Yang, L., Pan, G.: Lisa: a learned index structure for spatial data. In: SIGMOD, pp. 2119–2133 (2020)
Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., Lin, X.: Approximate nearest neighbor search on high dimensional data-experiments, analyses, and improvement. IEEE Trans. Knowl. Data Eng. 32, 1475–1488 (2019)
Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: 33rd International Conference on Very Large Data Bases, VLDB 2007, pp. 950–961. Association for Computing Machinery, Inc. (2007)
Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V.: Approximate nearest neighbor algorithm based on navigable small world graphs. Inf. Syst. 45, 61–68 (2014)
Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2018)
Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36, 2227–2240 (2014)
Omohundro, S.M.: Five balltree construction algorithms. In: International Computer Science Institute Berkeley (1989)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
Qin, J., Wang, W., Xiao, C., Zhang, Y.: Similarity query processing for high-dimensional data. VLDB 13(12), 3437–3440 (2020)
Robinson, J.T.: The KDB-tree: a search structure for large multidimensional dynamic indexes. In: SIGMOD, pp. 10–18 (1981)
Rui, Y., Huang, T.S., Chang, S.-F.: Image retrieval: current techniques, promising directions, and open issues. J. Vis. Commun. Image Represent. 10, 39–62 (1999)
Shamos, M.I., Hoey, D.: Closest-point problems. In: Symposium on Foundations of Computer Science, pp. 151–162 (1975)
Suchal, J., Návrat, P.: Full text search engine as scalable \(k\)-nearest neighbor recommendation system. In: International Conference on Artificial Intelligence in Theory and Practice, pp. 165–173 (2010)
Sun, J., Li, G., Tang, N.: Learned cardinality estimation for similarity queries. In: SIGMOD, pp. 1745–1757 (2021)
Sun, Y., Wang, W., Qin, J., Zhang, Y., Lin, X.: SRS: solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index. VLDB 8, 1–12 (2014)
Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., Dubey, P.: Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. VLDB 6, 1930–1941 (2013)
Wang, H., Fu, X., Xu, J., Lu, H.: Learned index for spatial queries. In: MDM, pp. 569–574. IEEE (2019)
Wang, Y., Xiao, C., Qin, J., Cao, X., Sun, Y., Wang, W., Onizuka, M.: Monotonic cardinality estimation of similarity selection: a deep learning approach. In: SIGMOD, pp. 1197–1212 (2020)
Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. VLDB 98, 194–205 (1998)
Wei, C., Wu, B., Wang, S., Lou, R., Zhan, C., Li, F., Cai, Y.: Analyticdb-v: a hybrid analytical engine towards query fusion for structured and unstructured data. VLDB 13, 3152–3165 (2020)
Xie, D., Li, F., Yao, B., Li, G., Zhou, L., Guo, M.: Simba: efficient in-memory spatial analytics. In: SIGMOD, pp. 1071–1085 (2016)
Yang, W., Li, T., Fang, G., Wei, H.: Pase: postgresql ultra-high-dimensional approximate nearest neighbor search extension. In: SIGMOD, pp. 2241–2253 (2020)
Yao, B., Li, F., Kumar, P.: \(K\) nearest neighbor queries and \(k\)NN-joins in large relational databases (almost) for free. In: IEEE International Conference on Data Engineering (ICDE), pp. 4–15 (2010)
Yu, J., Wu, J., Sarwat, M.: Geospark: a cluster computing framework for processing large-scale spatial data. In: ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 1–4 (2015)
Zhang, S., Li, X., Zong, M., Zhu, X., Cheng, D.: Learning \(k\) for \(k\)NN classification. ACM Trans. Intell. Syst. Technol. (TIST) 8, 1–19 (2017)
Zhang, W., Li, D., Xu, Y., Zhang, Y.: Shuffle-efficient distributed locality sensitive hashing on spark. In: IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 766–767 (2016)
Acknowledgements
This work was supported by the NSFC (61922054, 61872235, 61832017, 61729202, 61832013), the National Key Research and Development Program of China (2020YFB1710200), the Science and Technology Commission of Shanghai Municipality (STCSM) AI under Project 19511120300 and Hangzhou Qianjiang Distinguished Expert Program.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, P., Yao, B., Gao, C. et al. Learning-based query optimization for multi-probe approximate nearest neighbor search. The VLDB Journal 32, 623–645 (2023). https://doi.org/10.1007/s00778-022-00762-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-022-00762-0