Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

Li, Yuliang; Wang, Jianguo; Pullman, Benjamin; Bandeira, Nuno; Papakonstantinou, Yannis

doi:10.1007/s00224-020-10009-6

Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

Published: 26 October 2020

Volume 65, pages 42–83, (2021)
Cite this article

Theory of Computing Systems Aims and scope Submit manuscript

Yuliang Li ORCID: orcid.org/0000-0002-0602-149X¹,
Jianguo Wang²,
Benjamin Pullman³,
Nuno Bandeira³ &
…
Yannis Papakonstantinou³

268 Accesses
2 Citations
Explore all metrics

Abstract

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries. It provides novel optimality guarantees that exhibit good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that multiple real-world data sets from mass spectrometry, natural language process, and computer vision exhibit a certain form of data skewness and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 5

Fig. 6

Fig. 7

A Multi-way Divergence Metric for Vector Spaces

Distance-Based Index Structures for Fast Similarity Search

Article 27 July 2017

D. A. Rachkovskij

Efficient Compressed Indexing for Approximate Top-k String Retrieval

Notes

There is no need to include pairs with zero values in the list.
This optimization is equally applicable to the TA’s problem: Scan only the lists that correspond to dimensions that actually affect the function F.
Notice, the unit vector constraint enables inference about the collective weight of the unseen coordinates of a vector.
The inner product threshold problem is the special case where f_i(s[i]) = q_i ⋅s[i].
[d] is the set \(\{1, \dots , d\}\)
Recall that L_i[0] = 1.
We denote by f_i(L_i) the list \([f_{i}(L_{i}[0]), f_{i}(L_{i}[1]), \dots ]\) for every L_i.
where each f_i of the decomposable function F is the identity function
https://proteomics2.ucsd.edu/ProteoSAFe/index.jsp
The linear scan is feasible for join because the naive approach of join is quadratic.

References

Aebersold, R., Mann, M.: Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016)
Article Google Scholar
Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: PODS, pp 151–164 (2016)
Akbarinia, R., Pacitti, E., Valduriez, P.: Best position algorithms for top-k queries. In: VLDB, pp 495–506 (2007)
Anastasiu, D.C., Karypis, G.: L2AP: Fast cosine similarity search with prefix L-2 norm bounds. In: ICDE, pp 784–795 (2014)
Anastasiu, D.C., Karypis, G.: PL2AP: Fast parallel cosine similarity search. In: IA3, pp 8:1–8:8 (2015)
Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal lsh for angular distance. In: NIPS, pp 1225–1233 (2015)
André, F., Kermarrec, A.-M., Scouarnec, N.L.: Cache locality is not enough: High-performance nearest neighbor search with product quantization fast scan. PVLDB 9(4), 288–299 (2015)
Google Scholar
Arora, A., Sinha, S., Kumar, P., Bhattacharya, A.: HD-Index: Pushing the scalability-accuracy boundary for approximate knn search in high-dimensional spaces. PVLDB 11(8), 906–919 (2018)
Google Scholar
Bast, H., Majumdar, D., Schenkel, R., Theobald, M., Weikum, G.: Io-top-k: Index-access optimized top-k query processing. In: VLDB, pp 475–486 (2006)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp 131–140 (2007)
Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: ICML, pp 97–104 (2006)
Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. CSUR 33(3), 322–373 (2001)
Article Google Scholar
Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press, Cambridge (2004)
Book Google Scholar
Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: CIKM, pp 426–434 (2003)
Bruno, N., Gravano, L., Marian, A.: Evaluating top-k queries over web-accessible databases. In: ICDE, pp 369–380 (2002)
Chakrabarti, K., Chaudhuri, S., Ganti, V.: Interval-based pruning for top-k processing over compressed lists. In: ICDE, pp 709–720 (2011)
Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. PVLDB 10(10), 1058–1069 (2017)
Google Scholar
Craig, R., Cortens, J.C, Fenyo, D., Beavis, R.C.: Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 5 (8), 1843–1849 (2006)
Article Google Scholar
Cui, B., Zhao, J., Cong, G.: ISIS: A new approach for efficient similarity search in sparse databases. In: DASFAA, pp 231–245 (2010)
Curtin, R.R., Gray, A.G., Ram, P.: Fast exact max-kernel search. In: SDM, pp 1–9 (2013)
Dasari, S., Chambers, M.C., Martinez, M.A., Carpenter, K.L., Ham, A.-J.L., Vega-Montoto, L.J., Tabb, D.L.: Pepitome: Evaluating improved spectral library search for identification complementarity and quality assessment. J. Proteome Res. 11(3), 1686–95 (2012)
Article Google Scholar
De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational Geometry: Introduction. Springer, Berlin (2008)
Book Google Scholar
Deshpande, P.M., Deepak, P., Kummamuru, K.: Efficient online top-k retrieval with arbitrary similarity measures. In: EDBT, pp 356–367 (2008)
Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: SIGIR, pp 993–1002 (2011)
Doc2Vec. https://radimrehurek.com/gensim/models/doc2vec.html
Dong, W., Moses, C., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: WWW, pp 577–586 (2011)
Dutta, D., Chen, T.: Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search. Bioinformatics 23(5), 612–618 (2007)
Article Google Scholar
Eghbali, S., Tahvildari, L.: Cosine similarity search with multi index hashing. arXiv:1610.00574
Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5(11), 976–989 (1994)
Article Google Scholar
Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS, pp 102–113 (2001)
Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. JCSS 66(4), 614–656 (2003)
MathSciNet MATH Google Scholar
Fraccaro, M., Paquet, U., Winther, O.: Indexable probabilistic matrix factorization for maximum inner product search. In: AAAI, pp 1554–1560 (2016)
Fu, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with navigating spreading-out graphs. arXiv:1707.00143 (2017)
Graham, R.L.: An efficient algorithm for determining the convex hull of a finite planar set. Info. Pro. Lett. 1, 132–133 (1972)
Article Google Scholar
Güntzer, U., Balke, W.-T., Kiebling, W.: Optimizing multi-feature queries for image databases. In: VLDB, pp 419–428 (2000)
Houle, M.E., Nett, M.: Rank-based similarity search: Reducing the dimensional dependence. PAMI 37(1), 136–150 (2015)
Article Google Scholar
Hristidis, V., Koudas, N., Papakonstantinou, Y: PREFER: A system for the efficient execution of multi-parametric ranked queries. In: SIGMOD, pp 259–270 (2001)
Hu, X., Tao, Y., Yi, K.: Output-optimal parallel algorithms for similarity joins. In: PODS, pp 79–90 (2017)
Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. CSUR 40(4), 1–58 (2008)
Article Google Scholar
Img2Vec. https://github.com/christiansafka/img2vec
Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: ICDT, pp 604–613 (1998)
Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)
Article Google Scholar
Jin, W., Patel, J.M.: Efficient and generic evaluation of ranked queries. In: SIGMOD, pp 601–612 (2011)
Johnson, W.B., Lindenstrauss, J., Schechtman, G.: Extensions of lipschitz maps into banach spaces. Israel J. Math. 54(2), 129–138 (1986)
Article MathSciNet Google Scholar
Keivani, O., Sinha, K., Ram, P.: Improved maximum inner product search with better theoretical guarantees. In: IJCNN, pp 2927–2934 (2017)
Kong, A.T., Leprevost, F.V., Avtonomov, D.M., Mellacheruvu, D., Nesvizhskii, A.I.: Msfragger: Ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017)
Article Google Scholar
Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Traces and Emergence of Nonlinear Programming, pp 247–258 (2014)
Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: ICCV, pp 2130–2137 (2009)
Lam, H., Deutsch, E.W., Eddes, J.S., Eng, J.K., King, N., Stein, S.E., Aebersold, R.: Development and validation of a spectral library searching method for peptide identification from ms/ms. Proteomics 7(5) (2007)
Learned-Miller, E., Huang, G.B., RoyChowdhury, A., Li, H., Hua, G.: Labeled faces in the wild: A survey. In: Advances in face detection and facial image analysis, pp 189–248. Springer (2016)
Lee, J., Cho, H., Hwang, S.-W.: Efficient dual-resolution layer indexing for top-k queries. In: ICDE, pp 1084–1095 (2012)
Lempitsky, V.: The inverted multi-index. In: CVPR, pp 3069–3076 (2012)
Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G.: Clustering for approximate similarity search in high-dimensional spaces. TKDE 14(4), 792–808 (2002)
Google Scholar
Li, H, Chan, T.N., Yiu, M.L., Mamoulis, N.: Fexipro: Fast and exact inner product retrieval in recommender systems. In: SIGMOD, pp 835–850 (2017)
Li, W., Deng, L., Li, Y., Li, C.: Zigzag: Supporting similarity queries on vector space models. In: SIGMOD (2018)
Li, Y., Wang, J., Pullman, B., Bandeira, N., Papakonstantinou, Y.: Index-Based High-Dimensional, Cosine Threshold Querying with Optimality Guarantees. In: ICDT, vol. 127, pp 11:1–11:20 (2019)
Lian, X., Chen, L.: General cost models for evaluating dimensionality reduction in high-dimensional spaces. TKDE 21(10), 1447–1460 (2009)
Google Scholar
Liaw, Y.-C., Leou, M.-L., Wu, C.-M.: Fast exact k nearest neighbors search using an orthogonal search tree. PR 43(6), 2351–2358 (2010)
MATH Google Scholar
Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Book Google Scholar
Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. PAMI 36(11), 2227–2240 (2014)
Article Google Scholar
Mussmann, S., Ermon, S.: Learning and inference via maximum inner product search. In: ICML, pp 2587–2596 (2016)
Qin, J., Wang, Y., Xiao, C., Wang, W., Lin, X., Ishikawa, Y.: GPH: Similarity search in hamming space. In: ICDE (2018)
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)
Book Google Scholar
Ram, P., Gray, A.G.: Maximum inner-product search using cone trees. In: SIGKDD, pp 931–939 (2012)
Ramaswamy, S., Rose, K.: Adaptive cluster distance bounding for high-dimensional indexing. TKDE 23(6), 815–830 (2011)
Google Scholar
Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San Francisco (2005)
MATH Google Scholar
Savage, J.E.: Models of computation–exploring the power of computing (1998)
Shen, F., Liu, W., Zhang, S., Yang, Y., Shen, H.T.: Learning binary codes for maximum inner product search. In: ICCV, pp 4148–4156 (2015)
Silpa-Anan, C., Hartley, R.: Optimised kd-trees for fast image descriptor matching. In: CVPR, pp 1–8 (2008)
Tang, W.H., Halpern, B.R., Shilov, I.V., Seymour, S.L., Keating, S.P., Loboda, A., Patel, A.A., Schaeffer, D.A., Nuwaysir, L.M.: Discovering known and unanticipated protein modifications using ms/ms database searching. Anal. Chem. 77(13), 3931–3946 (2005)
Article Google Scholar
Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Quality and efficiency in high dimensional nearest neighbor search. In: SIGMOD, pp 563–576 (2009)
Teflioudi, C., Gemulla, R.: Exact and approximate maximum inner product search with lemp. TODS 42(1), 5:1–5:49 (2016)
MathSciNet Google Scholar
Teflioudi, C., Gemulla, R., Mykytiuk, O.: Lemp: Fast retrieval of large entries in a matrix product. In: SIGMOD, pp 107–122 (2015)
The Booking.com Dataset. https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe
Wang, J., Pérez-Santiago, J., Katz, J.E., Mallick, P., Bandeira, N.: Peptide identification from mixture tandem mass spectra. MCP 9(7), 1476–1485 (2010)
Google Scholar
Wang, J.: Query Processing of Sorted Lists on Modern Hardware. University of California, San Diego (2019)
Google Scholar
Wang, J., Lin, C., He, R., Chae, M., Papakonstantinou, Y., Swanson, S.: MILC: Inverted list compression in memory. PVLDB 10(8), 853–864 (2017)
Google Scholar
Wang, M., Bandeira, N.: Spectral library generating function for assessing spectrum-spectrum match significance. J. Proteome Res. 12 (9), 3944–3951 (2013)
Article Google Scholar
Wang, M., Carver, J.J., Bandeira, N.: Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34(8), 828–837 (2016)
Article Google Scholar
Wang, Y., Shrivastava, A., Wang, J., Ryu, J.: Randomized algorithms accelerated over cpu-gpu for ultra-high dimensional similarity search. In: SIGMOD (2018)
Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, pp 194–205 (1998)
Wu, Y., Jin, R., Zhang, X.: Fast and unified local search for random walk based k-nearest-neighbor query in large graphs. In: SIGMOD, pp 1139–1150 (2014)
Xin, D., Han, J., Chang, K.C.: Progressive and selective merge: Computing top-k with ad-hoc ranking functions. In: SIGMOD, pp 103–114 (2007)
Yates, J.R., Morgan, S.F., Gatlin, C.L., Griffin, P.R., Eng, J.K.: Method to compare collision-induced dissociation spectra of peptides: Potential for library searching and subtractive analysis. Anal. Chem. 70(17), 3557–3565 (1998)
Article Google Scholar
Yu, A., Agarwal, P.K., Yang, J.: Top-k preferences in high dimensions. In: ICDE, pp 748–759 (2014)
Zhang, S., Sun, C., He, Z.: Listmerge: Accelerating top-k aggregation queries over large number of lists. In: DASFAA, pp 67–81 (2016)
Zhang, Z., Hwang, S.-W., Chang, K.C.-C., Wang, M., Lang, C.A., Chang, Y.-C.: Boolean + ranking: Querying a database by k-constrained optimization. In: SIGMOD, pp 359–370 (2006)

Download references

Acknowledgments

We thank the anonymous reviewers of ICDT and ToCS for the very constructive and helpful comments. We are also grateful to Victor Vianu who helped us improve the presentation of the paper. This work was supported in part by the National Science Foundation (NSF) under awards BIGDATA 1447943 and ABI 1759980, and by the National Institutes of Health (NIH) under awards P41GM103484 and R24GM127667.

Author information

Authors and Affiliations

Megagon Labs, 444 Castro Street Suite 720, Mountain View, CA, 94041, USA
Yuliang Li
Department of Computer Science, Purdue University, 610 Purdue Mall, West Lafayette, IN, 47907, USA
Jianguo Wang
UC San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
Benjamin Pullman, Nuno Bandeira & Yannis Papakonstantinou

Authors

Yuliang Li
View author publications
You can also search for this author in PubMed Google Scholar
Jianguo Wang
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin Pullman
View author publications
You can also search for this author in PubMed Google Scholar
Nuno Bandeira
View author publications
You can also search for this author in PubMed Google Scholar
Yannis Papakonstantinou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yuliang Li.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Database Theory (ICDT 2019)

Guest Editor: Pablo Baceló

Rights and permissions

Reprints and permissions

About this article

Cite this article

Li, Y., Wang, J., Pullman, B. et al. Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees. Theory Comput Syst 65, 42–83 (2021). https://doi.org/10.1007/s00224-020-10009-6

Download citation

Accepted: 19 September 2020
Published: 26 October 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s00224-020-10009-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

Abstract

Access this article

Similar content being viewed by others

A Multi-way Divergence Metric for Vector Spaces

Distance-Based Index Structures for Fast Similarity Search

Efficient Compressed Indexing for Approximate Top-k String Retrieval

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

Abstract

Access this article

Similar content being viewed by others

A Multi-way Divergence Metric for Vector Spaces

Distance-Based Index Structures for Fast Similarity Search

Efficient Compressed Indexing for Approximate Top-k String Retrieval

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation