Skip to main content
Log in

Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

  • Published:
Theory of Computing Systems Aims and scope Submit manuscript

Abstract

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries. It provides novel optimality guarantees that exhibit good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that multiple real-world data sets from mass spectrometry, natural language process, and computer vision exhibit a certain form of data skewness and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

Notes

  1. There is no need to include pairs with zero values in the list.

  2. This optimization is equally applicable to the TA’s problem: Scan only the lists that correspond to dimensions that actually affect the function F.

  3. Notice, the unit vector constraint enables inference about the collective weight of the unseen coordinates of a vector.

  4. The inner product threshold problem is the special case where fi(s[i]) = qis[i].

  5. [d] is the set \(\{1, \dots , d\}\)

  6. Recall that Li[0] = 1.

  7. We denote by fi(Li) the list \([f_{i}(L_{i}[0]), f_{i}(L_{i}[1]), \dots ]\) for every Li.

  8. where each fi of the decomposable function F is the identity function

  9. https://proteomics2.ucsd.edu/ProteoSAFe/index.jsp

  10. The linear scan is feasible for join because the naive approach of join is quadratic.

References

  1. Aebersold, R., Mann, M.: Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016)

    Article  Google Scholar 

  2. Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: PODS, pp 151–164 (2016)

  3. Akbarinia, R., Pacitti, E., Valduriez, P.: Best position algorithms for top-k queries. In: VLDB, pp 495–506 (2007)

  4. Anastasiu, D.C., Karypis, G.: L2AP: Fast cosine similarity search with prefix L-2 norm bounds. In: ICDE, pp 784–795 (2014)

  5. Anastasiu, D.C., Karypis, G.: PL2AP: Fast parallel cosine similarity search. In: IA3, pp 8:1–8:8 (2015)

  6. Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal lsh for angular distance. In: NIPS, pp 1225–1233 (2015)

  7. André, F., Kermarrec, A.-M., Scouarnec, N.L.: Cache locality is not enough: High-performance nearest neighbor search with product quantization fast scan. PVLDB 9(4), 288–299 (2015)

    Google Scholar 

  8. Arora, A., Sinha, S., Kumar, P., Bhattacharya, A.: HD-Index: Pushing the scalability-accuracy boundary for approximate knn search in high-dimensional spaces. PVLDB 11(8), 906–919 (2018)

    Google Scholar 

  9. Bast, H., Majumdar, D., Schenkel, R., Theobald, M., Weikum, G.: Io-top-k: Index-access optimized top-k query processing. In: VLDB, pp 475–486 (2006)

  10. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp 131–140 (2007)

  11. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: ICML, pp 97–104 (2006)

  12. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. CSUR 33(3), 322–373 (2001)

    Article  Google Scholar 

  13. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press, Cambridge (2004)

    Book  Google Scholar 

  14. Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: CIKM, pp 426–434 (2003)

  15. Bruno, N., Gravano, L., Marian, A.: Evaluating top-k queries over web-accessible databases. In: ICDE, pp 369–380 (2002)

  16. Chakrabarti, K., Chaudhuri, S., Ganti, V.: Interval-based pruning for top-k processing over compressed lists. In: ICDE, pp 709–720 (2011)

  17. Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. PVLDB 10(10), 1058–1069 (2017)

    Google Scholar 

  18. Craig, R., Cortens, J.C, Fenyo, D., Beavis, R.C.: Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 5 (8), 1843–1849 (2006)

    Article  Google Scholar 

  19. Cui, B., Zhao, J., Cong, G.: ISIS: A new approach for efficient similarity search in sparse databases. In: DASFAA, pp 231–245 (2010)

  20. Curtin, R.R., Gray, A.G., Ram, P.: Fast exact max-kernel search. In: SDM, pp 1–9 (2013)

  21. Dasari, S., Chambers, M.C., Martinez, M.A., Carpenter, K.L., Ham, A.-J.L., Vega-Montoto, L.J., Tabb, D.L.: Pepitome: Evaluating improved spectral library search for identification complementarity and quality assessment. J. Proteome Res. 11(3), 1686–95 (2012)

    Article  Google Scholar 

  22. De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational Geometry: Introduction. Springer, Berlin (2008)

    Book  Google Scholar 

  23. Deshpande, P.M., Deepak, P., Kummamuru, K.: Efficient online top-k retrieval with arbitrary similarity measures. In: EDBT, pp 356–367 (2008)

  24. Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: SIGIR, pp 993–1002 (2011)

  25. Doc2Vec. https://radimrehurek.com/gensim/models/doc2vec.html

  26. Dong, W., Moses, C., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: WWW, pp 577–586 (2011)

  27. Dutta, D., Chen, T.: Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search. Bioinformatics 23(5), 612–618 (2007)

    Article  Google Scholar 

  28. Eghbali, S., Tahvildari, L.: Cosine similarity search with multi index hashing. arXiv:1610.00574

  29. Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5(11), 976–989 (1994)

    Article  Google Scholar 

  30. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS, pp 102–113 (2001)

  31. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. JCSS 66(4), 614–656 (2003)

    MathSciNet  MATH  Google Scholar 

  32. Fraccaro, M., Paquet, U., Winther, O.: Indexable probabilistic matrix factorization for maximum inner product search. In: AAAI, pp 1554–1560 (2016)

  33. Fu, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with navigating spreading-out graphs. arXiv:1707.00143 (2017)

  34. Graham, R.L.: An efficient algorithm for determining the convex hull of a finite planar set. Info. Pro. Lett. 1, 132–133 (1972)

    Article  Google Scholar 

  35. Güntzer, U., Balke, W.-T., Kiebling, W.: Optimizing multi-feature queries for image databases. In: VLDB, pp 419–428 (2000)

  36. Houle, M.E., Nett, M.: Rank-based similarity search: Reducing the dimensional dependence. PAMI 37(1), 136–150 (2015)

    Article  Google Scholar 

  37. Hristidis, V., Koudas, N., Papakonstantinou, Y: PREFER: A system for the efficient execution of multi-parametric ranked queries. In: SIGMOD, pp 259–270 (2001)

  38. Hu, X., Tao, Y., Yi, K.: Output-optimal parallel algorithms for similarity joins. In: PODS, pp 79–90 (2017)

  39. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. CSUR 40(4), 1–58 (2008)

    Article  Google Scholar 

  40. Img2Vec. https://github.com/christiansafka/img2vec

  41. Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: ICDT, pp 604–613 (1998)

  42. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)

    Article  Google Scholar 

  43. Jin, W., Patel, J.M.: Efficient and generic evaluation of ranked queries. In: SIGMOD, pp 601–612 (2011)

  44. Johnson, W.B., Lindenstrauss, J., Schechtman, G.: Extensions of lipschitz maps into banach spaces. Israel J. Math. 54(2), 129–138 (1986)

    Article  MathSciNet  Google Scholar 

  45. Keivani, O., Sinha, K., Ram, P.: Improved maximum inner product search with better theoretical guarantees. In: IJCNN, pp 2927–2934 (2017)

  46. Kong, A.T., Leprevost, F.V., Avtonomov, D.M., Mellacheruvu, D., Nesvizhskii, A.I.: Msfragger: Ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017)

    Article  Google Scholar 

  47. Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Traces and Emergence of Nonlinear Programming, pp 247–258 (2014)

  48. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: ICCV, pp 2130–2137 (2009)

  49. Lam, H., Deutsch, E.W., Eddes, J.S., Eng, J.K., King, N., Stein, S.E., Aebersold, R.: Development and validation of a spectral library searching method for peptide identification from ms/ms. Proteomics 7(5) (2007)

  50. Learned-Miller, E., Huang, G.B., RoyChowdhury, A., Li, H., Hua, G.: Labeled faces in the wild: A survey. In: Advances in face detection and facial image analysis, pp 189–248. Springer (2016)

  51. Lee, J., Cho, H., Hwang, S.-W.: Efficient dual-resolution layer indexing for top-k queries. In: ICDE, pp 1084–1095 (2012)

  52. Lempitsky, V.: The inverted multi-index. In: CVPR, pp 3069–3076 (2012)

  53. Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G.: Clustering for approximate similarity search in high-dimensional spaces. TKDE 14(4), 792–808 (2002)

    Google Scholar 

  54. Li, H, Chan, T.N., Yiu, M.L., Mamoulis, N.: Fexipro: Fast and exact inner product retrieval in recommender systems. In: SIGMOD, pp 835–850 (2017)

  55. Li, W., Deng, L., Li, Y., Li, C.: Zigzag: Supporting similarity queries on vector space models. In: SIGMOD (2018)

  56. Li, Y., Wang, J., Pullman, B., Bandeira, N., Papakonstantinou, Y.: Index-Based High-Dimensional, Cosine Threshold Querying with Optimality Guarantees. In: ICDT, vol. 127, pp 11:1–11:20 (2019)

  57. Lian, X., Chen, L.: General cost models for evaluating dimensionality reduction in high-dimensional spaces. TKDE 21(10), 1447–1460 (2009)

    Google Scholar 

  58. Liaw, Y.-C., Leou, M.-L., Wu, C.-M.: Fast exact k nearest neighbors search using an orthogonal search tree. PR 43(6), 2351–2358 (2010)

    MATH  Google Scholar 

  59. Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

    Book  Google Scholar 

  60. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. PAMI 36(11), 2227–2240 (2014)

    Article  Google Scholar 

  61. Mussmann, S., Ermon, S.: Learning and inference via maximum inner product search. In: ICML, pp 2587–2596 (2016)

  62. Qin, J., Wang, Y., Xiao, C., Wang, W., Lin, X., Ishikawa, Y.: GPH: Similarity search in hamming space. In: ICDE (2018)

  63. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)

    Book  Google Scholar 

  64. Ram, P., Gray, A.G.: Maximum inner-product search using cone trees. In: SIGKDD, pp 931–939 (2012)

  65. Ramaswamy, S., Rose, K.: Adaptive cluster distance bounding for high-dimensional indexing. TKDE 23(6), 815–830 (2011)

    Google Scholar 

  66. Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San Francisco (2005)

    MATH  Google Scholar 

  67. Savage, J.E.: Models of computation–exploring the power of computing (1998)

  68. Shen, F., Liu, W., Zhang, S., Yang, Y., Shen, H.T.: Learning binary codes for maximum inner product search. In: ICCV, pp 4148–4156 (2015)

  69. Silpa-Anan, C., Hartley, R.: Optimised kd-trees for fast image descriptor matching. In: CVPR, pp 1–8 (2008)

  70. Tang, W.H., Halpern, B.R., Shilov, I.V., Seymour, S.L., Keating, S.P., Loboda, A., Patel, A.A., Schaeffer, D.A., Nuwaysir, L.M.: Discovering known and unanticipated protein modifications using ms/ms database searching. Anal. Chem. 77(13), 3931–3946 (2005)

    Article  Google Scholar 

  71. Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Quality and efficiency in high dimensional nearest neighbor search. In: SIGMOD, pp 563–576 (2009)

  72. Teflioudi, C., Gemulla, R.: Exact and approximate maximum inner product search with lemp. TODS 42(1), 5:1–5:49 (2016)

    MathSciNet  Google Scholar 

  73. Teflioudi, C., Gemulla, R., Mykytiuk, O.: Lemp: Fast retrieval of large entries in a matrix product. In: SIGMOD, pp 107–122 (2015)

  74. The Booking.com Dataset. https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

  75. Wang, J., Pérez-Santiago, J., Katz, J.E., Mallick, P., Bandeira, N.: Peptide identification from mixture tandem mass spectra. MCP 9(7), 1476–1485 (2010)

    Google Scholar 

  76. Wang, J.: Query Processing of Sorted Lists on Modern Hardware. University of California, San Diego (2019)

    Google Scholar 

  77. Wang, J., Lin, C., He, R., Chae, M., Papakonstantinou, Y., Swanson, S.: MILC: Inverted list compression in memory. PVLDB 10(8), 853–864 (2017)

    Google Scholar 

  78. Wang, M., Bandeira, N.: Spectral library generating function for assessing spectrum-spectrum match significance. J. Proteome Res. 12 (9), 3944–3951 (2013)

    Article  Google Scholar 

  79. Wang, M., Carver, J.J., Bandeira, N.: Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34(8), 828–837 (2016)

    Article  Google Scholar 

  80. Wang, Y., Shrivastava, A., Wang, J., Ryu, J.: Randomized algorithms accelerated over cpu-gpu for ultra-high dimensional similarity search. In: SIGMOD (2018)

  81. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, pp 194–205 (1998)

  82. Wu, Y., Jin, R., Zhang, X.: Fast and unified local search for random walk based k-nearest-neighbor query in large graphs. In: SIGMOD, pp 1139–1150 (2014)

  83. Xin, D., Han, J., Chang, K.C.: Progressive and selective merge: Computing top-k with ad-hoc ranking functions. In: SIGMOD, pp 103–114 (2007)

  84. Yates, J.R., Morgan, S.F., Gatlin, C.L., Griffin, P.R., Eng, J.K.: Method to compare collision-induced dissociation spectra of peptides: Potential for library searching and subtractive analysis. Anal. Chem. 70(17), 3557–3565 (1998)

    Article  Google Scholar 

  85. Yu, A., Agarwal, P.K., Yang, J.: Top-k preferences in high dimensions. In: ICDE, pp 748–759 (2014)

  86. Zhang, S., Sun, C., He, Z.: Listmerge: Accelerating top-k aggregation queries over large number of lists. In: DASFAA, pp 67–81 (2016)

  87. Zhang, Z., Hwang, S.-W., Chang, K.C.-C., Wang, M., Lang, C.A., Chang, Y.-C.: Boolean + ranking: Querying a database by k-constrained optimization. In: SIGMOD, pp 359–370 (2006)

Download references

Acknowledgments

We thank the anonymous reviewers of ICDT and ToCS for the very constructive and helpful comments. We are also grateful to Victor Vianu who helped us improve the presentation of the paper. This work was supported in part by the National Science Foundation (NSF) under awards BIGDATA 1447943 and ABI 1759980, and by the National Institutes of Health (NIH) under awards P41GM103484 and R24GM127667.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yuliang Li.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Database Theory (ICDT 2019)

Guest Editor: Pablo Baceló

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, Y., Wang, J., Pullman, B. et al. Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees. Theory Comput Syst 65, 42–83 (2021). https://doi.org/10.1007/s00224-020-10009-6

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00224-020-10009-6

Keywords

Navigation