Abstract
A Locality-Sensitive Hash (LSH) function is called \((r, cr, p_1, p_2)\)-sensitive if two data points at distance less than r collide with probability at least \(p_1\), while data points at distance greater than cr collide with probability at most \(p_2\). These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular, one may build a c-approximate nearest neighbour data structure with query time \(\tilde{O}(n^\rho /p_1)\), where \(\rho =\frac{\log 1/p_1}{\log 1/p_2}\in (0,1)\). This is sub-linear as long as \(p_1\) is not too small. Such an algorithm is significant, since most high-dimensional nearest neighbour problems suffer from the curse of dimensionality and cannot be solved exactly faster than a brute-force linear-time scan of the database.
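To make the definition concrete, the following minimal sketch uses bit sampling for Hamming distance, a classic LSH family that is not specific to this paper: the hash of a binary vector is the value of one uniformly random coordinate, so two vectors collide with probability equal to the fraction of coordinates on which they agree. The function names, the dimension, and the radii are illustrative assumptions.

```python
import math


def bit_sampling_collision_probability(x, y):
    """Exact probability that h_i(x) == h_i(y) for a uniformly random coordinate i."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    agreeing = sum(1 for a, b in zip(x, y) if a == b)
    return agreeing / len(x)


def bit_sampling_sensitivity(dim, r, c):
    """(p1, p2) of bit sampling on {0,1}^dim at Hamming radii r and c*r."""
    return 1 - r / dim, 1 - c * r / dim


def rho(p1, p2):
    """Exponent rho = log(1/p1) / log(1/p2) from the query-time bound."""
    return math.log(1 / p1) / math.log(1 / p2)


# Illustrative (hypothetical) data: two 8-bit vectors at Hamming distance 2.
x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 0, 1, 1]
collision = bit_sampling_collision_probability(x, y)

# Sensitivity parameters for r = 2 and approximation factor c = 2.
p1, p2 = bit_sampling_sensitivity(dim=8, r=2, c=2)
print(collision, p1, p2, rho(p1, p2))
```

Here the two vectors agree on 6 of 8 coordinates, so they collide with probability 0.75, which matches \(p_1 = 1 - r/8\) for r = 2. Points at distance 4 or more collide with probability at most \(p_2 = 0.5\), giving \(\rho = \log_2(4/3) \approx 0.415\).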
Unfortunately, many of the best LSH functions tend to have very low collision probabilities, including the best functions for Cosine and Jaccard Similarity. This means that the \(n^\rho /p_1\) query time of LSH is often not sub-linear after all, even for approximate nearest neighbours!
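A small numeric sketch illustrates the problem. It evaluates the classic bound \(n^\rho /p_1\) while ignoring the polylogarithmic factors hidden by \(\tilde{O}\). The database size and the two \((p_1, p_2)\) pairs are hypothetical values chosen only for illustration, not measurements of any particular hash family.

```python
import math


def rho(p1, p2):
    """Exponent rho = log(1/p1) / log(1/p2)."""
    return math.log(1 / p1) / math.log(1 / p2)


def classic_query_bound(n, p1, p2):
    """n^rho / p1, ignoring the polylog factors hidden by the O-tilde."""
    return n ** rho(p1, p2) / p1


n = 10**6  # assumed database size

# Both pairs have rho = 1/2; only the magnitude of p1 differs.
large_p1_bound = classic_query_bound(n, 0.5, 0.25)
small_p1_bound = classic_query_bound(n, 1e-4, 1e-8)

for name, bound in [("large p1", large_p1_bound), ("small p1", small_p1_bound)]:
    print(f"{name}: bound={bound:.3g}, sub-linear={bound < n}")
```

With \(p_1 = 0.5\) the bound is roughly \(2\cdot 10^3\), far below n. With \(p_1 = 10^{-4}\) it is roughly \(10^7\), which exceeds n even though \(\rho\) is the same in both cases.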
In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to \(\tilde{O}(n^\rho /p_1^{1-\rho })\) (and the space usage correspondingly). Since \(n^\rho /p_1^{1-\rho } < n \Leftrightarrow p_1 > n^{-1}\), our algorithm obtains sub-linear query time for every collision probability \(p_1\) greater than 1/n. For \(p_1\) and \(p_2\) small enough, our improvement over all previous methods can be up to a factor n in both query time and space.
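The equivalence used here follows in a few steps from \(\rho \in (0,1)\), which makes the exponent \(1-\rho\) positive. The worked derivation below relies only on the two bounds stated in the abstract:

\[
\frac{n^{\rho}}{p_1^{1-\rho}} < n
\;\Longleftrightarrow\;
n^{-(1-\rho)} < p_1^{1-\rho}
\;\Longleftrightarrow\;
\left(\frac{1}{n}\right)^{1-\rho} < p_1^{1-\rho}
\;\Longleftrightarrow\;
p_1 > \frac{1}{n}.
\]

Dividing the classic bound by the improved one gives the gain of the new bound as a function of \(p_1\) and \(\rho\):

\[
\frac{n^{\rho}/p_1}{n^{\rho}/p_1^{1-\rho}} = p_1^{-\rho}.
\]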
The improvement comes from a simple change to the Indyk-Motwani algorithm, which we call “LSH with High-Low Tables”. This technique can easily be implemented in existing software packages.
Notes
- 2. If we don’t know how many points will be inserted, several black-box reductions allow transforming LSH into a dynamic data structure.
References
Abboud, A., Rubinstein, A., Williams, R.: Distributed PCP theorems for hardness of approximation in P. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 25–36. IEEE (2017)
Ahle, T.D., Aumüller, M., Pagh, R.: Parameter-free locality sensitive hashing for spherical range reporting. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 239–256. SIAM (2017)
Ahle, T.D.: Optimal Las Vegas locality sensitive data structures. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 938–949. IEEE (2017)
Ahle, T.D., Knudsen, J.B.T.: Subsets and supermajorities: optimal hashing-based set similarity search. arXiv preprint arXiv:1904.04045 (2020)
Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 151–164. ACM (2016)
Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal LSH for angular distance. In: Advances in Neural Information Processing Systems, pp. 1225–1233 (2015)
Andoni, A., Razenshteyn, I., Nosatzki, N.S.: LSH forest: practical algorithms made theoretical. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 67–78. SIAM (2017)
Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web, pp. 651–660 (2005)
Becker, A., Ducas, L., Gama, N., Laarhoven, T.: New directions in nearest neighbor searching with applications to lattice sieving. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 10–24. SIAM (2016)
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 327–336. ACM (1998)
Christiani, T.: Fast locality-sensitive hashing frameworks for approximate near neighbor search. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 3–17. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_1
Christiani, T., Pagh, R.: Set similarity search beyond MinHash. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, 19–23 June 2017, pp. 1094–1107 (2017)
Christiani, T., Pagh, R., Thorup, M.: Confirmation sampling for exact nearest neighbor search. arXiv preprint arXiv:1812.02603 (2018)
Christiani, T.L., Pagh, R., Aumüller, M., Vesterli, M.E.: PUFFINN: parameterless and universally fast finding of nearest neighbors. In: European Symposium on Algorithms, pp. 1–16 (2019)
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262. ACM (2004)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM (1998)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Data Sets. Cambridge University Press, Cambridge (2020)
Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 9(9), 636–647 (2016)
Razenshteyn, I., Schmidt, L.: FALCONN-fast lookups of cosine and other nearest neighbors (2018)
Wei, A.: Optimal Las Vegas approximate near neighbors in \(\ell_p\). In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1794–1813. SIAM (2019)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ahle, T.D. (2020). On the Problem of \(p_1^{-1}\) in Locality-Sensitive Hashing. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_7
DOI: https://doi.org/10.1007/978-3-030-60936-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60935-1
Online ISBN: 978-3-030-60936-8
eBook Packages: Computer Science, Computer Science (R0)