On the Problem of $p_1^{-1}$ in Locality-Sensitive Hashing

Ahle, Thomas Dybdahl

doi:10.1007/978-3-030-60936-8_7

Thomas Dybdahl Ahle (ORCID: orcid.org/0000-0001-9747-0479)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12440)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

A Locality-Sensitive Hash (LSH) function is called $(r, cr, p_1, p_2)$-sensitive if two data points at distance less than $r$ collide with probability at least $p_1$, while data points at distance greater than $cr$ collide with probability at most $p_2$. These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular, one may build a $c$-approximate nearest neighbour data structure with query time $\tilde{O}(n^\rho /p_1)$, where $\rho =\frac{\log 1/p_1}{\log 1/p_2}\in (0,1)$. This is sub-linear as long as $p_1$ is not too small. Such an algorithm is significant, since most high-dimensional nearest neighbour problems suffer from the curse of dimensionality and cannot be solved exactly faster than a brute-force linear-time scan of the database.
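To make the sensitivity definition concrete, the following minimal sketch uses bit sampling over Hamming space, a classic LSH family that is not discussed in this abstract and is chosen here only as an illustration. A hash function picks one random coordinate, so two binary vectors of length $d$ at Hamming distance $t$ collide with probability $1 - t/d$. All function names and the example vectors are hypothetical.

```python
import random

def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def exact_collision_probability(x, y):
    """Collision probability of bit sampling: 1 - hamming(x, y) / d."""
    return 1 - hamming(x, y) / len(x)

def empirical_collision_rate(x, y, trials=20000, seed=0):
    """Estimate the collision probability by drawing random hash functions.

    Each trial draws one hash function h(v) = v[i] for a uniformly random
    coordinate i and checks whether h(x) == h(y).
    """
    rng = random.Random(seed)
    d = len(x)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(d)
        hits += x[i] == y[i]
    return hits / trials

# Illustrative data (assumed): d = 100, r = 10, c = 2.
d = 100
x = [0] * d
near = [1] * 10 + [0] * 90   # Hamming distance r = 10 from x
far = [1] * 20 + [0] * 80    # Hamming distance cr = 20 from x

p1 = exact_collision_probability(x, near)
p2 = exact_collision_probability(x, far)
print(p1, p2, empirical_collision_rate(x, near))
```

Under these assumptions the family is $(10, 20, p_1, p_2)$-sensitive with $p_1 = 1 - 10/100$ and $p_2 = 1 - 20/100$. Both values are large here; the concern raised in the abstract arises for families where $p_1$ is very small.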

Unfortunately, many of the best LSH functions tend to have very low collision probabilities, including the best functions for Cosine and Jaccard Similarity. This means that the $n^\rho /p_1$ query time of LSH is often not sub-linear after all, even for approximate nearest neighbours!

In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to $\tilde{O}(n^\rho /p_1^{1-\rho })$ (and the space usage correspondingly). Since $n^\rho /p_1^{1-\rho } < n \Leftrightarrow p_1 > n^{-1}$, our algorithm always obtains sub-linear query time for all collision probabilities greater than $1/n$. For $p_1$ and $p_2$ small enough, our improvement over all previous methods can be up to a factor of $n$ in both query time and space.
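The equivalence follows by dividing both sides of $n^\rho /p_1^{1-\rho } < n$ by $n^\rho$, which gives $(1/p_1)^{1-\rho } < n^{1-\rho }$; since $1-\rho > 0$, this holds exactly when $1/p_1 < n$. The sketch below evaluates both bounds numerically, ignoring the polylogarithmic factors hidden by $\tilde{O}$. The function names and the parameter values ($n = 10^6$, $p_1 = 10^{-4}$, $p_2 = 10^{-8}$) are assumptions picked for illustration and do not come from the paper.

```python
import math

def rho(p1, p2):
    """LSH exponent rho = log(1/p1) / log(1/p2), for 0 < p2 < p1 < 1."""
    return math.log(1 / p1) / math.log(1 / p2)

def classic_bound(n, p1, p2):
    """Indyk-Motwani query-time bound n^rho / p1 (polylog factors dropped)."""
    return n ** rho(p1, p2) / p1

def high_low_bound(n, p1, p2):
    """Improved query-time bound n^rho / p1^(1 - rho) (polylog factors dropped)."""
    r = rho(p1, p2)
    return n ** r / p1 ** (1 - r)

# Illustrative parameters (assumed): p1 > 1/n, but p1 is still small.
n, p1, p2 = 10**6, 1e-4, 1e-8

classic = classic_bound(n, p1, p2)
improved = high_low_bound(n, p1, p2)

print(f"rho      = {rho(p1, p2):.3f}")
print(f"classic  = {classic:.3g}  (sub-linear: {classic < n})")
print(f"improved = {improved:.3g}  (sub-linear: {improved < n})")
```

With these values the classic bound exceeds $n$ while the improved bound stays below it. The ratio between the two bounds is $p_1^{-\rho }$, which approaches $n$ as $p_1$ approaches $1/n$ and $\rho$ approaches 1.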

The improvement comes from a simple change to the Indyk-Motwani algorithm, which we call “LSH with High-Low Tables”. This technique can easily be implemented in existing software packages.

Notes

1. In general we expect the exact problem to be impossible to solve in sub-linear time, given the hardness results of [1, 5]. However, for practical datasets it is often possible.
2. If we don’t know how many points will be inserted, several black box reductions allow transforming LSH into a dynamic data structure.

References

1. Abboud, A., Rubinstein, A., Williams, R.: Distributed PCP theorems for hardness of approximation in P. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 25–36. IEEE (2017)
2. Ahle, T.D., Aumüller, M., Pagh, R.: Parameter-free locality sensitive hashing for spherical range reporting. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 239–256. SIAM (2017)
3. Ahle, T.D.: Optimal Las Vegas locality sensitive data structures. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 938–949. IEEE (2017)
4. Ahle, T.D., Knudsen, J.B.T.: Subsets and supermajorities: optimal hashing-based set similarity search. arXiv preprint arXiv:1904.04045 (2020)
5. Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 151–164. ACM (2016)
6. Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal LSH for angular distance. In: Advances in Neural Information Processing Systems, pp. 1225–1233 (2015)
7. Andoni, A., Razenshteyn, I., Nosatzki, N.S.: LSH forest: practical algorithms made theoretical. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 67–78. SIAM (2017)
8. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web, pp. 651–660 (2005)
9. Becker, A., Ducas, L., Gama, N., Laarhoven, T.: New directions in nearest neighbor searching with applications to lattice sieving. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 10–24. SIAM (2016)
10. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 327–336. ACM (1998)
11. Christiani, T.: Fast locality-sensitive hashing frameworks for approximate near neighbor search. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 3–17. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_1
12. Christiani, T., Pagh, R.: Set similarity search beyond MinHash. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, 19–23 June 2017, pp. 1094–1107 (2017)
13. Christiani, T., Pagh, R., Thorup, M.: Confirmation sampling for exact nearest neighbor search. arXiv preprint arXiv:1812.02603 (2018)
14. Christiani, T.L., Pagh, R., Aumüller, M., Vesterli, M.E.: PUFFINN: parameterless and universally fast finding of nearest neighbors. In: European Symposium on Algorithms, pp. 1–16 (2019)
15. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262. ACM (2004)
16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM (1998)
17. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Data Sets. Cambridge University Press, Cambridge (2020)
18. Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 9(9), 636–647 (2016)
19. Razenshteyn, I., Schmidt, L.: FALCONN-fast lookups of cosine and other nearest neighbors (2018)
20. Wei, A.: Optimal Las Vegas approximate near neighbors in $\ell_p$. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1794–1813. SIAM (2019)

Author information

Authors and Affiliations

IT University and BARC, Copenhagen, Denmark
Thomas Dybdahl Ahle

Corresponding author

Correspondence to Thomas Dybdahl Ahle.

Editor information

Editors and Affiliations

National Institute of Informatics, Tokyo, Japan
Shin'ichi Satoh
ISTI-CNR, Pisa, Italy
Lucia Vadicamo
University of Southern Denmark, Odense M, Denmark
Arthur Zimek
ISTI-CNR, Pisa, Italy
Fabio Carrara
University of Bologna, Bologna, Italy
Ilaria Bartolini
IT University of Copenhagen, Copenhagen, Denmark
Martin Aumüller
IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
IT University of Copenhagen, Copenhagen, Denmark
Rasmus Pagh

About this paper

Cite this paper

Ahle, T.D. (2020). On the Problem of $p_1^{-1}$ in Locality-Sensitive Hashing. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_7

DOI: https://doi.org/10.1007/978-3-030-60936-8_7
Published: 14 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60935-1
Online ISBN: 978-3-030-60936-8
eBook Packages: Computer Science (R0)
