On the Problem of \(p_1^{-1}\) in Locality-Sensitive Hashing

  • Conference paper
Similarity Search and Applications (SISAP 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12440)

Abstract

A Locality-Sensitive Hash (LSH) function is called \((r,cr,p_1,p_2)\)-sensitive if two data points at distance less than r collide with probability at least \(p_1\), while data points at distance greater than cr collide with probability at most \(p_2\). These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular, one may build a c-approximate nearest neighbour data structure with query time \(\tilde{O}(n^\rho /p_1)\), where \(\rho =\frac{\log 1/p_1}{\log 1/p_2}\in (0,1)\). This is sub-linear as long as \(p_1\) is not too small. Such an algorithm is significant, since most high-dimensional nearest neighbour problems suffer from the curse of dimensionality and cannot be solved exactly faster than a brute-force linear-time scan of the database.

Unfortunately many of the best LSH functions tend to have very low collision probabilities, including the best functions for Cosine and Jaccard Similarity. This means that the \(n^\rho /p_1\) query time of LSH is often not sub-linear after all, even for approximate nearest neighbours!
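
To make "not too small" concrete, the following short derivation (a sketch that ignores the polylogarithmic factors hidden by \(\tilde{O}\)) rearranges the classical bound to show when it stays below n:

```latex
\[
\frac{n^{\rho}}{p_1} < n
\;\Longleftrightarrow\;
p_1 > n^{\rho - 1} = n^{-(1-\rho)} .
\]
```

As an illustrative case with assumed values, \(n = 10^6\) and \(\rho = 1/2\) require \(p_1 > 10^{-3}\); any smaller collision probability makes \(n^\rho/p_1\) exceed n.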

In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to \(\tilde{O}(n^\rho /p_1^{1-\rho })\), with a corresponding reduction in space usage. Since \(n^\rho /p_1^{1-\rho } < n \Leftrightarrow p_1 > n^{-1}\), our algorithm always obtains sub-linear query time, for all collision probabilities at least 1/n. For \(p_1\) and \(p_2\) small enough, our improvement over all previous methods can be up to a factor n in both query time and space.
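
The two query-time bounds stated in the abstract can be compared numerically. The sketch below only evaluates the formulas \(n^\rho/p_1\) and \(n^\rho/p_1^{1-\rho}\), dropping constants and polylogarithmic factors; it does not implement the High-Low Tables algorithm itself. The function names and the sample values of n, \(p_1\) and \(p_2\) are assumptions chosen for illustration.

```python
import math


def rho(p1: float, p2: float) -> float:
    """LSH exponent rho = log(1/p1) / log(1/p2), for 0 < p2 < p1 < 1."""
    if not 0.0 < p2 < p1 < 1.0:
        raise ValueError("expected 0 < p2 < p1 < 1")
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def classical_bound(n: int, p1: float, p2: float) -> float:
    """Classical bound n^rho / p1 (constants and log factors dropped)."""
    return n ** rho(p1, p2) / p1


def improved_bound(n: int, p1: float, p2: float) -> float:
    """Improved bound n^rho / p1^(1 - rho) (constants and log factors dropped)."""
    r = rho(p1, p2)
    return n ** r / p1 ** (1.0 - r)


# Hypothetical parameters: a million points and a low collision probability.
n, p1, p2 = 10**6, 1e-4, 1e-8
print("rho       :", rho(p1, p2))
print("classical :", classical_bound(n, p1, p2))   # exceeds n for these values
print("improved  :", improved_bound(n, p1, p2))    # stays below n since p1 > 1/n
```

With these assumed values \(\rho\) is about 1/2, so the classical bound is about \(10^7\) (worse than a linear scan over \(10^6\) points), while the improved bound is about \(10^5\). The ratio of the two bounds is \(p_1^{-\rho}\), which approaches n as \(p_1\) approaches 1/n and \(\rho\) approaches 1.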

The improvement comes from a simple change to the Indyk-Motwani algorithm, which we call “LSH with High-Low Tables”. This technique can easily be implemented in existing software packages.

Notes

  1. In general we expect the exact problem to be impossible to solve in sub-linear time, given the hardness results of [1, 5]. However, for practical datasets it is often possible.

  2. If we don’t know how many points will be inserted, several black box reductions allow transforming LSH into a dynamic data structure.

References

  1. Abboud, A., Rubinstein, A., Williams, R.: Distributed PCP theorems for hardness of approximation in P. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 25–36. IEEE (2017)

  2. Ahle, T.D., Aumüller, M., Pagh, R.: Parameter-free locality sensitive hashing for spherical range reporting. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 239–256. SIAM (2017)

  3. Ahle, T.D.: Optimal Las Vegas locality sensitive data structures. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 938–949. IEEE (2017)

  4. Ahle, T.D., Knudsen, J.B.T.: Subsets and supermajorities: optimal hashing-based set similarity search. arXiv preprint arXiv:1904.04045 (2020)

  5. Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 151–164. ACM (2016)

  6. Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal LSH for angular distance. In: Advances in Neural Information Processing Systems, pp. 1225–1233 (2015)

  7. Andoni, A., Razenshteyn, I., Nosatzki, N.S.: LSH forest: practical algorithms made theoretical. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 67–78. SIAM (2017)

  8. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web, pp. 651–660 (2005)

  9. Becker, A., Ducas, L., Gama, N., Laarhoven, T.: New directions in nearest neighbor searching with applications to lattice sieving. In: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 10–24. SIAM (2016)

  10. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 327–336. ACM (1998)

  11. Christiani, T.: Fast locality-sensitive hashing frameworks for approximate near neighbor search. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 3–17. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_1

  12. Christiani, T., Pagh, R.: Set similarity search beyond MinHash. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, 19–23 June 2017, pp. 1094–1107 (2017)

  13. Christiani, T., Pagh, R., Thorup, M.: Confirmation sampling for exact nearest neighbor search. arXiv preprint arXiv:1812.02603 (2018)

  14. Christiani, T.L., Pagh, R., Aumüller, M., Vesterli, M.E.: PUFFINN: parameterless and universally fast finding of nearest neighbors. In: European Symposium on Algorithms, pp. 1–16 (2019)

  15. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262. ACM (2004)

  16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM (1998)

  17. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Data Sets. Cambridge University Press, Cambridge (2020)

  18. Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 9(9), 636–647 (2016)

  19. Razenshteyn, I., Schmidt, L.: FALCONN-fast lookups of cosine and other nearest neighbors (2018)

  20. Wei, A.: Optimal Las Vegas approximate near neighbors in \(\ell_p\). In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1794–1813. SIAM (2019)

Author information

Correspondence to Thomas Dybdahl Ahle.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Ahle, T.D. (2020). On the Problem of \(p_1^{-1}\) in Locality-Sensitive Hashing. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_7

  • DOI: https://doi.org/10.1007/978-3-030-60936-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60935-1

  • Online ISBN: 978-3-030-60936-8

  • eBook Packages: Computer Science; Computer Science (R0)
