Approximately Minwise Independence with Twisted Tabulation

  • Conference paper
Algorithm Theory – SWAT 2014 (SWAT 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8503)

Abstract

A random hash function h is ε-minwise if, for any set S with |S| = n and any element x ∈ S, \(\Pr[h(x)=\min h(S)]=(1\pm\varepsilon )/n\). Minwise hash functions with low bias ε have widespread applications within similarity estimation.
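
The link to similarity estimation can be illustrated with a small sketch. For an exactly minwise h, the probability that two sets A and B share the same minimum-hash element equals their Jaccard similarity |A ∩ B| / |A ∪ B|, so averaging over k independent hash functions gives an estimate of it. The function name `minhash_jaccard`, the parameters `k` and `seed`, and the use of Python's `random` module as a stand-in for a fully random hash function are assumptions made for this illustration, not part of the paper.

```python
import random

def minhash_jaccard(a, b, k=200, seed=0):
    """Estimate the Jaccard similarity of non-empty sets a and b.

    Each round draws a fresh random value per element, which simulates
    a fully random hash function (an idealised stand-in, assumed here).
    """
    rng = random.Random(seed)
    universe = sorted(a | b)  # assumes the elements are sortable
    matches = 0
    for _ in range(k):
        h = {x: rng.random() for x in universe}
        # Compare the elements that attain the minimum hash in each set.
        if min(a, key=h.__getitem__) == min(b, key=h.__getitem__):
            matches += 1
    return matches / k
```

With an ε-minwise hash function in place of the idealised one, each round would carry a relative bias of at most ε, which is why a low bias matters in this application.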

When hashing from a universe [u], the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA’13] makes c = O(1) lookups in tables of size \(u^{1/c}\). Twisted tabulation was invented to get good concentration for hashing-based sampling. Here we show that twisted tabulation yields \(\tilde O(1/u^{1/c})\)-minwise hashing.
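
A minimal sketch of the lookup structure, following one common description of twisted tabulation: the key is split into c characters, the c − 1 tail characters index tables whose entries hold a twister character and a hash value, and the head character is XORed with the combined twister before indexing the head table. The 32-bit key and hash widths, the choice of the lowest byte as the head, the names `make_tables` and `twisted_tabulation`, and the use of Python's `random` module to fill the tables are all assumptions made for illustration.

```python
import random

C = 4           # characters per key (assumed)
CHAR_BITS = 8   # bits per character, so u = 2**32 and each table has u**(1/C) = 256 entries
HASH_BITS = 32  # output width (assumed)

def make_tables(seed=0):
    """Fill one head table and C - 1 tail tables with random entries."""
    rng = random.Random(seed)
    size = 1 << CHAR_BITS
    head = [rng.getrandbits(HASH_BITS) for _ in range(size)]
    # Each tail entry is a pair (twister character, hash value).
    tails = [
        [(rng.getrandbits(CHAR_BITS), rng.getrandbits(HASH_BITS)) for _ in range(size)]
        for _ in range(C - 1)
    ]
    return head, tails

def twisted_tabulation(x, head, tails):
    """Hash a 32-bit key with C table lookups."""
    mask = (1 << CHAR_BITS) - 1
    head_char = x & mask  # lowest byte acts as the head (a convention chosen here)
    twister = 0
    h = 0
    for i, table in enumerate(tails, start=1):
        ch = (x >> (i * CHAR_BITS)) & mask
        t, v = table[ch]
        twister ^= t
        h ^= v
    # The head lookup is "twisted" by the XOR of the tail twisters.
    return h ^ head[head_char ^ twister]
```

A call such as `twisted_tabulation(0x12345678, *make_tables())` performs exactly C = 4 lookups, one per character.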

In the classic independence paradigm of Wegman and Carter [FOCS’79], \(\tilde O(1/u^{1/c})\)-minwise hashing requires Ω(log u)-independence [Indyk SODA’99]. Pǎtraşcu and Thorup [STOC’11] had shown that simple tabulation, using the same space and lookups, yields \(\tilde O(1/n^{1/c})\)-minwise independence, which is good for large sets but useless for small sets. Our analysis uses some of the same methods but is much cleaner, bypassing a complicated induction argument.
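
As a rough illustration of why the dependence on n matters, ignore the polylogarithmic factors hidden by \(\tilde O\) and take the assumed values \(u = 2^{32}\), \(c = 4\) and \(n = 16\), chosen only for this example:

\[
\frac{1}{u^{1/c}} = \frac{1}{2^{8}} \approx 0.0039,
\qquad
\frac{1}{n^{1/c}} = \frac{1}{16^{1/4}} = \frac{1}{2}.
\]

For a set this small, the simple tabulation bound therefore permits a constant bias, whereas the twisted tabulation bound does not depend on n.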

Research partly supported by Thorup’s Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career programme.

References

  1. Broder, A.Z.: On the resemblance and containment of documents. In: Proc. Compression and Complexity of Sequences (SEQUENCES), pp. 21–29 (1997)

  2. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. Journal of Computer and System Sciences 60(3), 630–659 (2000); See also STOC 1998

  3. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Computer Networks 29, 1157–1166 (1997)

  4. Datar, M., Muthukrishnan, S.M.: Estimating rarity and similarity over data stream windows. In: Möhring, R.H., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 323–334. Springer, Heidelberg (2002)

  5. Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000)

  6. Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: Proc. 10th WWW, pp. 141–150 (2007)

  7. Yang, H., Callan, J.P.: Near-duplicate detection by instance-level constrained clustering. In: Proc. 29th SIGIR, pp. 421–428 (2006)

  8. Henzinger, M.R.: Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proc. ACM SIGIR, pp. 284–291 (2006)

  9. Li, P., Shrivastava, A., Moore, J.L., König, A.C.: Hashing algorithms for large-scale learning. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2011)

  10. Bachrach, Y., Herbrich, R., Porat, E.: Sketching algorithms for approximating rank correlations in collaborative filtering systems. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 344–352. Springer, Heidelberg (2009)

  11. Bachrach, Y., Porat, E., Rosenschein, J.S.: Sketching techniques for collaborative filtering. In: Proc. 21st IJCAI, pp. 2016–2021 (2009)

  12. Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C.: Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13(1), 64–78 (2001)

  13. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: Local algorithms for document fingerprinting. In: Proc. SIGMOD, pp. 76–85 (2003)

  14. Wegman, M.N., Carter, L.: New classes and applications of hash functions. Journal of Computer and System Sciences 22(3), 265–279 (1981); See also FOCS 1979

  15. Indyk, P.: A small approximately min-wise independent family of hash functions. Journal of Algorithms 38(1), 84–90 (2001); See also SODA 1999

  16. Pǎtraşcu, M., Thorup, M.: On the k-independence required by linear probing and minwise independence. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 715–726. Springer, Heidelberg (2010)

  17. Zobrist, A.L.: A new hashing method with application for game playing. Technical Report 88, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin (1970)

  18. Pǎtraşcu, M., Thorup, M.: The power of simple tabulation-based hashing. Journal of the ACM 59(3), Article 14 (2012); Announced at STOC 2011

  19. Pǎtraşcu, M., Thorup, M.: Twisted tabulation hashing. In: Proc. 24th ACM/SIAM Symposium on Discrete Algorithms (SODA), pp. 209–228 (2013)

  20. Thorup, M.: Simple tabulation, fast expanders, double tabulation, and high independence. In: FOCS, pp. 90–99 (2013)

  21. Klassen, T.Q., Woelfel, P.: Independence of tabulation-based hash classes. In: Proc. 10th Latin American Theoretical Informatics (LATIN), pp. 506–517 (2012)

  22. Thorup, M., Zhang, Y.: Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing 41(2), 293–331 (2012); Announced at SODA 2004 and ALENEX 2010

  23. Thorup, M.: Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In: Proc. 45th ACM Symposium on Theory of Computing, STOC (2013)

  24. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008)

  25. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 807–814 (2007)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Dahlgaard, S., Thorup, M. (2014). Approximately Minwise Independence with Twisted Tabulation. In: Ravi, R., Gørtz, I.L. (eds) Algorithm Theory – SWAT 2014. SWAT 2014. Lecture Notes in Computer Science, vol 8503. Springer, Cham. https://doi.org/10.1007/978-3-319-08404-6_12

  • DOI: https://doi.org/10.1007/978-3-319-08404-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08403-9

  • Online ISBN: 978-3-319-08404-6

  • eBook Packages: Computer Science, Computer Science (R0)
