Approximately Minwise Independence with Twisted Tabulation

Dahlgaard, Søren; Thorup, Mikkel

doi:10.1007/978-3-319-08404-6_12


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8503)

Included in the following conference series:

Scandinavian Workshop on Algorithm Theory

Abstract

A random hash function h is ε-minwise if, for any set S with |S| = n and any element x ∈ S, \(\Pr[h(x)=\min h(S)]=(1\pm\varepsilon )/n\). Minwise hash functions with low bias ε have widespread applications within similarity estimation.
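
To illustrate why a low bias ε matters for similarity estimation, here is a minimal Python sketch of the standard min-hash estimator. For an ε-minwise h (and ignoring hash collisions), the minima of A and B coincide exactly when the minimum of A ∪ B falls in A ∩ B, so \(\Pr[\min h(A)=\min h(B)]=(1\pm\varepsilon)\,|A\cap B|/|A\cup B|\). The names `make_hash` and `estimate_jaccard` are hypothetical, and salted BLAKE2b is only a stand-in for a random hash function; it is not the scheme analysed in this paper.

```python
import hashlib


def make_hash(i):
    """Return the i-th hash function (assumption: salted BLAKE2b as a stand-in)."""
    salt = i.to_bytes(8, "little")

    def h(x):
        # Keys are assumed to be non-negative integers below 2**64.
        digest = hashlib.blake2b(
            x.to_bytes(8, "little"), digest_size=8, salt=salt
        ).digest()
        return int.from_bytes(digest, "little")

    return h


def estimate_jaccard(A, B, hashes):
    """Fraction of hash functions under which A and B share the same minimum."""
    agree = sum(1 for h in hashes if min(map(h, A)) == min(map(h, B)))
    return agree / len(hashes)


if __name__ == "__main__":
    hashes = [make_hash(i) for i in range(200)]
    A = set(range(0, 100))
    B = set(range(50, 150))  # true Jaccard similarity is 50/150 = 1/3
    print(estimate_jaccard(A, B, hashes))
```

The estimate is an average of 0/1 indicators, so more hash functions reduce its variance, while the bias ε of each individual hash function is not averaged away.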

Hashing from a universe [u], the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA’13] makes c = O(1) lookups in tables of size \(u^{1/c}\). Twisted tabulation was invented to get good concentration for hashing-based sampling. Here we show that twisted tabulation yields \(\tilde O(1/u^{1/c})\)-minwise hashing.
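
As a rough illustration of the c lookups, the following Python sketch follows the usual description of twisted tabulation: a key is split into c characters, the tables for c − 1 of them return both a hash value and a "twister" character, and the twister is XORed onto the remaining (head) character before its lookup. The function name `make_twisted_tabulation`, the parameters `c`, `char_bits`, `out_bits`, `seed`, and the choice of the lowest character as the head are assumptions made for this sketch, not details taken from the paper.

```python
import random


def make_twisted_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Build a twisted tabulation hash for keys in [0, 2**(c * char_bits))."""
    rng = random.Random(seed)  # assumption: Python's PRNG fills the tables
    sigma = 1 << char_bits     # alphabet size, u**(1/c) in the abstract
    char_mask = sigma - 1
    out_mask = (1 << out_bits) - 1

    # Head table: character -> hash value.
    head = [rng.getrandbits(out_bits) for _ in range(sigma)]
    # Tail tables: character -> (twister character, hash value), packed in one int.
    tail = [
        [rng.getrandbits(char_bits + out_bits) for _ in range(sigma)]
        for _ in range(c - 1)
    ]

    def h(x):
        acc = 0
        rest = x >> char_bits  # the c - 1 non-head characters
        for table in tail:
            acc ^= table[rest & char_mask]
            rest >>= char_bits
        twister = acc >> out_bits  # top char_bits bits of the XOR
        # Final lookup uses the head character XORed with the twister.
        return (acc & out_mask) ^ head[(x & char_mask) ^ twister]

    return h


if __name__ == "__main__":
    h = make_twisted_tabulation()
    print(h(0x12345678))
```

With the defaults above, a 32-bit key costs four lookups in tables of 256 entries each.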

In the classic independence paradigm of Wegman and Carter [FOCS’79], \(\tilde O(1/u^{1/c})\)-minwise hashing requires \(\Omega(\log u)\)-independence [Indyk SODA’99]. Pǎtraşcu and Thorup [STOC’11] had shown that simple tabulation, using the same space and number of lookups, yields \(\tilde O(1/n^{1/c})\)-minwise independence, which is good for large sets but useless for small sets. Our analysis uses some of the same methods but is much cleaner, bypassing a complicated induction argument.
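
For contrast, simple tabulation can be sketched in the same style: each of the c characters indexes its own random table and the results are XORed, with no twisting. The name `make_simple_tabulation` and its parameters are again assumptions for illustration only.

```python
import random


def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Build a simple tabulation hash for keys in [0, 2**(c * char_bits))."""
    rng = random.Random(seed)  # assumption: Python's PRNG fills the tables
    sigma = 1 << char_bits
    char_mask = sigma - 1
    tables = [
        [rng.getrandbits(out_bits) for _ in range(sigma)] for _ in range(c)
    ]

    def h(x):
        acc = 0
        for table in tables:
            # One lookup per character, combined by XOR.
            acc ^= table[x & char_mask]
            x >>= char_bits
        return acc

    return h


if __name__ == "__main__":
    h = make_simple_tabulation()
    # Four keys forming a "rectangle" in two character positions always
    # XOR to zero, so simple tabulation is not 4-independent.
    print(h(0x0101) ^ h(0x0102) ^ h(0x0201) ^ h(0x0202))
```

To see why the difference in the bounds matters, take u = 2³² and c = 4 and ignore the factors hidden by the \(\tilde O\) notation: \(1/u^{1/c} = 1/256\) for every set size, whereas for a set of n = 16 elements \(1/n^{1/c} = 1/2\).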

Research partly supported by Thorup’s Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career programme.

References

Broder, A.Z.: On the resemblance and containment of documents. In: Proc. Compression and Complexity of Sequences (SEQUENCES), pp. 21–29 (1997)
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. Journal of Computer and System Sciences 60(3), 630–659 (2000); See also STOC 1998
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Computer Networks 29, 1157–1166 (1997)
Datar, M., Muthukrishnan, S.M.: Estimating rarity and similarity over data stream windows. In: Möhring, R.H., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 323–334. Springer, Heidelberg (2002)
Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000)
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: Proc. 10th WWW, pp. 141–150 (2007)
Yang, H., Callan, J.P.: Near-duplicate detection by instance-level constrained clustering. In: Proc. 29th SIGIR, pp. 421–428 (2006)
Henzinger, M.R.: Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proc. ACM SIGIR, pp. 284–291 (2006)
Li, P., Shrivastava, A., Moore, J.L., König, A.C.: Hashing algorithms for large-scale learning. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2011)
Bachrach, Y., Herbrich, R., Porat, E.: Sketching algorithms for approximating rank correlations in collaborative filtering systems. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 344–352. Springer, Heidelberg (2009)
Bachrach, Y., Porat, E., Rosenschein, J.S.: Sketching techniques for collaborative filtering. In: Proc. 21st IJCAI, pp. 2016–2021 (2009)
Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C.: Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13(1), 64–78 (2001)
Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: Local algorithms for document fingerprinting. In: Proc. SIGMOD, pp. 76–85 (2003)
Wegman, M.N., Carter, L.: New classes and applications of hash functions. Journal of Computer and System Sciences 22(3), 265–279 (1981); See also FOCS 1979
Indyk, P.: A small approximately min-wise independent family of hash functions. Journal of Algorithms 38(1), 84–90 (2001); See also SODA 1999
Pǎtraşcu, M., Thorup, M.: On the k-independence required by linear probing and minwise independence. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 715–726. Springer, Heidelberg (2010)
Zobrist, A.L.: A new hashing method with application for game playing. Technical Report 88, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin (1970)
Pǎtraşcu, M., Thorup, M.: The power of simple tabulation-based hashing. Journal of the ACM 59(3), Article 14 (2012); Announced at STOC 2011
Pǎtraşcu, M., Thorup, M.: Twisted tabulation hashing. In: Proc. 24th ACM/SIAM Symposium on Discrete Algorithms (SODA), pp. 209–228 (2013)
Thorup, M.: Simple tabulation, fast expanders, double tabulation, and high independence. In: FOCS, pp. 90–99 (2013)
Klassen, T.Q., Woelfel, P.: Independence of tabulation-based hash classes. In: Proc. 10th Latin American Theoretical Informatics (LATIN), pp. 506–517 (2012)
Thorup, M., Zhang, Y.: Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing 41(2), 293–331 (2012); Announced at SODA 2004 and ALENEX 2010
Thorup, M.: Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In: Proc. 45th ACM Symposium on Theory of Computing, STOC (2013)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008)
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 807–814 (2007)

Author information

Authors and Affiliations

University of Copenhagen, Denmark
Søren Dahlgaard & Mikkel Thorup

Editor information

Editors and Affiliations

Tepper School of Business, Carnegie Mellon University, 15213, Pittsburgh, PA, USA
R. Ravi
DTU Informatics, 2800, Kongens Lyngby, Denmark
Inge Li Gørtz

Cite this paper

Dahlgaard, S., Thorup, M. (2014). Approximately Minwise Independence with Twisted Tabulation. In: Ravi, R., Gørtz, I.L. (eds) Algorithm Theory – SWAT 2014. SWAT 2014. Lecture Notes in Computer Science, vol 8503. Springer, Cham. https://doi.org/10.1007/978-3-319-08404-6_12

DOI: https://doi.org/10.1007/978-3-319-08404-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08403-9
Online ISBN: 978-3-319-08404-6
eBook Packages: Computer Science, Computer Science (R0)