Abstract
A random hash function h is ε-minwise if for any set S, |S| = n, and element x ∈ S, \(\Pr[h(x)=\min h(S)]=(1\pm\varepsilon )/n\). Minwise hash functions with low bias ε have widespread applications within similarity estimation.
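As an illustration of the similarity-estimation use case, the sketch below estimates the Jaccard similarity of two sets as the fraction of hash functions under which both sets have the same minimum element. The names `make_hash` and `estimate_jaccard` are hypothetical, and the lazily filled random table is a stand-in for a truly random hash function, not a practical scheme.

```python
import random

def make_hash(seed, bits=64):
    """Stand-in for a truly random hash function (illustration only):
    a lazily filled table of random values, fixed by the seed."""
    rng = random.Random(seed)
    table = {}
    def h(x):
        if x not in table:
            table[x] = rng.getrandbits(bits)
        return table[x]
    return h

def estimate_jaccard(a, b, k=200):
    """Fraction of k hash functions under which the minimum-hash element
    of a is also the minimum-hash element of b."""
    agree = 0
    for seed in range(k):
        h = make_hash(seed)
        if min(a, key=h) == min(b, key=h):
            agree += 1
    return agree / k

if __name__ == "__main__":
    A = set(range(0, 100))
    B = set(range(50, 150))
    # The true Jaccard similarity is 50/150; the estimate should be close.
    print(estimate_jaccard(A, B))
```

With an ε-minwise hash function in place of the stand-in, each element of the union is the minimum with probability within a factor (1 ± ε) of uniform, so the bias of the estimate is governed by ε.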
Hashing from a universe [u], the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA’13] makes c = O(1) lookups in tables of size \(u^{1/c}\). Twisted tabulation was invented to get good concentration for hashing-based sampling. Here we show that twisted tabulation yields \(\tilde O(1/u^{1/c})\)-minwise hashing.
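The following is a minimal sketch of a twisted-tabulation-style hash function, written from the general description of the scheme rather than taken from the paper. It assumes 32-bit keys split into c = 4 characters of 8 bits, so each table has \(u^{1/c} = 256\) entries. The function name, parameters, and bit layout (twister bits stored in the low bits of each table entry) are illustrative assumptions.

```python
import random

def make_twisted_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Return a hash function from c*char_bits-bit keys to out_bits-bit values,
    using c lookups in tables of 2**char_bits entries each."""
    rng = random.Random(seed)
    sigma = 1 << char_bits   # alphabet size, u^(1/c)
    mask = sigma - 1
    # Each entry carries out_bits of hash value in its high bits and
    # char_bits of random "twister" in its low bits.
    tables = [[rng.getrandbits(out_bits + char_bits) for _ in range(sigma)]
              for _ in range(c)]

    def h(x):
        acc = 0
        # Look up the first c-1 characters and XOR the entries together.
        for i in range(c - 1):
            acc ^= tables[i][x & mask]
            x >>= char_bits
        # Twist the remaining character with the accumulated twister bits.
        twisted = (x ^ acc) & mask
        acc ^= tables[c - 1][twisted]
        # Discard the twister bits and return the hash value.
        return acc >> char_bits

    return h

if __name__ == "__main__":
    h = make_twisted_tabulation(seed=1)
    S = [3, 1415, 92653, 58979323]
    # Element of S with the smallest hash value, as used in minwise hashing.
    print(min(S, key=h))
```

The only difference from simple tabulation in this sketch is the twist step: the last character is XORed with random bits derived from the other characters before its table lookup.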
In the classic independence paradigm of Wegman and Carter [FOCS’79], \(\tilde O(1/u^{1/c})\)-minwise hashing requires Ω(log u)-independence [Indyk SODA’99]. Pǎtraşcu and Thorup [STOC’11] had shown that simple tabulation, using the same space and lookups, yields \(\tilde O(1/n^{1/c})\)-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner, bypassing a complicated induction argument.
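To make the gap between the two bounds concrete, take as an illustrative assumption \(u = 2^{32}\), \(c = 4\), and a small set of size \(n = 16\), and ignore the factors hidden by the \(\tilde O\) notation:

\[
\frac{1}{u^{1/c}} = \frac{1}{2^{8}} = \frac{1}{256},
\qquad
\frac{1}{n^{1/c}} = \frac{1}{16^{1/4}} = \frac{1}{2}.
\]

Under these assumptions the bound in terms of u stays small regardless of n, while the bound in terms of n is a constant for such a small set.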
Research partly supported by Thorup’s Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career programme.
References
Broder, A.Z.: On the resemblance and containment of documents. In: Proc. Compression and Complexity of Sequences (SEQUENCES), pp. 21–29 (1997)
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. Journal of Computer and System Sciences 60(3), 630–659 (2000); See also STOC 1998
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Computer Networks 29, 1157–1166 (1997)
Datar, M., Muthukrishnan, S.M.: Estimating rarity and similarity over data stream windows. In: Möhring, R.H., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 323–334. Springer, Heidelberg (2002)
Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000)
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: Proc. 10th WWW, pp. 141–150 (2007)
Yang, H., Callan, J.P.: Near-duplicate detection by instance-level constrained clustering. In: Proc. 29th SIGIR, pp. 421–428 (2006)
Henzinger, M.R.: Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proc. ACM SIGIR, pp. 284–291 (2006)
Li, P., Shrivastava, A., Moore, J.L., König, A.C.: Hashing algorithms for large-scale learning. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2011)
Bachrach, Y., Herbrich, R., Porat, E.: Sketching algorithms for approximating rank correlations in collaborative filtering systems. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 344–352. Springer, Heidelberg (2009)
Bachrach, Y., Porat, E., Rosenschein, J.S.: Sketching techniques for collaborative filtering. In: Proc. 21st IJCAI, pp. 2016–2021 (2009)
Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C.: Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13(1), 64–78 (2001)
Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: Local algorithms for document fingerprinting. In: Proc. SIGMOD, pp. 76–85 (2003)
Wegman, M.N., Carter, L.: New classes and applications of hash functions. Journal of Computer and System Sciences 22(3), 265–279 (1981); See also FOCS 1979
Indyk, P.: A small approximately min-wise independent family of hash functions. Journal of Algorithms 38(1), 84–90 (2001); See also SODA 1999
Pǎtraşcu, M., Thorup, M.: On the k-independence required by linear probing and minwise independence. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 715–726. Springer, Heidelberg (2010)
Zobrist, A.L.: A new hashing method with application for game playing. Technical Report 88, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin (1970)
Pǎtraşcu, M., Thorup, M.: The power of simple tabulation-based hashing. Journal of the ACM 59(3), Article 14 (2012); Announced at STOC 2011
Pǎtraşcu, M., Thorup, M.: Twisted tabulation hashing. In: Proc. 24th ACM/SIAM Symposium on Discrete Algorithms (SODA), pp. 209–228 (2013)
Thorup, M.: Simple tabulation, fast expanders, double tabulation, and high independence. In: FOCS, pp. 90–99 (2013)
Klassen, T.Q., Woelfel, P.: Independence of tabulation-based hash classes. In: Proc. 10th Latin American Theoretical Informatics (LATIN), pp. 506–517 (2012)
Thorup, M., Zhang, Y.: Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing 41(2), 293–331 (2012); Announced at SODA 2004 and ALENEX 2010
Thorup, M.: Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In: Proc. 45th ACM Symposium on Theory of Computing, STOC (2013)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008)
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 807–814 (2007)
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Dahlgaard, S., Thorup, M. (2014). Approximately Minwise Independence with Twisted Tabulation. In: Ravi, R., Gørtz, I.L. (eds) Algorithm Theory – SWAT 2014. SWAT 2014. Lecture Notes in Computer Science, vol 8503. Springer, Cham. https://doi.org/10.1007/978-3-319-08404-6_12
DOI: https://doi.org/10.1007/978-3-319-08404-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08403-9
Online ISBN: 978-3-319-08404-6
eBook Packages: Computer Science, Computer Science (R0)