COCA Filters: Co-occurrence Aware Bloom Filters

Tirdad, Kamran; Ghodsnia, Pedram; Munro, J. Ian; López-Ortiz, Alejandro

doi:10.1007/978-3-642-24583-1_31

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7024)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

We propose an indexing data structure based on a novel variation of Bloom filters. Signature files have been proposed in the past as a method for indexing large text databases, but they suffer from a high false positive rate. In this paper we introduce COCA Filters, a new type of Bloom filter that exploits the co-occurrence probability of words in documents to reduce the false positive error. We show experimentally that this technique can reduce the false positive error by a factor of up to 21.6 for the same index size. Furthermore, Bloom filters can be replaced by COCA Filters wherever the co-occurrence of any two members of the universe is identifiable.
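
For readers unfamiliar with the baseline, the following is a minimal sketch of a standard Bloom filter used as a per-document word signature, together with the classical approximation of its false positive rate. It is not the COCA Filter construction, which the abstract does not spell out. The class name `BloomFilter`, the double-hashing scheme built on SHA-256, the function `expected_false_positive_rate`, and the parameter values are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, rounded up to whole bytes

    def _positions(self, word: str) -> list:
        # Assumed hashing scheme: derive k bit positions from two 64-bit
        # halves of a single SHA-256 digest (double hashing).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, word: str) -> None:
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word: str) -> bool:
        # True for every added word; may also be True for a word that was
        # never added (a false positive). Never False for an added word.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(word))


def expected_false_positive_rate(m: int, k: int, n: int) -> float:
    """Classical approximation (1 - e^(-k*n/m))^k for n inserted items."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Hypothetical usage: one signature per document.
document_words = ["bloom", "filter", "signature", "index"]
signature = BloomFilter(m=256, k=4)
for w in document_words:
    signature.add(w)
```

A query word is then tested with `word in signature`. A positive answer only means the document may contain the word, which is the false positive error that the abstract says COCA Filters reduce for the same index size.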

This work was supported by NSERC of Canada and the Canada Research Chairs program.

Author information

Authors and Affiliations

Cheriton School of Computer Science, University of Waterloo, Canada
Kamran Tirdad, Pedram Ghodsnia, J. Ian Munro & Alejandro López-Ortiz

Editor information

Editors and Affiliations

Università di Pisa, Italy
Roberto Grossi
Consiglio Nazionale delle Ricerche, Area della Ricerca di Pisa, Istituto di Scienza e Tecnologia dell’Informazione “Alessandro Faedo”, Via Giuseppe Moruzzi 1, 56124, Pisa, Italy
Fabrizio Sebastiani & Fabrizio Silvestri

Cite this paper

Tirdad, K., Ghodsnia, P., Munro, J.I., López-Ortiz, A. (2011). COCA Filters: Co-occurrence Aware Bloom Filters. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_31

DOI: https://doi.org/10.1007/978-3-642-24583-1_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24582-4
Online ISBN: 978-3-642-24583-1
eBook Packages: Computer Science, Computer Science (R0)
