COCA Filters: Co-occurrence Aware Bloom Filters

  • Kamran Tirdad
  • Pedram Ghodsnia
  • J. Ian Munro
  • Alejandro López-Ortiz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7024)

Abstract

We propose an indexing data structure based on a novel variation of Bloom filters. Signature files have been proposed in the past as a method to index large text databases, though they suffer from a high false positive rate. In this paper we introduce COCA Filters, a new type of Bloom filter that exploits the co-occurrence probability of words in documents to reduce the false positive error. We show experimentally that this technique can reduce the false positive error by a factor of up to 21.6 for the same index size. Furthermore, COCA filters can replace Bloom filters wherever the co-occurrence of any two members of the universe is identifiable.
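For orientation, the baseline that the abstract compares against can be sketched as a plain Bloom filter used as a per-document signature. This is a minimal illustration of a standard Bloom filter only, not the COCA construction, whose details are not given in the abstract. The class name `BloomFilter`, the salted SHA-256 hashing, and the parameters `m` (bits) and `k` (hash functions) are assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter (baseline sketch, not the COCA filter)."""

    def __init__(self, m, k):
        self.m = m                  # number of bits in the signature
        self.k = k                  # number of hash functions per word
        self.bits = [False] * m

    def _positions(self, word):
        # Derive k bit positions from salted SHA-256 digests (an assumption;
        # any family of independent hash functions would do).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p] = True

    def might_contain(self, word):
        # True for every added word; may also be True for a word that was
        # never added (a false positive).
        return all(self.bits[p] for p in self._positions(word))

    def fill_ratio(self):
        return sum(self.bits) / self.m

    def estimated_false_positive_rate(self):
        # A non-member is reported present when all k probed bits are set.
        return self.fill_ratio() ** self.k


# Build the signature of one small "document".
words = ["bloom", "filter", "signature", "file"]
signature = BloomFilter(m=64, k=3)
for w in words:
    signature.add(w)

print(signature.might_contain("bloom"))
print(signature.fill_ratio(), signature.estimated_false_positive_rate())
```

The estimate `fill_ratio ** k` shows why the number of set bits matters: for a fixed index size, any scheme that sets fewer distinct bits for the same set of words lowers the chance that a non-member's probes all hit set bits.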

Keywords

Information Retrieval; Bloom Filters; Signature Files; Locality Sensitive Hash Functions

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Kamran Tirdad (1)
  • Pedram Ghodsnia (1)
  • J. Ian Munro (1)
  • Alejandro López-Ortiz (1)
  1. Cheriton School of Computer Science, University of Waterloo, Canada