Compact Set Representation for Information Retrieval

  • J. Shane Culpepper
  • Alistair Moffat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4726)

Abstract

Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists effectively, in a compressed form, and the desire to carry out intersection operations efficiently, using non-sequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best non-sequential array-based intersection algorithms. By adding a simple, low-cost, auxiliary index, we show that compressed storage need not hinder efficient and high-speed intersection operations.
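As a rough illustration of the idea in the abstract, the sketch below stores a sorted document list as differences between consecutive values (d-gaps) and adds a small auxiliary index that samples every k-th value, so that a membership probe can jump to the right block instead of decoding from the start. The abstract does not describe the layout of the auxiliary index, so the sampling scheme, the names `GapList` and `intersect`, and the parameter `k` are assumptions made for this example. The gaps are held in a plain Python list where a real system would use a bit-level code.

```python
from bisect import bisect_right


class GapList:
    """Sorted set of non-negative integers stored as d-gaps, plus a
    sampled auxiliary index (a hypothetical layout, for illustration)."""

    def __init__(self, values, k=4):
        self.k = k
        self.gaps = []     # gaps[i] = values[i] - values[i-1]; gaps[0] = values[0]
        self.samples = []  # samples[j] = values[j * k], kept uncompressed
        prev = 0
        for i, v in enumerate(values):
            if i % k == 0:
                self.samples.append(v)
            self.gaps.append(v - prev)
            prev = v

    def contains(self, x):
        # Binary search the auxiliary index for the last block
        # whose first value is <= x.
        j = bisect_right(self.samples, x) - 1
        if j < 0:
            return False
        # Decode sequentially, but only inside that one block.
        value = self.samples[j]
        start = j * self.k
        end = min(start + self.k, len(self.gaps))
        for i in range(start + 1, end):
            if value >= x:
                break
            value += self.gaps[i]
        return value == x


def intersect(small, large):
    """Return the elements of the sorted list `small` that occur in `large`."""
    return [x for x in small if large.contains(x)]


docs_a = [2, 4, 8, 34, 100, 144]
docs_b = GapList([1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144], k=4)
print(intersect(docs_a, docs_b))
```

Each probe costs one binary search over the samples plus at most `k - 1` gap additions, so `k` trades the size of the auxiliary index against decoding work per lookup.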

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • J. Shane Culpepper (1)
  • Alistair Moffat (1)
  1. NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia
