Indexing Shared Content in Information Retrieval Systems

  • Andrei Z. Broder
  • Nadav Eiron
  • Marcus Fontoura
  • Michael Herscovici
  • Ronny Lempel
  • John McPherson
  • Runping Qi
  • Eugene Shekita
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3896)

Abstract

Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
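The tree representation described in the abstract can be illustrated with a small sketch. This is not the encoding from the paper: the `Node` class, the split into `shared` and `private` text, the two dictionaries, and the rule that a node's shared text is visible in all of its descendants are assumptions made here for illustration only. The sketch shows the basic idea that shared text receives one posting, at the node where it first appears, while a query still matches every document that inherits it.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    """One document in a tree of related documents (hypothetical structure)."""
    doc_id: int
    private: str = ""   # text that belongs to this document only
    shared: str = ""    # text assumed to be inherited by all descendants
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


Index = Dict[str, Set[int]]


def build_index(root: Node) -> Tuple[Index, Index, Dict[int, Node]]:
    """Index each piece of text once, in a shared or a private posting map."""
    shared_idx: Index = defaultdict(set)
    private_idx: Index = defaultdict(set)
    nodes: Dict[int, Node] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        nodes[node.doc_id] = node
        for term in node.shared.lower().split():
            shared_idx[term].add(node.doc_id)
        for term in node.private.lower().split():
            private_idx[term].add(node.doc_id)
        stack.extend(node.children)
    return shared_idx, private_idx, nodes


def term_matches(term: str, node: Node, shared_idx: Index, private_idx: Index) -> bool:
    """A term matches if it is private to the node or shared by the node or an ancestor."""
    if node.doc_id in private_idx.get(term, ()):
        return True
    current: Optional[Node] = node
    while current is not None:
        if current.doc_id in shared_idx.get(term, ()):
            return True
        current = current.parent
    return False


def search(query: str, shared_idx: Index, private_idx: Index,
           nodes: Dict[int, Node]) -> List[int]:
    """Return the ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    return sorted(
        doc_id for doc_id, node in nodes.items()
        if all(term_matches(t, node, shared_idx, private_idx) for t in terms)
    )


# A toy email thread: the original message is quoted in both replies,
# so its text is stored once as shared content of the root.
thread = Node(1, shared="budget meeting friday")
thread.add_child(Node(2, private="sounds good"))
thread.add_child(Node(3, private="reschedule monday"))

shared_idx, private_idx, nodes = build_index(thread)
print(search("budget", shared_idx, private_idx, nodes))
print(search("budget monday", shared_idx, private_idx, nodes))
```

In this toy thread the term "budget" has a single posting (for document 1), yet a query for it can return all three messages. A real index would replace the ancestor walk with an encoding suited to posting-list traversal; the walk is used here only to keep the sketch short.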

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Andrei Z. Broder (1)
  • Nadav Eiron (2)
  • Marcus Fontoura (1)
  • Michael Herscovici (3)
  • Ronny Lempel (3)
  • John McPherson (4)
  • Runping Qi (1)
  • Eugene Shekita (5)

  1. Yahoo! Inc
  2. Google Inc
  3. IBM Haifa Research Lab
  4. IBM Silicon Valley Lab
  5. IBM Almaden Research Center
