Advances in Database Technology - EDBT 2006
Volume 3896 of the series Lecture Notes in Computer Science pp 313-330
Indexing Shared Content in Information Retrieval Systems
- Andrei Z. Broder (Yahoo! Inc)
- Nadav Eiron (Google Inc)
- Marcus Fontoura (Yahoo! Inc)
- Michael Herscovici (IBM Haifa Research Lab)
- Ronny Lempel (IBM Haifa Research Lab)
- John McPherson (IBM Silicon Valley Lab)
- Runping Qi (Yahoo! Inc)
- Eugene Shekita (IBM Almaden Research Center)
Abstract
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
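To make the idea in the abstract concrete, here is a minimal sketch of indexing shared content once in a document tree. It is not the encoding or the query algorithms from the paper, which are not reproduced on this page. It assumes, purely for illustration, that each node's text is inherited by all of its descendants (as when an email reply quotes the message it answers), and that a document matches a query when every term occurs somewhere on its root-to-node path. The class `SharedContentIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class SharedContentIndex:
    """Toy inverted index over a tree of documents (illustrative only)."""

    def __init__(self):
        self.parent = {}                  # doc id -> parent doc id (None for a root)
        self.postings = defaultdict(set)  # term -> ids of nodes whose OWN text has the term

    def add(self, doc_id, text, parent=None):
        # Assumption: parents are added before their children.
        # Only the text that is new at this node is indexed; inherited
        # text is already indexed at an ancestor.
        self.parent[doc_id] = parent
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def _path(self, doc_id):
        # Yield doc_id, then each ancestor up to the root.
        while doc_id is not None:
            yield doc_id
            doc_id = self.parent[doc_id]

    def search(self, *terms):
        # A document matches when every term occurs at the document itself
        # or at one of its ancestors.
        terms = [t.lower() for t in terms]
        return sorted(
            d for d in self.parent
            if all(
                any(n in self.postings.get(t, ()) for n in self._path(d))
                for t in terms
            )
        )

    def posting_count(self):
        # Total number of (term, node) postings stored.
        return sum(len(ids) for ids in self.postings.values())


idx = SharedContentIndex()
idx.add("m1", "budget meeting moved to friday")
idx.add("m2", "works for me", parent="m1")              # reply quoting m1
idx.add("m3", "can we do monday instead", parent="m1")  # another reply quoting m1

print(idx.search("friday"))            # every message in the thread contains m1's text
print(idx.search("monday", "budget"))  # only m3 has both terms on its path
print(idx.posting_count())             # the quoted text of m1 is stored once
```

In this toy thread the quoted text of `m1` is indexed once rather than once per reply, which is the kind of saving the abstract refers to. The linear scan over all documents in `search` is chosen for brevity and says nothing about how the paper evaluates queries.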
- Title: Indexing Shared Content in Information Retrieval Systems
- Book Title: Advances in Database Technology - EDBT 2006
- Book Subtitle: 10th International Conference on Extending Database Technology, Munich, Germany, March 26-31, 2006
- Pages: pp 313-330
- Copyright: 2006
- DOI: 10.1007/11687238_21
- Print ISBN: 978-3-540-32960-2
- Online ISBN: 978-3-540-32961-9
- Series Title: Lecture Notes in Computer Science
- Series Volume: 3896
- Series ISSN: 0302-9743
- Publisher: Springer Berlin Heidelberg
- Copyright Holder: Springer-Verlag Berlin Heidelberg
- Editors
- Yannis Ioannidis (16)
- Marc H. Scholl (17)
- Joachim W. Schmidt (18)
- Florian Matthes (19)
- Mike Hatzopoulos (20)
- Klemens Boehm (21)
- Alfons Kemper (22)
- Torsten Grust (23)
- Christian Boehm (24)
- Editor Affiliations
- 16. University of Athens
- 17. University of Konstanz
- 18. Sustainable Content Logistics Centre
- 19. Chair of Software Engineering for Business Information Systems, Technische Universität München
- 20. Department of Informatics, University of Athens Panepistimiopolis
- 21. IPD, Universität Karlsruhe
- 22. TU München
- 23. Technische Universität München
- 24. Institute for Computer Science, Ludwig-Maximilians Universität München
- Authors
- Andrei Z. Broder (25)
- Nadav Eiron (26)
- Marcus Fontoura (25)
- Michael Herscovici (27)
- Ronny Lempel (27)
- John McPherson (28)
- Runping Qi (25)
- Eugene Shekita (29)
- Author Affiliations
- 25. Yahoo! Inc
- 26. Google Inc
- 27. IBM Haifa Research Lab
- 28. IBM Silicon Valley Lab
- 29. IBM Almaden Research Center