Partial Collection Replication for Information Retrieval

Lu, Zhihong; McKinley, Kathryn S.

doi:10.1023/A:1023947204209

Published: April 2003

Information Retrieval, Volume 6, pages 159–198 (2003)

Zhihong Lu and Kathryn S. McKinley

Abstract

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from one or more larger collections and the corresponding inference network search mechanism. For each query, the distributed system determines whether a partial replica is a good match and, if so, searches it; otherwise, it searches the original collection. We demonstrate the scenarios in which partial replication performs better than systems that use caches, which store only previous query and answer pairs. We first use logs from THOMAS and Excite to examine query locality using query similarity versus exact match. We show that searching replicas can improve locality (from 3 to 19%) over the exact match required by caching. Replicas increase locality because they satisfy queries that are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system, comparing it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show that a hybrid system with caches and replicas performs better than either on its own.
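The difference between exact-match locality (what a query cache can exploit) and answer-based locality (what a partial replica can exploit) can be illustrated with a small sketch. This is not the inference network replica selection function from the paper: the term-overlap scoring, the toy index, the query log, and the names `top_k` and `locality` are all assumptions made for illustration. A query counts as a cache hit if the identical string appeared earlier in the log, and as a replica hit if all of its top-ranked documents are held in the replica.

```python
from typing import Dict, List, Set, Tuple

def top_k(index: Dict[str, List[str]], query: str, k: int) -> List[str]:
    """Toy ranking (assumption): score = number of query terms a document contains."""
    scores: Dict[str, int] = {}
    for term in query.lower().split():
        for doc in index.get(term, []):
            scores[doc] = scores.get(doc, 0) + 1
    # Sort by descending score, then by document id so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

def locality(log: List[str], index: Dict[str, List[str]],
             replica_docs: Set[str], k: int = 2) -> Tuple[float, float]:
    """Return (exact-match hit rate, replica hit rate) over a query log."""
    seen: Set[str] = set()
    cache_hits = 0
    replica_hits = 0
    for query in log:
        # A cache of query/answer pairs only helps on an identical earlier query.
        if query in seen:
            cache_hits += 1
        seen.add(query)
        # A replica helps whenever it holds every top-k answer for the query.
        answers = top_k(index, query, k)
        if answers and set(answers) <= replica_docs:
            replica_hits += 1
    return cache_hits / len(log), replica_hits / len(log)

# Hypothetical inverted index: term -> documents containing it.
index = {
    "tax": ["d1", "d2"],
    "reform": ["d1", "d3"],
    "bill": ["d1", "d2"],
    "health": ["d4"],
    "care": ["d4", "d5"],
}
replica_docs = {"d1", "d2"}  # hypothetical partial replica
log = ["tax reform", "tax bill", "tax reform", "health care"]

cache_rate, replica_rate = locality(log, index, replica_docs)
print(f"exact-match locality: {cache_rate:.2f}, replica locality: {replica_rate:.2f}")
```

In this toy log, "tax bill" never repeats, so a cache cannot serve it, yet its top answers are the same documents that satisfy "tax reform" and both sit in the replica. That is the kind of distinct-query, same-answer overlap the abstract credits for the locality gain.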

Author information

Authors and Affiliations

AT&T Laboratories, 200 Laurel Avenue, Middletown, New Jersey, 07748, USA
Zhihong Lu
Department of Computer Sciences, University of Texas, Austin, Texas, 78712, USA
Kathryn S. McKinley

About this article

Lu, Z., McKinley, K.S. Partial Collection Replication for Information Retrieval. Information Retrieval 6, 159–198 (2003). https://doi.org/10.1023/A:1023947204209

Issue Date: April 2003
DOI: https://doi.org/10.1023/A:1023947204209
