Abstract
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from larger collection(s) and the corresponding inference network search mechanism. For each query, the distributed system determines whether a partial replica is a good match and, if so, searches it; otherwise it searches the original collection. We demonstrate the scenarios where partial replication performs better than systems that use caches that store only previous query and answer pairs. We first use logs from THOMAS and Excite to examine query locality using query similarity versus exact match. We show that searching replicas can improve locality (from 3 to 19%) over the exact match required by caching. Replicas increase locality because they satisfy queries that are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system, comparing it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.
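The per-query routing decision described in the abstract can be illustrated with a minimal sketch. It is not the paper's inference network replica selection function: the `Collection` class, the `match_score` term-overlap heuristic, the `route` function, and the `threshold` parameter are all hypothetical stand-ins chosen only to show the control flow of "search a matching replica, otherwise fall back to the original collection".

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """A searchable document set: the original collection or a partial replica."""
    name: str
    docs: dict[str, str]  # document id -> text

    def search(self, query: str) -> list[str]:
        # Toy ranking (assumption): order documents by number of matching query terms.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), doc_id)
            for doc_id, text in self.docs.items()
        ]
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for score, doc_id in ranked if score > 0]

    def match_score(self, query: str) -> float:
        # Hypothetical stand-in for a replica selection function:
        # the fraction of query terms that occur anywhere in this collection.
        terms = set(query.lower().split())
        vocab = {word for text in self.docs.values() for word in text.lower().split()}
        return len(terms & vocab) / len(terms) if terms else 0.0


def route(query: str, replicas: list[Collection], original: Collection,
          threshold: float = 1.0) -> tuple[str, list[str]]:
    """Search the first replica that matches well enough, else the original."""
    for replica in replicas:
        if replica.match_score(query) >= threshold:
            return replica.name, replica.search(query)
    return original.name, original.search(query)


original = Collection("original", {
    "d1": "senate budget bill",
    "d2": "house tax reform",
    "d3": "water quality act",
})
replica = Collection("replica", {"d1": "senate budget bill"})

print(route("budget bill", [replica], original))
print(route("bill budget", [replica], original))
print(route("tax reform", [replica], original))
```

In this toy setup, "budget bill" and "bill budget" are distinct query strings, so an exact-match cache keyed on the first would miss the second, yet both are routed to the same replica and return the same answer. That is the kind of extra locality the abstract attributes to replicas.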
References
Baentsch M, Molter G and Sturm P (1996) Introducing application-level replication and naming into today's Web. In: Proceedings of Fifth International World Wide Web Conference. Paris, France; Available at http://www5conf.inria.fr/fich_html/papers/P3/Overview.html. Computer Networks and ISDN Systems, 28(7–11):921–930.
Bell D and Grimson J (1992) Distributed Database Systems. Addison-Wesley Publishers.
Bestavros A (1995) Demand-based document dissemination to reduce traffic and balance load in distributed information systems. In: Proceedings of SPDP'95: The 7th IEEE Symposium on Parallel and Distributed Processing, San Antonio, Texas, pp. 338–345.
Brin S and Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of WWW7, Brisbane, Australia, pp. 107–117. (www7.scu.edu.au/programme/fullpapers/1921/com1921.htm)
Brown EW and Chong HA (1997) The GURU system in TREC-6. In: Proceedings of the Sixth Text REtrieval Conference (TREC-6), Gaithersburg, MD, pp. 535–540.
Burkowski FJ (1990) Retrieval performance of a distributed text database utilizing a parallel process document server. In: 1990 International Symposium on Databases in Parallel and Distributed Systems, Trinity College, Dublin, Ireland, pp. 71–79.
Burkowski F, Cormack G, Clarke C and Good R (1995) A global search architecture (Tech. Rep. No. CS-95-12). Waterloo, Canada: Department of Computer Science, University of Waterloo.
Cahoon B and McKinley KS (1996) Performance evaluation of a distributed architecture for information retrieval. In: Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, pp. 110–118.
Cahoon B, McKinley KS and Lu Z (2000) Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems, 18(1):1–43.
Callan JP, Croft WB and Broglio J (1995) TREC and TIPSTER experiments with INQUERY. Information Processing & Management, 31(3):327–343.
Callan JP, Croft WB and Harding SM (1992) The INQUERY retrieval system. In: Proceedings of the 3rd International Conference on Database and Expert System Applications, Valencia, Spain, pp. 78–93.
Callan JP, Lu Z and Croft WB (1995) Searching distributed collections with inference networks. In: Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, pp. 21–29.
Chakravarthy A and Haase K (1995) Netserf: Using semantic knowledge to find internet information archives. In: Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, pp. 4–11.
Couvreur TR, Benzel RN, Miller SF, Zeitler DN, Lee DL, Singhai M, Shivaratri N and Wong WYP (1994) An analysis of performance and cost factors in searching large text databases using parallel search systems. Journal of the American Society for Information Science, 45(7):443–464.
Croft WB, Cook R and Wilder D (1995) Providing government information on the Internet: Experiences with THOMAS. In: The Second International Conference on the Theory and Practice of Digital Libraries, Austin, TX, pp. 19–24.
Danzig PB, Ahn J, Noll J and Obraczka K (1991) Distributed indexing: A scalable mechanism for distributed information retrieval. In: Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, pp. 221–229.
DeWitt D, Graefe G, Kumar KB, Gerber RH, Heytens ML and Muralikrishna M (1986) GAMMA—A high performance dataflow database machine. In: Proceedings of the Twelfth International Conference on Very Large Data Bases, Kyoto, Japan, pp. 228–237.
DeWitt D and Gray J (1992) Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85–98.
Excite (1997) http://www.excite.com.
French JC, Powell AL, Callan J, Viles CL, Emmitt T, Prey KJ and Mou Y (1999) Comparing the performance of database selection algorithms. In: Proceedings of the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, pp. 238–245.
Fuhr N (1999) A decision-theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 17(3):229–249.
Gravano L and Garcia-Molina H (1995) Generalizing GLOSS to vector-space databases and broker hierarchies. In: Proceedings of the Twenty First International Conference on Very Large Data Bases, Zurich, Switzerland, pp. 78–89.
Gravano L, Garcia-Molina H and Tomasic A (1994) The effectiveness of GLOSS for the text database discovery problem. In: Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, pp. 126–137.
Hagmann RB and Ferrari D (1986) Performance analysis of several backend database architectures. ACM Transactions on Database Systems, 11(1):1–26.
Harman D (Ed.) (1997) The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology Special Publication.
Harman D, McCoy W, Toense R and Candela G (1991) Prototyping a distributed information retrieval system that uses statistical ranking. Information Processing & Management, 27(5):449–460.
Hawking D (1997) Scalable text retrieval for large digital libraries. In: First European Conference on Research and Advanced Technology for Digital Libraries, Pisa, Italy: Springer, pp. 127–145.
Hawking D, Craswell N and Thistlewaite P (1998) Overview of TREC-7 very large collection track. In: Proceedings of the Seventh Text REtrieval Conference (TREC-7), Gaithersburg, MD, pp. 91–104.
Hawking D and Thistlewaite P (1997) Overview of the TREC-6 very large collection track. In: Proceedings of the Sixth Text REtrieval Conference (TREC-6), Gaithersburg, MD, pp. 93–106.
Holmedahl V, Smith B and Yu T (1998) Cooperative caching of dynamic content on a distributed web server. In: Proceedings of HPDC-7, Chicago, IL, pp. 243–250.
Katz E, Butler M and McGrath R (1994) A scalable HTTP server: the NCSA prototype. Computer Networks and ISDN Systems, 27(2):155–164.
Lu Z (1999) Scalable distributed architectures for information retrieval. Unpublished doctoral dissertation, University of Massachusetts at Amherst.
Lu Z and McKinley KS (1999) Partial replica selection based on relevance for information retrieval. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, pp. 97–104.
Lu Z and McKinley KS (2000) Partial replica selection versus caching for information retrieval. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, pp. 248–255.
Lu Z, McKinley KS and Cahoon B (1998) The hardware/software balancing act for information retrieval on symmetric multiprocessors. In: Proceedings of Europar'98. Southampton, U.K.
Mackert LF and Lohman GM (1986) R* optimizer validation and performance evaluation for distributed queries. In: Proceedings of the Twelfth International Conference on Very Large Data Bases, Kyoto, Japan, pp. 149–159.
Markatos EP (1999) On caching search engine results (Tech. Rep. No. 241). Institute of Computer Science (ICS) Foundation for Research & Technology-Hellas (FORTH), Greece.
Martin TP, Macleod IA, Russell JI, Lesse K and Foster B (1990) A case study of caching strategies for a distributed full text retrieval system. Information Processing & Management, 26(2):227–247.
Martin TP and Russell JI (1991) Data caching strategies for distributed full text retrieval systems. Information Systems, 16(1):1–11.
Saraiva PC, Moura ES de, Ziviani N, Meira W, Fonseca R and Ribeiro-Neto B (2001) Rank-preserving two-level caching for scalable search engines. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, pp. 51–58.
Simpson P and Alonso R (1987) Data caching in information retrieval systems. In: Proceedings of the Tenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, pp. 296–305.
Stonebraker M, Woodfill J, Ranstrom J, Kalash J, Arnold K and Anderson E (1983) Performance analysis of distributed data base systems. In: Proceedings of the Third Symposium on Reliability in Distributed Software and Database Systems, Clearwater Beach, FL, pp. 135–138.
THOMAS (1998) Legislative information on the Internet. http://thomas.loc.gov.
Tomasic A and Garcia-Molina H (1992) Caching and database scaling in distributed shared-nothing information retrieval systems (Tech. Rep. No. STAN-CS-92-1456). Stanford University.
Turtle HR (1991) Inference networks for document retrieval. Unpublished doctoral dissertation, University of Massachusetts.
Voorhees EM, Gupta NK and Johnson-Laird B (1995) Learning collection fusion strategies. In: Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, pp. 172–179.
Wang J (1999) A survey of web caching schemes for the internet. Computer Communication Review, 29(5):36–46.
Xu J and Croft W (1999) Cluster-based language models for distributed retrieval. In: Proceedings of the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, pp. 254–261.
Cite this article
Lu, Z., McKinley, K.S. Partial Collection Replication for Information Retrieval. Information Retrieval 6, 159–198 (2003). https://doi.org/10.1023/A:1023947204209
DOI: https://doi.org/10.1023/A:1023947204209