Information Retrieval, Volume 10, Issue 3, pp 205–231

A pipelined architecture for distributed text query evaluation

  • Alistair Moffat
  • William Webber
  • Justin Zobel
  • Ricardo Baeza-Yates

Abstract

Two principal query-evaluation methodologies have been described for cluster-based implementation of distributed information retrieval systems: document partitioning and term partitioning. In a document-partitioned system, each of the processors hosts a subset of the documents in the collection, and executes every query against its local sub-collection. In a term-partitioned system, each of the processors hosts a subset of the inverted lists that make up the index of the collection, and serves them to a central machine as they are required for query evaluation.
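The difference between the two layouts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy collection `DOCS`, the round-robin assignment of documents and terms to nodes, the function names, and the placeholder score (a count of matched query terms) standing in for a real similarity function.

```python
from collections import defaultdict

# Toy collection: document id -> text (illustrative data, not from the paper).
DOCS = {
    0: "distributed text retrieval",
    1: "text query evaluation",
    2: "pipelined query evaluation",
    3: "distributed index",
}


def build_index(docs):
    """Build an inverted index: term -> sorted list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            index[term].append(doc_id)
    return dict(index)


def document_partition(docs, num_nodes):
    """Each node indexes only its own subset of the documents."""
    shards = [{} for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        shards[doc_id % num_nodes][doc_id] = text  # assumed round-robin split
    return [build_index(shard) for shard in shards]


def term_partition(docs, num_nodes):
    """Each node holds the complete inverted lists for a subset of the terms."""
    full = build_index(docs)
    nodes = [{} for _ in range(num_nodes)]
    for i, term in enumerate(sorted(full)):
        nodes[i % num_nodes][term] = full[term]  # assumed round-robin split
    return nodes


def query_document_partitioned(node_indexes, terms):
    """Every node evaluates the whole query locally; partial results are merged."""
    hits = defaultdict(int)
    for index in node_indexes:
        for term in terms:
            for doc_id in index.get(term, []):
                hits[doc_id] += 1  # placeholder score: matched-term count
    return dict(hits)


def query_term_partitioned(nodes, terms):
    """A central machine fetches each list from its host and does all scoring."""
    hits = defaultdict(int)
    for term in terms:
        for node in nodes:
            for doc_id in node.get(term, []):  # list served to the central machine
                hits[doc_id] += 1
    return dict(hits)
```

Both layouts return the same scores for a given query. What differs is where the work happens: in the document-partitioned version every node touches every query, while in the term-partitioned version only the nodes hosting a query term are contacted and the central machine carries the scoring cost.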

In this paper we introduce a pipelined query-evaluation methodology, based on a term-partitioned index, in which partially evaluated queries are passed amongst the set of processors that host the query terms. This arrangement retains the disk read benefits of term partitioning, but more effectively shares the computational load. We compare the three methodologies experimentally, and show that term distribution is inefficient and scales poorly. The new pipelined approach offers efficient memory utilization and efficient use of disk accesses, but suffers from problems with load balancing between nodes. Until these problems are resolved, document partitioning remains the preferred method.
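The pipelined idea can be sketched as follows: the query carries its partial results (an accumulator table) from one node to the next, visiting only the nodes that host one of its terms. This is a simplified illustration under stated assumptions, not the authors' implementation. The `NODES` layout, the routing order, the function names, and the matched-term-count score are all hypothetical.

```python
from collections import defaultdict

# Term-partitioned toy index: one dict per node, term -> list of document ids.
# The layout and data are illustrative assumptions, not taken from the paper.
NODES = [
    {"distributed": [0, 3], "index": [3], "query": [1, 2]},
    {"evaluation": [1, 2], "pipelined": [2], "text": [0, 1]},
]


def route(nodes, terms):
    """Return the ids of nodes hosting at least one query term, in node order."""
    return [i for i, node in enumerate(nodes) if any(t in node for t in terms)]


def process_at_node(node, terms, accumulators):
    """Add this node's local postings to the partial result carried by the query."""
    for term in terms:
        for doc_id in node.get(term, []):
            accumulators[doc_id] += 1  # placeholder score: matched-term count
    return accumulators


def pipelined_query(nodes, terms, k=10):
    """Pass the partially evaluated query along the nodes that host its terms."""
    accumulators = defaultdict(int)
    visited = []
    for node_id in route(nodes, terms):
        # In a real cluster this step would be a network hop carrying the
        # accumulators; here it is simply the next loop iteration.
        accumulators = process_at_node(nodes[node_id], terms, accumulators)
        visited.append(node_id)
    # The last node in the route ranks the accumulators and returns the top k.
    ranked = sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k], visited
```

Each inverted list is read only on the node that stores it, and the scoring work is spread across the nodes on the route instead of falling on a central machine. A node that hosts frequently queried terms will still appear on many routes, which is one way the load imbalance mentioned in the abstract can arise.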

Keywords

Distributed retrieval; Text searching; Index representations


Copyright information

© Springer Science+Business Media, LLC 2006

Authors and Affiliations

  • Alistair Moffat (1)
  • William Webber (1, 2)
  • Justin Zobel (2)
  • Ricardo Baeza-Yates (3, 4)

  1. Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia
  2. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
  3. Center for Web Research, Department of Computer Science, University of Chile, Santiago, Chile
  4. Yahoo! Research, Barcelona, Spain
