Information Retrieval, Volume 13, Issue 2, pp 157–187

Utilizing passage-based language models for ad hoc document retrieval

  • Michael Bendersky
  • Oren Kurland


To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.


  • Ad hoc document retrieval
  • Passage-based language models
  • Document homogeneity
  • Relevance models
  • Passage-based relevance models



This paper is based upon work done in part while the first author was at the Technion and the second author was at Cornell University. The work presented here was supported in part by Google’s and IBM’s faculty research awards, by the Center for Intelligent Information Retrieval, and by the National Science Foundation under grant no. IIS-0329064. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsoring institutions.



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Computer Science, Center for Intelligent Information Retrieval, University of Massachusetts Amherst, Amherst, USA
  2. Faculty of Industrial Engineering and Management, Technion—Israel Institute of Technology, Haifa, Israel
