Information Retrieval, 14:593

A study of the integration of passage-, document-, and cluster-based information for re-ranking search results



Cluster-based and passage-based document retrieval paradigms have been shown to be effective. The former utilizes query-related corpus context manifested in clusters of similar documents, while the latter addresses the fact that a document can be relevant even if only a very small part of it contains query-pertaining information. Hence, cluster-based approaches can be viewed as "expanding" the document representation, while passage-based approaches can be thought of as utilizing a "contracted" document representation. We present a study of the relative benefits of each of these two approaches, and of the potential merits of their integration. To that end, we devise two methods that integrate whole-document-based, cluster-based, and passage-based information. The methods are applied to the re-ranking task, that is, re-ordering documents in an initially retrieved list so as to improve precision at the very top ranks. Extensive empirical evaluation attests to the potential merits of integrating these information types. Specifically, the resultant performance is substantially better than that of the initial ranking, and is often better than that of a state-of-the-art pseudo-feedback-based query expansion approach.
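To make the idea of integrating the three information types concrete, here is a minimal Python sketch of re-ranking by a weighted sum of a whole-document score, the best passage score, and the mean score of the clusters a document belongs to. This is an illustrative assumption, not the two methods devised in the paper: the function name `rerank`, the interpolation weights, the max/mean aggregation choices, and all scores below are hypothetical.

```python
from typing import Dict, List


def rerank(
    doc_scores: Dict[str, float],
    passage_scores: Dict[str, List[float]],
    cluster_scores: Dict[str, List[float]],
    w_doc: float = 0.4,
    w_passage: float = 0.3,
    w_cluster: float = 0.3,
) -> List[str]:
    """Re-order an initially retrieved list using three evidence types.

    Hypothetical scheme (not taken from the paper): a linear
    interpolation of the whole-document score, the highest passage
    score, and the mean score of the document's clusters.
    """

    def combined(doc_id: str) -> float:
        # "Contracted" representation: the best-matching passage.
        best_passage = max(passage_scores.get(doc_id, []), default=0.0)
        # "Expanded" representation: clusters containing the document.
        clusters = cluster_scores.get(doc_id, [])
        mean_cluster = sum(clusters) / len(clusters) if clusters else 0.0
        return (
            w_doc * doc_scores[doc_id]
            + w_passage * best_passage
            + w_cluster * mean_cluster
        )

    # Highest combined score first; ties keep their initial order.
    return sorted(doc_scores, key=combined, reverse=True)


# Made-up query-similarity scores for three initially retrieved documents.
initial_doc_scores = {"d1": 0.9, "d2": 0.6, "d3": 0.5}
example_passage_scores = {"d1": [0.2, 0.3], "d2": [0.9, 0.1], "d3": [0.4]}
example_cluster_scores = {"d1": [0.3], "d2": [0.8, 0.6], "d3": [0.5]}

reranked = rerank(
    initial_doc_scores, example_passage_scores, example_cluster_scores
)
print(reranked)  # ['d2', 'd1', 'd3']
```

In this toy example, d2 overtakes d1 because one of its passages and its clusters match the query strongly, even though its whole-document score is lower. Setting `w_passage` and `w_cluster` to zero recovers the initial ranking.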


Keywords: Ad hoc retrieval, Re-ranking, Clusters, Cluster-based language models, Passages, Passage-based language models



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology, Haifa, Israel
