The Cluster Hypothesis in Information Retrieval

Kurland, Oren

doi:10.1007/978-3-319-06028-6_105

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

The cluster hypothesis states that “closely associated documents tend to be relevant to the same requests” [45]. It is one of the most fundamental and influential hypotheses in information retrieval and has given rise to a large body of work. In this tutorial we will present the research topics that have emerged from the cluster hypothesis. Specific focus will be placed on cluster-based document retrieval, the use of topic models for ad hoc IR, and graph-based methods that utilize inter-document similarities. Furthermore, we will provide an in-depth survey of retrieval methods that rely, either explicitly or implicitly, on the cluster hypothesis and that are used for a variety of tasks, e.g., query expansion, query-performance prediction, fusion and federated search, and search results diversification.

References

1. Azzopardi, L., Girolami, M., van Rijsbergen, K.: Topic based language models for ad hoc information retrieval. In: Proceedings of IJCNN (2004)
2. Baliński, J., Daniłowicz, C.: Re-ranking method based on inter-document distances. Information Processing and Management 41(4), 759–775 (2005)
3. Bendersky, M., Kurland, O.: Re-ranking search results using document-passage graphs. In: Proceedings of SIGIR, pp. 853–854 (2008) (poster)
4. Croft, W.B.: A model of cluster searching based on classification. Information Systems 5, 189–195 (1980)
5. Daniłowicz, C., Baliński, J.: Document ranking based upon Markov chains. Information Processing and Management 37(4), 623–637 (2001)
6. Diaz, F.: Regularizing ad hoc retrieval scores. In: Proceedings of CIKM, pp. 672–679 (2005)
7. Diaz, F.: Performance prediction using spatial autocorrelation. In: Proceedings of SIGIR, pp. 583–590 (2007)
8. Diaz, F.: A method for transferring retrieval scores between collections with non overlapping vocabularies. In: Proceedings of SIGIR, pp. 805–806 (2008) (poster)
9. El-Hamdouchi, A., Willett, P.: Hierarchic document clustering using Ward’s method. In: Proceedings of SIGIR, pp. 149–156 (1986)
10. El-Hamdouchi, A., Willett, P.: Techniques for the measurement of clustering tendency in document retrieval systems. Journal of Information Science 13, 361–365 (1987)
11. Fuhr, N., Lechtenfeld, M., Stein, B., Gollub, T.: The optimum clustering framework: implementing the cluster hypothesis. Journal of Information Retrieval 15(2), 93–115 (2012)
12. He, J., Meij, E., de Rijke, M.: Result diversification based on query-specific cluster ranking. JASIST 62(3), 550–571 (2011)
13. Hearst, M.A., Karger, D.R., Pedersen, J.O.: Scatter/Gather as a tool for the navigation of retrieval results. In: Working Notes of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval (1995)
14. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of SIGIR, pp. 50–57 (1999)
15. Jardine, N., van Rijsbergen, C.J.: The use of hierarchic clustering in information retrieval. Information Storage and Retrieval 7(5), 217–240 (1971)
16. Kalmanovich, I.G., Kurland, O.: Cluster-based query expansion. In: Proceedings of SIGIR, pp. 646–647 (2009) (poster)
17. Khalaman, S., Kurland, O.: Utilizing inter-document similarities in federated search. In: Proceedings of SIGIR, pp. 1169–1170 (2012)
18. Kozorovitzky, A.K., Kurland, O.: Cluster-based fusion of retrieved lists. In: Proceedings of SIGIR, pp. 893–902 (2011)
19. Kozorovitzky, A.K., Kurland, O.: From “identical” to “similar”: Fusing retrieved lists based on inter-document similarities. Journal of Artificial Intelligence Research (JAIR) 41, 267–296 (2011)
20. Krikon, E., Kurland, O.: A study of the integration of passage-, document-, and cluster-based information for re-ranking search results. Journal of Information Retrieval 14(6), 593–616 (2011)
21. Krikon, E., Kurland, O., Bendersky, M.: Utilizing inter-passage and inter-document similarities for re-ranking search results. ACM Transactions on Information Systems 29(1) (2010)
22. Kurland, O.: The opposite of smoothing: A language model approach to ranking query-specific document clusters. In: Proceedings of SIGIR, pp. 171–178 (2008)
23. Kurland, O.: Re-ranking search results using language models of query-specific clusters. Journal of Information Retrieval 12(4), 437–460 (2009)
24. Kurland, O., Domshlak, C.: A rank-aggregation approach to searching for optimal query-specific clusters. In: Proceedings of SIGIR, pp. 547–554 (2008)
25. Kurland, O., Krikon, E.: The opposite of smoothing: A language model approach to ranking query-specific document clusters. Journal of Artificial Intelligence Research (JAIR) 41, 367–395 (2011)
26. Kurland, O., Lee, L.: Corpus structure, language models, and ad hoc information retrieval. In: Proceedings of SIGIR, pp. 194–201 (2004)
27. Kurland, O., Lee, L.: PageRank without hyperlinks: Structural re-ranking using links induced by language models. In: Proceedings of SIGIR, pp. 306–313 (2005)
28. Kurland, O., Lee, L.: Respect my authority! HITS without hyperlinks utilizing cluster-based language models. In: Proceedings of SIGIR, pp. 83–90 (2006)
29. Kurland, O., Raiber, F., Shtok, A.: Query-performance prediction and cluster ranking: two sides of the same coin. In: Proceedings of CIKM, pp. 2459–2462 (2012)
30. Lee, K.-S., Croft, W.B., Allan, J.: A cluster-based resampling method for pseudo-relevance feedback. In: Proceedings of SIGIR, pp. 235–242 (2008)
31. Lee, K.-S., Park, Y.-C., Choi, K.-S.: Re-ranking model based on document clusters. Information Processing and Management 37(1), 1–14 (2001)
32. Leuski, A.: Evaluating document clustering for interactive information retrieval. In: Proceedings of CIKM, pp. 33–40 (2001)
33. Leouski, A., Allan, J.: Evaluating a visual navigation system for a digital library. In: Nikolaou, C., Stephanidis, C. (eds.) ECDL 1998. LNCS, vol. 1513, pp. 535–554. Springer, Heidelberg (1998)
34. Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: Proceedings of SIGIR, pp. 186–193 (2004)
35. Liu, X., Croft, W.B.: Experiments on retrieval of optimal clusters. Technical Report IR-478, Center for Intelligent Information Retrieval (CIIR), University of Massachusetts (2006)
36. Liu, X., Croft, W.B.: Evaluating text representations for retrieval of the best group of documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 454–462. Springer, Heidelberg (2008)
37. Raiber, F., Kurland, O.: Exploring the cluster hypothesis, and cluster-based retrieval, over the web. In: Proceedings of CIKM, pp. 2507–2510 (2012)
38. Raiber, F., Kurland, O.: Ranking document clusters using Markov random fields. In: Proceedings of SIGIR, pp. 333–342 (2013)
39. Seo, J., Croft, W.B.: Geometric representations for multiple documents. In: Proceedings of SIGIR, pp. 251–258 (2010)
40. Singhal, A., Pereira, F.: Document expansion for speech retrieval. In: Proceedings of SIGIR, pp. 34–41 (1999)
41. Smucker, M.D., Allan, J.: A new measure of the cluster hypothesis. In: Proceedings of ICTIR, pp. 281–288 (2009)
42. Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with document expansion. In: Proceedings of HLT/NAACL, pp. 407–414 (2006)
43. Tombros, A., van Rijsbergen, C.J.: Query-sensitive similarity measures for information retrieval. The Knowledge Information Systems Journal 6(5), 617–642 (2004)
44. Tombros, A., Villa, R., van Rijsbergen, C.J.: The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management 38(4), 559–582 (2002)
45. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths (1979)
46. Vinay, V., Cox, I.J., Milic-Frayling, N., Wood, K.R.: On ranking the effectiveness of searches. In: Proceedings of SIGIR, pp. 398–404 (2006)
47. Voorhees, E.M.: The cluster hypothesis revisited. In: Proceedings of SIGIR, pp. 188–196 (1985)
48. Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: Proceedings of SIGIR, pp. 178–185 (2006)
49. Willett, P.: Query specific automatic document classification. International Forum on Information and Documentation 10(2), 28–32 (1985)
50. Yang, L., Ji, D., Zhou, G., Nie, Y., Xiao, G.: Document re-ranking using cluster validation and label propagation. In: Proceedings of CIKM, pp. 690–697 (2006)
51. Yi, X., Allan, J.: Evaluating topic models for information retrieval. In: Proceedings of CIKM, pp. 1431–1432 (2008)
52. Zhu, X., Goldberg, A.B., Van Gael, J., Andrzejewski, D.: Improving diversity in ranking using absorbing random walks. In: Proceedings of HLT-NAACL, pp. 97–104 (2007)

Author information

Authors and Affiliations

Technion, Israel Institute of Technology, Israel
Oren Kurland

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke & Tom Kenter
Centrum Wiskunde en Informatica, Amsterdam, The Netherlands and Delft University of Technology, Delft, The Netherlands
Arjen P. de Vries
University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai
University of Twente, Twente, The Netherlands and Erasmus University Rotterdam, Rotterdam, The Netherlands
Franciska de Jong
SalesPredict, Haifa, Israel
Kira Radinsky
Microsoft Research, Cambridge, UK
Katja Hofmann

About this paper

Cite this paper

Kurland, O. (2014). The Cluster Hypothesis in Information Retrieval. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_105

DOI: https://doi.org/10.1007/978-3-319-06028-6_105
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science, Computer Science (R0)
