Abstract
The cluster hypothesis states that “closely associated documents tend to be relevant to the same requests” [45]. It is one of the most fundamental and influential hypotheses in information retrieval and has given rise to a large body of work. This tutorial presents the research topics that have emerged from the cluster hypothesis, with specific focus on cluster-based document retrieval, the use of topic models for ad hoc IR, and graph-based methods that utilize inter-document similarities. It also provides an in-depth survey of retrieval methods that rely, explicitly or implicitly, on the cluster hypothesis and are used for a variety of tasks, e.g., query expansion, query-performance prediction, fusion and federated search, and search results diversification.
References
Azzopardi, L., Girolami, M., van Rijsbergen, K.: Topic based language models for ad hoc information retrieval. In: Proceedings of IJCNN (2004)
Baliński, J., Daniłowicz, C.: Re-ranking method based on inter-document distances. Information Processing and Management 41(4), 759–775 (2005)
Bendersky, M., Kurland, O.: Re-ranking search results using document-passage graphs. In: Proceedings of SIGIR, pp. 853–854 (2008) (poster)
Croft, W.B.: A model of cluster searching based on classification. Information Systems 5, 189–195 (1980)
Daniłowicz, C., Baliński, J.: Document ranking based upon Markov chains. Information Processing and Management 37(4), 623–637 (2001)
Diaz, F.: Regularizing ad hoc retrieval scores. In: Proceedings of CIKM, pp. 672–679 (2005)
Diaz, F.: Performance prediction using spatial autocorrelation. In: Proceedings of SIGIR, pp. 583–590 (2007)
Diaz, F.: A method for transferring retrieval scores between collections with non overlapping vocabularies. In: Proceedings of SIGIR, pp. 805–806 (2008) (poster)
El-Hamdouchi, A., Willett, P.: Hierarchic document clustering using Ward’s method. In: Proceedings of SIGIR, pp. 149–156 (1986)
El-Hamdouchi, A., Willett, P.: Techniques for the measurement of clustering tendency in document retrieval systems. Journal of Information Science 13, 361–365 (1987)
Fuhr, N., Lechtenfeld, M., Stein, B., Gollub, T.: The optimum clustering framework: implementing the cluster hypothesis. Journal of Information Retrieval 15(2), 93–115 (2012)
He, J., Meij, E., de Rijke, M.: Result diversification based on query-specific cluster ranking. JASIST 62(3), 550–571 (2011)
Hearst, M.A., Karger, D.R., Pedersen, J.O.: Scatter/Gather as a tool for the navigation of retrieval results. In: Working Notes of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval (1995)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of SIGIR, pp. 50–57 (1999)
Jardine, N., van Rijsbergen, C.J.: The use of hierarchic clustering in information retrieval. Information Storage and Retrieval 7(5), 217–240 (1971)
Kalmanovich, I.G., Kurland, O.: Cluster-based query expansion. In: Proceedings of SIGIR, pp. 646–647 (2009) (poster)
Khalaman, S., Kurland, O.: Utilizing inter-document similarities in federated search. In: Proceedings of SIGIR, pp. 1169–1170 (2012)
Kozorovitzky, A.K., Kurland, O.: Cluster-based fusion of retrieved lists. In: Proceedings of SIGIR, pp. 893–902 (2011)
Kozorovitzky, A.K., Kurland, O.: From “identical” to “similar”: Fusing retrieved lists based on inter-document similarities. Journal of Artificial Intelligence Research (JAIR) 41, 267–296 (2011)
Krikon, E., Kurland, O.: A study of the integration of passage-, document-, and cluster-based information for re-ranking search results. Journal of Information Retrieval 14(6), 593–616 (2011)
Krikon, E., Kurland, O., Bendersky, M.: Utilizing inter-passage and inter-document similarities for re-ranking search results. ACM Transactions on Information Systems 29(1) (2010)
Kurland, O.: The opposite of smoothing: A language model approach to ranking query-specific document clusters. In: Proceedings of SIGIR, pp. 171–178 (2008)
Kurland, O.: Re-ranking search results using language models of query-specific clusters. Journal of Information Retrieval 12(4), 437–460 (2009)
Kurland, O., Domshlak, C.: A rank-aggregation approach to searching for optimal query-specific clusters. In: Proceedings of SIGIR, pp. 547–554 (2008)
Kurland, O., Krikon, E.: The opposite of smoothing: A language model approach to ranking query-specific document clusters. Journal of Artificial Intelligence Research (JAIR) 41, 367–395 (2011)
Kurland, O., Lee, L.: Corpus structure, language models, and ad hoc information retrieval. In: Proceedings of SIGIR, pp. 194–201 (2004)
Kurland, O., Lee, L.: PageRank without hyperlinks: Structural re-ranking using links induced by language models. In: Proceedings of SIGIR, pp. 306–313 (2005)
Kurland, O., Lee, L.: Respect my authority! HITS without hyperlinks utilizing cluster-based language models. In: Proceedings of SIGIR, pp. 83–90 (2006)
Kurland, O., Raiber, F., Shtok, A.: Query-performance prediction and cluster ranking: two sides of the same coin. In: Proceedings of CIKM, pp. 2459–2462 (2012)
Lee, K.-S., Croft, W.B., Allan, J.: A cluster-based resampling method for pseudo-relevance feedback. In: Proceedings of SIGIR, pp. 235–242 (2008)
Lee, K.-S., Park, Y.-C., Choi, K.-S.: Re-ranking model based on document clusters. Information Processing and Management 37(1), 1–14 (2001)
Leuski, A.: Evaluating document clustering for interactive information retrieval. In: Proceedings of CIKM, pp. 33–40 (2001)
Leouski, A., Allan, J.: Evaluating a visual navigation system for a digital library. In: Nikolaou, C., Stephanidis, C. (eds.) ECDL 1998. LNCS, vol. 1513, pp. 535–554. Springer, Heidelberg (1998)
Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: Proceedings of SIGIR, pp. 186–193 (2004)
Liu, X., Croft, W.B.: Experiments on retrieval of optimal clusters. Technical Report IR-478, Center for Intelligent Information Retrieval (CIIR), University of Massachusetts (2006)
Liu, X., Croft, W.B.: Evaluating text representations for retrieval of the best group of documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 454–462. Springer, Heidelberg (2008)
Raiber, F., Kurland, O.: Exploring the cluster hypothesis, and cluster-based retrieval, over the web. In: Proceedings of CIKM, pp. 2507–2510 (2012)
Raiber, F., Kurland, O.: Ranking document clusters using Markov random fields. In: Proceedings of SIGIR, pp. 333–342 (2013)
Seo, J., Croft, W.B.: Geometric representations for multiple documents. In: Proceedings of SIGIR, pp. 251–258 (2010)
Singhal, A., Pereira, F.: Document expansion for speech retrieval. In: Proceedings of SIGIR, pp. 34–41 (1999)
Smucker, M.D., Allan, J.: A new measure of the cluster hypothesis. In: Proceedings of ICTIR, pp. 281–288 (2009)
Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with document expansion. In: Proceedings of HLT/NAACL, pp. 407–414 (2006)
Tombros, A., van Rijsbergen, C.J.: Query-sensitive similarity measures for information retrieval. The Knowledge Information Systems Journal 6(5), 617–642 (2004)
Tombros, A., Villa, R., van Rijsbergen, C.J.: The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management 38(4), 559–582 (2002)
van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths (1979)
Vinay, V., Cox, I.J., Milic-Frayling, N., Wood, K.R.: On ranking the effectiveness of searches. In: Proceedings of SIGIR, pp. 398–404 (2006)
Voorhees, E.M.: The cluster hypothesis revisited. In: Proceedings of SIGIR, pp. 188–196 (1985)
Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: Proceedings of SIGIR, pp. 178–185 (2006)
Willett, P.: Query specific automatic document classification. International Forum on Information and Documentation 10(2), 28–32 (1985)
Yang, L., Ji, D., Zhou, G., Nie, Y., Xiao, G.: Document re-ranking using cluster validation and label propagation. In: Proceedings of CIKM, pp. 690–697 (2006)
Yi, X., Allan, J.: Evaluating topic models for information retrieval. In: Proceedings of CIKM, pp. 1431–1432 (2008)
Zhu, X., Goldberg, A.B., Van Gael, J., Andrzejewski, D.: Improving diversity in ranking using absorbing random walks. In: Proceedings of HLT-NAACL, pp. 97–104 (2007)
© 2014 Springer International Publishing Switzerland
Cite this paper
Kurland, O. (2014). The Cluster Hypothesis in Information Retrieval. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_105
DOI: https://doi.org/10.1007/978-3-319-06028-6_105
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science (R0)