The Cluster Hypothesis in Information Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Abstract

The cluster hypothesis states that “closely associated documents tend to be relevant to the same requests” [45]. It is one of the most fundamental and influential hypotheses in information retrieval and has given rise to a large body of work. In this tutorial we present the research topics that have emerged from the cluster hypothesis. We focus specifically on cluster-based document retrieval, the use of topic models for ad hoc IR, and graph-based methods that utilize inter-document similarities. Furthermore, we provide an in-depth survey of the retrieval methods that rely, explicitly or implicitly, on the cluster hypothesis and are used for a variety of tasks, e.g., query expansion, query-performance prediction, fusion and federated search, and search results diversification.
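
As a rough illustration of what the hypothesis claims, the sketch below checks how often the nearest neighbors of a relevant document are themselves relevant, loosely in the spirit of the nearest-neighbor tests of the hypothesis, such as the one associated with Voorhees [47]. Everything in it is an assumption made for illustration and is not taken from the tutorial: the toy documents, the relevance judgments, whitespace tokenization, raw term-frequency cosine similarity, and the hypothetical function names `cosine` and `relevant_neighbor_fraction`.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevant_neighbor_fraction(docs: dict, relevant: set, k: int = 1) -> float:
    """Average, over relevant documents, of the fraction of each one's
    k most similar documents that are also relevant."""
    vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}
    fractions = []
    for d in sorted(relevant):
        others = [o for o in vectors if o != d]
        if not others:
            continue  # a single-document collection has no neighbors
        # Rank all other documents by similarity to d, most similar first.
        others.sort(key=lambda o: cosine(vectors[d], vectors[o]), reverse=True)
        top = others[:k]
        fractions.append(sum(o in relevant for o in top) / len(top))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Toy collection (assumed): d1 and d2 are judged relevant to a query about cars.
DOCS = {
    "d1": "jaguar car engine speed",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle habitat",
    "d4": "jungle cat habitat prey",
}
RELEVANT = {"d1", "d2"}

print(relevant_neighbor_fraction(DOCS, RELEVANT, k=1))
```

A value close to 1 suggests that relevant documents sit near one another, which is the situation the hypothesis describes; on real collections the similarity measure and the choice of `k` would need to be chosen with care.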

References

  1. Azzopardi, L., Girolami, M., van Rijsbergen, K.: Topic based language models for ad hoc information retrieval. In: Proceedings of IJCNN (2004)

  2. Baliński, J., Daniłowicz, C.: Re-ranking method based on inter-document distances. Information Processing and Management 41(4), 759–775 (2005)

  3. Bendersky, M., Kurland, O.: Re-ranking search results using document-passage graphs. In: Proceedings of SIGIR, pp. 853–854 (2008) (poster)

  4. Croft, W.B.: A model of cluster searching based on classification. Information Systems 5, 189–195 (1980)

  5. Daniłowicz, C., Baliński, J.: Document ranking based upon Markov chains. Information Processing and Management 37(4), 623–637 (2001)

  6. Diaz, F.: Regularizing ad hoc retrieval scores. In: Proceedings of CIKM, pp. 672–679 (2005)

  7. Diaz, F.: Performance prediction using spatial autocorrelation. In: Proceedings of SIGIR, pp. 583–590 (2007)

  8. Diaz, F.: A method for transferring retrieval scores between collections with non overlapping vocabularies. In: Proceedings of SIGIR, pp. 805–806 (2008) (poster)

  9. El-Hamdouchi, A., Willett, P.: Hierarchic document clustering using Ward’s method. In: Proceedings of SIGIR, pp. 149–156 (1986)

  10. El-Hamdouchi, A., Willett, P.: Techniques for the measurement of clustering tendency in document retrieval systems. Journal of Information Science 13, 361–365 (1987)

  11. Fuhr, N., Lechtenfeld, M., Stein, B., Gollub, T.: The optimum clustering framework: implementing the cluster hypothesis. Journal of Information Retrieval 15(2), 93–115 (2012)

  12. He, J., Meij, E., de Rijke, M.: Result diversification based on query-specific cluster ranking. JASIST 62(3), 550–571 (2011)

  13. Hearst, M.A., Karger, D.R., Pedersen, J.O.: Scatter/Gather as a tool for the navigation of retrieval results. In: Working Notes of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval (1995)

  14. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of SIGIR, pp. 50–57 (1999)

  15. Jardine, N., van Rijsbergen, C.J.: The use of hierarchic clustering in information retrieval. Information Storage and Retrieval 7(5), 217–240 (1971)

  16. Kalmanovich, I.G., Kurland, O.: Cluster-based query expansion. In: Proceedings of SIGIR, pp. 646–647 (2009) (poster)

  17. Khalaman, S., Kurland, O.: Utilizing inter-document similarities in federated search. In: Proceedings of SIGIR, pp. 1169–1170 (2012)

  18. Kozorovitzky, A.K., Kurland, O.: Cluster-based fusion of retrieved lists. In: Proceedings of SIGIR, pp. 893–902 (2011)

  19. Kozorovitzky, A.K., Kurland, O.: From “identical” to “similar”: Fusing retrieved lists based on inter-document similarities. Journal of Artificial Intelligence Research (JAIR) 41, 267–296 (2011)

  20. Krikon, E., Kurland, O.: A study of the integration of passage-, document-, and cluster-based information for re-ranking search results. Journal of Information Retrieval 14(6), 593–616 (2011)

  21. Krikon, E., Kurland, O., Bendersky, M.: Utilizing inter-passage and inter-document similarities for re-ranking search results. ACM Transactions on Information Systems 29(1) (2010)

  22. Kurland, O.: The opposite of smoothing: A language model approach to ranking query-specific document clusters. In: Proceedings of SIGIR, pp. 171–178 (2008)

  23. Kurland, O.: Re-ranking search results using language models of query-specific clusters. Journal of Information Retrieval 12(4), 437–460 (2009)

  24. Kurland, O., Domshlak, C.: A rank-aggregation approach to searching for optimal query-specific clusters. In: Proceedings of SIGIR, pp. 547–554 (2008)

  25. Kurland, O., Krikon, E.: The opposite of smoothing: A language model approach to ranking query-specific document clusters. Journal of Artificial Intelligence Research (JAIR) 41, 367–395 (2011)

  26. Kurland, O., Lee, L.: Corpus structure, language models, and ad hoc information retrieval. In: Proceedings of SIGIR, pp. 194–201 (2004)

  27. Kurland, O., Lee, L.: PageRank without hyperlinks: Structural re-ranking using links induced by language models. In: Proceedings of SIGIR, pp. 306–313 (2005)

  28. Kurland, O., Lee, L.: Respect my authority! HITS without hyperlinks utilizing cluster-based language models. In: Proceedings of SIGIR, pp. 83–90 (2006)

  29. Kurland, O., Raiber, F., Shtok, A.: Query-performance prediction and cluster ranking: two sides of the same coin. In: Proceedings of CIKM, pp. 2459–2462 (2012)

  30. Lee, K.-S., Croft, W.B., Allan, J.: A cluster-based resampling method for pseudo-relevance feedback. In: Proceedings of SIGIR, pp. 235–242 (2008)

  31. Lee, K.-S., Park, Y.-C., Choi, K.-S.: Re-ranking model based on document clusters. Information Processing and Management 37(1), 1–14 (2001)

  32. Leuski, A.: Evaluating document clustering for interactive information retrieval. In: Proceedings of CIKM, pp. 33–40 (2001)

  33. Leouski, A., Allan, J.: Evaluating a visual navigation system for a digital library. In: Nikolaou, C., Stephanidis, C. (eds.) ECDL 1998. LNCS, vol. 1513, pp. 535–554. Springer, Heidelberg (1998)

  34. Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: Proceedings of SIGIR, pp. 186–193 (2004)

  35. Liu, X., Croft, W.B.: Experiments on retrieval of optimal clusters. Technical Report IR-478, Center for Intelligent Information Retrieval (CIIR), University of Massachusetts (2006)

  36. Liu, X., Croft, W.B.: Evaluating text representations for retrieval of the best group of documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 454–462. Springer, Heidelberg (2008)

  37. Raiber, F., Kurland, O.: Exploring the cluster hypothesis, and cluster-based retrieval, over the web. In: Proceedings of CIKM, pp. 2507–2510 (2012)

  38. Raiber, F., Kurland, O.: Ranking document clusters using Markov random fields. In: Proceedings of SIGIR, pp. 333–342 (2013)

  39. Seo, J., Croft, W.B.: Geometric representations for multiple documents. In: Proceedings of SIGIR, pp. 251–258 (2010)

  40. Singhal, A., Pereira, F.: Document expansion for speech retrieval. In: Proceedings of SIGIR, pp. 34–41 (1999)

  41. Smucker, M.D., Allan, J.: A new measure of the cluster hypothesis. In: Proceedings of ICTIR, pp. 281–288 (2009)

  42. Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with document expansion. In: Proceedings of HLT/NAACL, pp. 407–414 (2006)

  43. Tombros, A., van Rijsbergen, C.J.: Query-sensitive similarity measures for information retrieval. The Knowledge Information Systems Journal 6(5), 617–642 (2004)

  44. Tombros, A., Villa, R., van Rijsbergen, C.J.: The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management 38(4), 559–582 (2002)

  45. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths (1979)

  46. Vinay, V., Cox, I.J., Milic-Frayling, N., Wood, K.R.: On ranking the effectiveness of searches. In: Proceedings of SIGIR, pp. 398–404 (2006)

  47. Voorhees, E.M.: The cluster hypothesis revisited. In: Proceedings of SIGIR, pp. 188–196 (1985)

  48. Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: Proceedings of SIGIR, pp. 178–185 (2006)

  49. Willett, P.: Query specific automatic document classification. International Forum on Information and Documentation 10(2), 28–32 (1985)

  50. Yang, L., Ji, D., Zhou, G., Nie, Y., Xiao, G.: Document re-ranking using cluster validation and label propagation. In: Proceedings of CIKM, pp. 690–697 (2006)

  51. Yi, X., Allan, J.: Evaluating topic models for information retrieval. In: Proceedings of CIKM, pp. 1431–1432 (2008)

  52. Zhu, X., Goldberg, A.B., Van Gael, J., Andrzejewski, D.: Improving diversity in ranking using absorbing random walks. In: Proceedings of HLT-NAACL, pp. 97–104 (2007)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Kurland, O. (2014). The Cluster Hypothesis in Information Retrieval. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_105

  • DOI: https://doi.org/10.1007/978-3-319-06028-6_105

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06027-9

  • Online ISBN: 978-3-319-06028-6

  • eBook Packages: Computer Science, Computer Science (R0)
