Evaluating Text Representations for Retrieval of the Best Group of Documents

  • Conference paper
Advances in Information Retrieval (ECIR 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Abstract

Cluster retrieval assumes that the probability of relevance of a document should depend on the relevance of other, similar documents to the same query. The goal is to find the best group of documents. Many studies have examined the effectiveness of this approach by employing different retrieval methods or clustering algorithms, but few have investigated text representations. This paper revisits the problem of retrieving the best group of documents from a language-modeling perspective. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize good document groups, and experiment with a new probabilistic representation as a first step toward incorporating these features. Empirical evaluation demonstrates that the relationship between documents can be leveraged in retrieval when a good representation technique is available, and that retrieving the best group of documents can be more effective than retrieving individual documents.
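As a rough illustration of what ranking document groups under a language model can look like, the sketch below scores a group by Dirichlet-smoothed query likelihood after pooling the term counts of its documents. This pooled ("concatenation") representation is only one simple possibility and is not claimed to be the representation proposed in the paper; the function name `group_query_log_likelihood`, the toy documents, and the smoothing value `mu` are all assumptions made for the example.

```python
import math
from collections import Counter


def group_query_log_likelihood(query, group, coll_counts, coll_len, mu=1000.0):
    """Log P(query | group) with Dirichlet smoothing.

    Assumption: the group is represented by pooling the term counts of its
    documents, as if they were concatenated into one pseudo-document.
    """
    counts = Counter()
    for doc in group:
        counts.update(doc)
    group_len = sum(counts.values())

    score = 0.0
    for term in query:
        # Collection model used as the smoothing background.
        p_coll = coll_counts.get(term, 0) / coll_len
        p = (counts.get(term, 0) + mu * p_coll) / (group_len + mu)
        if p == 0.0:
            # The term does not occur anywhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy, hypothetical collection: each document is a list of tokens.
docs = [
    ["cluster", "retrieval", "language", "model"],
    ["language", "model", "smoothing"],
    ["football", "match", "report"],
    ["football", "league", "table"],
]
coll_counts = Counter(term for doc in docs for term in doc)
coll_len = sum(coll_counts.values())

query = ["language", "model"]
on_topic = group_query_log_likelihood(query, docs[:2], coll_counts, coll_len)
off_topic = group_query_log_likelihood(query, docs[2:], coll_counts, coll_len)
print(on_topic > off_topic)
```

Groups would then be ranked by this score, with the top-scoring group returned in place of a ranked list of individual documents. Pooling is easy to compute, but it lets one long document dominate the group's model, which is one reason the choice of representation matters.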


Editor information

Craig Macdonald Iadh Ounis Vassilis Plachouras Ian Ruthven Ryen W. White


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, X., Croft, W.B. (2008). Evaluating Text Representations for Retrieval of the Best Group of Documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_43

  • DOI: https://doi.org/10.1007/978-3-540-78646-7_43

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78645-0

  • Online ISBN: 978-3-540-78646-7

  • eBook Packages: Computer Science, Computer Science (R0)
