Evaluating Text Representations for Retrieval of the Best Group of Documents

  • Xiaoyong Liu
  • W. Bruce Croft
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)


Cluster retrieval assumes that the probability of relevance of a document should depend on the relevance of other similar documents to the same query. The goal is to find the best group of documents. Many studies have examined the effectiveness of this approach by employing different retrieval methods or clustering algorithms, but few have investigated text representations. This paper revisits the problem of retrieving the best group of documents from the language-modeling perspective. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize good document groups, and experiment with a new probabilistic representation as a first step toward incorporating these features. Empirical evaluation demonstrates that the relationships between documents can be leveraged in retrieval when a good representation technique is available, and that retrieving the best group of documents can be more effective than retrieving individual documents.
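To make the idea of ranking groups rather than individual documents concrete, the following is a minimal sketch, not the representation proposed in the paper. It assumes one simple baseline representation (a cluster is treated as the concatenation of its member documents) and scores each cluster by Dirichlet-smoothed query likelihood. The function names, the whitespace tokenizer, the toy clusters, and the smoothing parameter `mu` are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopping.
    return text.lower().split()


def cluster_query_likelihood(query, cluster_docs, collection_counts,
                             collection_len, mu=1000.0):
    """Return log P(query | cluster) under Dirichlet smoothing.

    The cluster is represented as the concatenation of its documents,
    which is only one of many possible cluster representations.
    """
    counts = Counter()
    for doc in cluster_docs:
        counts.update(tokenize(doc))
    cluster_len = sum(counts.values())

    score = 0.0
    for term in tokenize(query):
        # Background (collection) probability used for smoothing.
        p_coll = (collection_counts.get(term, 0) / collection_len
                  if collection_len else 0.0)
        p = (counts.get(term, 0) + mu * p_coll) / (cluster_len + mu)
        if p == 0.0:
            # Term never occurs in the collection: no probability mass.
            return float("-inf")
        score += math.log(p)
    return score


def rank_clusters(query, clusters):
    """Rank clusters (name -> list of document strings) by query likelihood."""
    collection_counts = Counter()
    for docs in clusters.values():
        for doc in docs:
            collection_counts.update(tokenize(doc))
    collection_len = sum(collection_counts.values())

    scored = [
        (name, cluster_query_likelihood(query, docs, collection_counts,
                                        collection_len))
        for name, docs in clusters.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, for illustration only.
clusters = {
    "ir": ["language models for retrieval",
           "cluster retrieval with language models"],
    "bio": ["protein folding dynamics",
            "gene expression in yeast"],
}
ranking = rank_clusters("language models retrieval", clusters)
best_cluster = ranking[0][0]
```

In this sketch the top-ranked cluster would be returned as a whole. Swapping the concatenation step for a different way of building `counts` is where alternative cluster representations would plug in.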


Keywords: Text Representation, Document Retrieval, Cluster Retrieval, Cluster Representation, Representation Techniques





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Xiaoyong Liu (1)
  • W. Bruce Croft (1)
  1. CIIR, Computer Science Department, University of Massachusetts, Amherst, USA
