Information Retrieval, Volume 15, Issue 2, pp 93–115

The optimum clustering framework: implementing the cluster hypothesis

  • Norbert Fuhr
  • Marc Lechtenfeld
  • Benno Stein
  • Tim Gollub


Document clustering offers the potential of supporting users in interactive retrieval, especially when users have problems specifying their information need precisely. In this paper, we present a theoretical foundation for optimum document clustering. The key idea is to base cluster analysis and evaluation on a set of queries by defining documents as similar if they are relevant to the same queries. Three components are essential within our optimum clustering framework (OCF): (1) a set of queries, (2) a probabilistic retrieval method, and (3) a document similarity metric. After introducing an appropriate validity measure, we define optimum clustering with respect to the estimates of the relevance probability for the query-document pairs under consideration. Moreover, we show that well-known clustering methods are implicitly based on these three components, but that they use heuristic design decisions for some of them. We argue that our framework makes more targeted research on better document clustering methods possible. Experimental results demonstrate the potential of our considerations.
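The key idea can be illustrated with a minimal Python sketch. It is not the paper's actual validity measure or similarity metric, which the abstract does not specify. The sketch assumes that each document is represented by a vector of estimated relevance probabilities over a fixed query set, that similarity is the dot product of two such vectors, and that a clustering is scored by summing the similarities of all document pairs placed in the same cluster. The names `REL_PROBS`, `similarity`, and `within_cluster_score`, as well as all numbers, are hypothetical.

```python
from itertools import combinations

# Hypothetical estimates of P(relevant | query, document).
# One vector per document, one entry per query; the values are made up.
REL_PROBS = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.8, 0.2, 0.1],
    "d3": [0.1, 0.9, 0.7],
    "d4": [0.0, 0.8, 0.9],
}


def similarity(u, v):
    """Assumed similarity: dot product of two relevance-probability vectors.

    Under an independence assumption, each term is the probability that
    both documents are relevant to the same query.
    """
    return sum(a * b for a, b in zip(u, v))


def within_cluster_score(clustering, probs):
    """Sum the assumed similarity over all document pairs sharing a cluster."""
    total = 0.0
    for cluster in clustering:
        for a, b in combinations(sorted(cluster), 2):
            total += similarity(probs[a], probs[b])
    return total


# Two candidate clusterings of the same four documents.
grouped_by_queries = [{"d1", "d2"}, {"d3", "d4"}]
mixed = [{"d1", "d3"}, {"d2", "d4"}]

score_grouped = within_cluster_score(grouped_by_queries, REL_PROBS)
score_mixed = within_cluster_score(mixed, REL_PROBS)
print(score_grouped > score_mixed)
```

In this toy setting, the clustering that groups documents relevant to the same queries scores higher than the mixed one, which is the behaviour a query-based validity measure is meant to reward.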


Document clustering, Cluster metric, Probabilistic retrieval, Probability ranking principle



This work was supported in part by the German Science Foundation (DFG) under grants FU205/22-1 and STE1019/2-1.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Norbert Fuhr (1)
  • Marc Lechtenfeld (1)
  • Benno Stein (2)
  • Tim Gollub (2)

  1. University of Duisburg-Essen, Duisburg, Germany
  2. Bauhaus-Universität Weimar, Weimar, Germany
