# The optimum clustering framework: implementing the cluster hypothesis

## Abstract

Document clustering can support users in interactive retrieval, especially when they have difficulty specifying their information need precisely. In this paper, we present a theoretical foundation for optimum document clustering. The key idea is to base cluster analysis and evaluation on a set of queries, by defining documents as similar if they are relevant to the same queries. Three components are essential within our optimum clustering framework, OCF: (1) a set of queries, (2) a probabilistic retrieval method, and (3) a document similarity metric. After introducing an appropriate validity measure, we define optimum clustering with respect to the estimates of the relevance probability for the query-document pairs under consideration. Moreover, we show that well-known clustering methods are implicitly based on these three components, but that they rely on heuristic design decisions for some of them. We argue that our framework makes more targeted research into better document clustering methods possible. Experimental results demonstrate the potential of our considerations.
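As a rough illustration of the key idea, the sketch below treats each document as a vector of estimated relevance probabilities, one entry per query, and scores two documents as similar when they are likely to be relevant to the same queries. The document names, the probability values, and the choice of a scalar product as the similarity metric are assumptions made for this example only; the abstract does not fix any of them.

```python
from itertools import combinations

def similarity(p_a, p_b):
    """Scalar product of two documents' per-query relevance probabilities.

    Assumed metric for illustration: it is large when both documents
    have a high estimated relevance probability for the same queries.
    """
    return sum(x * y for x, y in zip(p_a, p_b))

# Hypothetical estimates: rel_prob[d][i] is the estimated probability that
# document d is relevant to query i (three queries in this toy query set).
rel_prob = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.8, 0.2, 0.1],
    "d3": [0.0, 0.1, 0.9],
}

# Similarity for every unordered document pair.
pairwise = {
    (a, b): similarity(rel_prob[a], rel_prob[b])
    for a, b in combinations(sorted(rel_prob), 2)
}

for pair, score in sorted(pairwise.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this toy setting, d1 and d2 share a likely-relevant query and so score highest, which is the behaviour a query-based similarity definition is meant to capture.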

## Keywords

Document clustering · Cluster metric · Probabilistic retrieval · Probability ranking principle

## Notes

### Acknowledgments

This work was supported in part by the German Science Foundation (DFG) under grants FU205/22-1 and STE1019/2-1.
