Topic-Constrained Hierarchical Clustering for Document Datasets

  • Ying Zhao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6440)


In this paper, we propose the topic-constrained hierarchical clustering, which organizes document datasets into hierarchical trees consistant with a given set of topics. The proposed algorithm is based on a constrained agglomerative clustering framework and a semi-supervised criterion function that emphasizes the relationship between documents and topics and the relationship among documents themselves simultaneously. The experimental evaluation show that our algorithm outperformed the traditional agglomerative algorithm by 7.8% to 11.4%.


Constrained hierarchical clustering Semi-supervised learning Criterion functions 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: 18th International Conference on Machine Learning (ICML 2001), pp. 577–584 (2001)Google Scholar
  2. 2.
    Basu, S., Bilenko, M., Monney, R.: A probabilistic framework for semi-supervised clustering. In: 10th International Conference on Knowledge Discovery and Data Mining (2004)Google Scholar
  3. 3.
    Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning with application to clustering with side-information. In: Advances in Neural Information Processing Systerms, vol. 15, pp. 505–512 (2003)Google Scholar
  4. 4.
    Davidson, I., Ravi, S.S.: Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. In: 9th European Conference on Principles and Practice of Knowledge Discovery in Databeses, pp. 59–70 (2005)Google Scholar
  5. 5.
    Chakrabarti, D., Kumar, R., Tomkins, A.: Evolutionary clustering. In: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 554–560 (2006)Google Scholar
  6. 6.
    Bade, K., Nurnberger, A.: Creating a cluster hierarchy under constraints of a partially known hierarchy. In: 2008 SIAM International Conference on Data Mining (SDM 2008) (2008)Google Scholar
  7. 7.
    Zhao, Y., Karypis, G.: Topic-driven Clustering for Document Datasets. In: 2005 SIAM International Conference on Data Mining (SDM 2005), pp. 358–369 (2005)Google Scholar
  8. 8.
    Bilenko, M., Basu, S., Monney, R.: Integrating constraints and metric learning in semi-supervised clustering. In: 21th International Conference on Machine Learning, ICML 2004 (2004)Google Scholar
  9. 9.
    Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading (1989)Google Scholar
  10. 10.
    Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)zbMATHGoogle Scholar
  11. 11.
    Cutting, D.R., Pedersen, J.O., Karger, D.R., Tukey, J.W.: Scatter/gather: A cluster-based approach to browsing large document collections. In: 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992)Google Scholar
  12. 12.
    Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999)Google Scholar
  13. 13.
    Dhillon, I.S., Modha, D.S.: Concept decompositions for large sparse text data using clustering. Machine Learning 42(1/2), 143–175 (2001)CrossRefzbMATHGoogle Scholar
  14. 14.
    Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55(3), 311–331 (2004)CrossRefzbMATHGoogle Scholar
  15. 15.
    Zhao, Y., Karypis, G.: Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery 10(2), 144–168 (2005)MathSciNetCrossRefGoogle Scholar
  16. 16.
    TREC. Text REtrieval conference,
  17. 17.
    Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ying Zhao
    • 1
  1. 1.Department of Computer Science and TechnologyTsinghua UniversityBeijingChina

Personalised recommendations