Co-clustering Documents and Words by Minimizing the Normalized Cut Objective Function

Article

Abstract

This paper follows a word-document co-clustering model independently introduced in 2001 by several authors such as I.S. Dhillon, H. Zha and C. Ding. This model consists in creating a bipartite graph based on word frequencies in documents, and whose vertices are both documents and words. The created bipartite graph is then partitioned in a way that minimizes the normalized cut objective function to produce the document clustering. The fusion-fission graph partitioning metaheuristic is applied on several document collections using this word-document co-clustering model. Results demonstrate a real problem in this model partitions found almost always have a normalized cut value lowest than the original document collection clustering. Moreover, measures of the goodness of solutions seem to be relatively independent of the normalized cut values of partitions.

Keywords

Graph partitioning Normalized cut Fusion-Fission Document clustering Information retrieval Data mining 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Anagnostopoulos, A., Dasgupta, A., Kumar, R.: Approximation algorithms for co-clustering. In: Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 201–210 (2008)Google Scholar
  2. 2.
    Baker, L.D., McCallum, A.K.: Distributional clustering of words for text classification. In: Proceedings of the 21st ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 96–103 (1998)Google Scholar
  3. 3.
    Bichot, C.-E.: A metaheuristic based on fusion and fission for partitioning problems. In: Proceedings of NIDISC’08 in Conjunction with the 20th IEEE International Parallel and Distributed Processing Symposium (2006)Google Scholar
  4. 4.
    Bichot, C.-E.: A new Method, the fusion fission, for the relaxed k-way graph partitioning problem, and comparisons with some multilevel algorithms. J. Math. Model Algorithm 6(3), 319–344 (2007)MATHCrossRefMathSciNetGoogle Scholar
  5. 5.
    Bichot, C.-E.: A new meta-method for graph partitioning. In: Proceedings of the 2008 IEEE Congress on Evolutionary Computation, pp. 3498–3505 (2008)Google Scholar
  6. 6.
    Boley, D., Gini, M., Gross, R., Han, S., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., Moore, J.: Partitioning-based clustering for web document categorization. Decis. Support Syst. 27(3), 329–341 (1999)CrossRefGoogle Scholar
  7. 7.
    Costa, G., Manco, G., Ortale, R.: A hierarchical model-based approach to co-clustering high-dimensional data. In: Proceedings of the ACM Symposium on Applied Computing, pp. 886–890 (2008)Google Scholar
  8. 8.
    Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 269–274 (2001)Google Scholar
  9. 9.
    Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1944–1957 (2007)CrossRefGoogle Scholar
  10. 10.
    Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoric co-clustering. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 89–98 (2003)Google Scholar
  11. 11.
    Ding, C., He, X., Zha, H., Gu, M., Simon, H.D.: A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings of the 1st IEEE International Conference on Data Mining, pp. 107–114 (2001)Google Scholar
  12. 12.
    Fiduccia, C.M., Mattheyses, R.M.: A linear-time heuristic for improving network partitions. In: Proceedings of 19th ACM/IEEE Design Automation Conference, pp. 175–181 (1982)Google Scholar
  13. 13.
    Hendrickson, B., Leland, R.W.: A multilevel algorithm for partitioning graphs. In: Proceedings of Supercomputing (1995)Google Scholar
  14. 14.
    Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)CrossRefGoogle Scholar
  15. 15.
    Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)CrossRefGoogle Scholar
  16. 16.
    Karypis, G., Kumar, V.: Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), 96–129 (1998)CrossRefMathSciNetGoogle Scholar
  17. 17.
    Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49(2), 291–307 (1970)Google Scholar
  18. 18.
    Long, B., Zhang, Z.M., Yu, P.S.: Co-clustering by block value decomposition. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 635–640 (2005)Google Scholar
  19. 19.
    Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinf. 1(1), 24–45 (2004)CrossRefGoogle Scholar
  20. 20.
    Mandhani, B., Joshi, S., Kummamuru, K.: A matrix density based algorithm to hierarchically co-cluster documents and words. In: Proceedings of the 12th International Conference on World Wide Web, pp. 658–665 (2003)Google Scholar
  21. 21.
    Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)Google Scholar
  22. 22.
    Pothen, A., Simon, H.D., Liou, K.-P.: Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11(3), 430–452 (1990)MATHCrossRefMathSciNetGoogle Scholar
  23. 23.
    Rand, W.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971)CrossRefGoogle Scholar
  24. 24.
    Rege, M., Dong, M., Fotouhi, F.: Co-clustering documents and words using bipartite isoperimetric graph partitioning. In: Proceedings of the 6th IEEE International Conference on Data Mining, pp. 532–541 (2006)Google Scholar
  25. 25.
    Rege, M., Dong, M., Toouhi, F.: Co-clustering image features and semantic concepts. In: Proceedings of IEEE International Conference on Image Processing, pp. 137–140 (2006)Google Scholar
  26. 26.
    Rokach, L., Maimon, O.: Data mining and knowledge discovery handbook, Chapt. Clustering methods. Springer (2005)Google Scholar
  27. 27.
    Slonim, N., Tishby, N.: Document clustering using word clusters via the information bottleneck method. In: Proceedings of the 23rd ACM SIGIR International Conference on Research and Development in Informaion Retrieval, pp. 208–215 (2000)Google Scholar
  28. 28.
    Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: Proceedings of ACM SIGKDD Workshop on Text Mining (2000)Google Scholar
  29. 29.
    Tagarelli, A., Karypis, G.: A segment-based approach To clustering multi-topic documents. In: Text Mining Workshop, SIAM Datamining Conference (2008)Google Scholar
  30. 30.
    Tjhi, W.-C., Chen, L.: A partitioning based algorithm to fuzzy co-cluster documents and words. Pattern Recogn. Lett. 27(3), 151–159 (2006)CrossRefGoogle Scholar
  31. 31.
    Walshaw, C.: Multilevel refinement for combinatorial optimisation problems. Ann. Oper. Res. 131, 325–372 (2004)MATHCrossRefMathSciNetGoogle Scholar
  32. 32.
    Wu, X., Ngo, C.-W., Li, Q.: Co-Clustering of time-evolving news story with transcript and keyframe. In: Proceedings of IEEE International Conference on Multimedia and Expo, pp. 117–120 (2005)Google Scholar
  33. 33.
    Zha, H., He, X., Ding, C.H.Q., Gu, M., Simon, H.D.: Bipartite graph partitioning and data clustering. In: ACM Conference on Information and Knowledge Management, pp. 25–32 (2001)Google Scholar
  34. 34.
    Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach. Learn. 55(3), 311–331 (2004)MATHCrossRefGoogle Scholar
  35. 35.
    Zhao, Y., Karypis, G., Fayyad, U.M.: Hierarchical clustering algorithms for document datasets. Data Min. Knowl. Discov. 10(2), 141–168 (2005)CrossRefMathSciNetGoogle Scholar

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. 1.Université de Lyon, CNRSLyonFrance

Personalised recommendations