Information-Theoretic K-means for Text Clustering

  • Junjie Wu
Chapter
Part of the Springer Theses book series (Springer Theses)

Abstract

Information-theoretic clustering uses information-theoretic measures as clustering criteria. A common approach is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While research on Info-Kmeans has shown promising results, handling high-dimensional sparse data such as text corpora remains a challenge. For high-dimensional text vectors, the centroids may contain many zero-value features, which leads to infinite KL-divergence values and creates a dilemma in assigning objects to centroids during the iteration process of Info-Kmeans. To meet this challenge, we propose a Summation-based Incremental Learning (SAIL) algorithm for Info-Kmeans clustering in this chapter. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of the Shannon entropy, which avoids the zero-value dilemma. To improve the clustering quality, we further introduce the Variable Neighborhood Search (VNS) meta-heuristic and propose the V-SAIL algorithm, which is then accelerated by a multithreading scheme in PV-SAIL. Experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. V-SAIL and PV-SAIL further improve the clustering quality at a low computational cost.
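The zero-value dilemma and the entropy-based alternative can be illustrated with a small sketch. This is not the chapter's SAIL implementation: it assumes equal document weights, a toy four-word vocabulary, and hypothetical helper names (`kl_divergence`, `entropy`, `centroid`). It relies only on the standard identity that, when the centroid is the mean of the cluster's distributions, the average KL-divergence to the centroid equals the centroid's entropy minus the average entropy of the members.

```python
import math

def kl_divergence(x, m):
    """D(x || m) in nats; infinite when m is zero where x is positive."""
    total = 0.0
    for xi, mi in zip(x, m):
        if xi == 0.0:
            continue  # 0 * log(0 / m) is taken as 0
        if mi == 0.0:
            return math.inf  # the zero-value dilemma
        total += xi * math.log(xi / mi)
    return total

def entropy(p):
    """Shannon entropy in nats, with 0 * log 0 taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def centroid(docs):
    """Equal-weight mean of normalized term distributions (an assumption)."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

# Hypothetical toy data: each row is a term distribution that sums to 1.
cluster = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.75, 0.0, 0.0]]
new_doc = [0.0, 0.5, 0.5, 0.0]

m = centroid(cluster)  # [0.375, 0.625, 0.0, 0.0]

# The centroid has a zero where new_doc is positive, so the divergence is infinite.
kl_to_centroid = kl_divergence(new_doc, m)

# Inside the cluster, the average KL-divergence to the centroid matches
# the entropy form H(m) - mean H(x), which never divides by a centroid value.
within_kl = sum(kl_divergence(x, m) for x in cluster) / len(cluster)
within_entropy = entropy(m) - sum(entropy(x) for x in cluster) / len(cluster)

# The entropy of the centroid after tentatively adding new_doc stays finite.
entropy_after_add = entropy(centroid(cluster + [new_doc]))

print(kl_to_centroid, within_kl, within_entropy, entropy_after_add)
```

Because the entropy form only needs the centroid after the candidate document has been added, the comparison between clusters never evaluates a logarithm of a zero centroid entry against a positive document entry.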

Keywords

Shannon Entropy, Cluster Performance, Variable Neighborhood Search, Normalize Mutual Information, Cluster Quality
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Junjie Wu, Department of Information Systems, School of Economics and Management, Beihang University, Beijing, China
