A New Text Clustering Method Using Hidden Markov Model

  • Yan Fu
  • Dongqing Yang
  • Shiwei Tang
  • Tengjiao Wang
  • Aiqiang Gao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4592)


Because text data is high-dimensional and semantically related, text clustering remains an important topic in data mining. However, little work has investigated the attributes of the clustering process itself; previous studies have focused only on the characteristics of the text. Treating clustering as a dynamic, sequential process, we aim to describe text clustering as state transitions for words or documents. Taking the K-means clustering method as an example, we parse the clustering process into several sequences. Building on research into sequential and temporal data clustering, we propose a new text clustering method using the Hidden Markov Model (HMM). Experiments on Reuters-21578 show that this approach provides an accurate clustering partition and achieves better performance than the K-means algorithm.
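The idea of parsing a K-means run into sequences can be pictured with a minimal sketch. It is an illustrative assumption, not the authors' implementation: the function names (kmeans_assignment_history, transition_counts), the toy two-dimensional "documents", and the choice to record each document's cluster label at every iteration are all hypothetical. The resulting label sequences and their transition counts are the kind of sequential data an HMM could later be fitted to.

```python
import random

def kmeans_assignment_history(points, k, iterations=10, seed=0):
    """Run plain K-means and record every point's cluster label at each iteration."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen points (an assumption of this sketch).
    centroids = [list(p) for p in rng.sample(points, k)]
    history = [[] for _ in points]  # one label sequence per point
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels.append(dists.index(min(dists)))
        for seq, label in zip(history, labels):
            seq.append(label)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return history

def transition_counts(history, k):
    """Count label-to-label transitions across all sequences (k x k matrix)."""
    counts = [[0] * k for _ in range(k)]
    for seq in history:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

# Toy "documents" as 2-D term-weight vectors (made-up data, not Reuters-21578).
docs = [(1.0, 0.1), (0.9, 0.2), (0.8, 0.0), (0.1, 1.0), (0.2, 0.9), (0.0, 0.8)]
history = kmeans_assignment_history(docs, k=2, iterations=5)
counts = transition_counts(history, k=2)
```

Each entry of history is one document's sequence of cluster labels over the iterations, and counts summarises how often a document stayed in or switched between clusters. Normalising each row of counts would give an empirical transition matrix, which is one plausible starting point for the state-transition view described above.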


Keywords: Hidden Markov Model; Cluster Process; Dynamic Time Warping; Vector Space Model; Document Cluster





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Yan Fu (1)
  • Dongqing Yang (1)
  • Shiwei Tang (2)
  • Tengjiao Wang (1)
  • Aiqiang Gao (1)
  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  2. National Laboratory on Machine Perception, Peking University, Beijing 100871, China
