Non-negative Matrix Factorization Based Text Mining: Feature Extraction and Classification

  • P. C. Barman
  • Nadeem Iqbal
  • Soo-Young Lee
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4233)


Unlabeled document and text collections keep growing, and mining such data sets is a challenging task. Using the simple word-document frequency matrix as the feature space makes the mining process more complex: text documents are typically represented as sparse vectors with a few thousand dimensions and a sparsity of about 95 to 99%, which significantly affects both the efficiency and the results of mining. In this paper, we propose a two-stage Non-negative Matrix Factorization (NMF). In the first stage, we extract uncorrelated basis probabilistic document feature vectors, reducing the dimension of the word-document frequency feature vectors from a few thousand to a few hundred; the second stage performs clustering or classification. With the proposed approach, we observed clustering or classification accuracy of more than 98.5%. Dimension reduction and classification performance were evaluated on the Classic3 dataset.
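
The two-stage idea can be illustrated with a minimal sketch. The abstract does not specify the update rules, the ranks, or how the second stage assigns labels, so everything below is an assumption made for illustration: it uses the standard multiplicative updates for the Frobenius-norm objective, a made-up 9-word by 6-document count matrix `V`, hypothetical ranks `k1 = 4` and `k2 = 3`, and an arg-max rule for cluster assignment. It is not the authors' implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of v - w*h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix v as w (m x k) times h (k x n).

    Uses multiplicative updates for the Frobenius objective; this choice
    is an assumption, since the abstract does not name the update rule.
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Strictly positive random initialisation keeps all entries non-negative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h

# Made-up word-document count matrix: rows are words, columns are documents.
V = [
    [3, 2, 0, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 1],
    [0, 0, 4, 3, 0, 0],
    [0, 0, 2, 3, 0, 0],
    [0, 1, 3, 2, 0, 0],
    [0, 0, 0, 0, 3, 2],
    [0, 0, 0, 0, 2, 4],
    [1, 0, 0, 0, 2, 3],
]

# Stage 1 (assumed rank k1 = 4): columns of H1 are reduced document features.
W1, H1 = nmf(V, 4)

# Stage 2 (assumed rank k2 = 3): factor the reduced features again and
# label each document by its largest coefficient in H2.
W2, H2 = nmf(H1, 3)
labels = [max(range(len(H2)), key=lambda c: H2[c][j]) for j in range(len(H2[0]))]

print("stage-1 reconstruction error:", round(frobenius_error(V, W1, H1), 4))
print("cluster label per document:", labels)
```

On a real corpus the first-stage rank would be in the hundreds rather than 4, and a sparse matrix library would replace the list-based helpers used here for self-containment.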


Keywords: Feature Vector, Basis Vector, Text Data, Nonnegative Matrix Factorization





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • P. C. Barman
    • 1
  • Nadeem Iqbal
    • 1
  • Soo-Young Lee
    • 1
  1. Department of BioSystems, Korea Advanced Institute of Science and Technology, Brain Science Research Center and Computational NeuroSystems Lab, Daejeon, Republic of Korea
