Abstract
Text clustering is one of the central tasks in modern data mining. Text data, however, must first be tokenized, which typically yields a very large and highly sparse term-document matrix that is difficult to process with conventional machine learning algorithms. Methods such as latent semantic analysis help mitigate this issue, but they are not entirely stable in practice. We therefore propose a new feature agglomeration method based on nonnegative matrix factorization: the terms are separated into groups, and each group's term vectors are agglomerated into a single new feature vector. Together, these feature vectors form a new feature space that is far better suited to clustering. In addition, we propose a new deterministic initialization for spherical K-means that proves particularly useful for this type of data. To evaluate the proposed method, we compare it against recent work in the field as well as the most widely practiced methods. Our experiments show that the proposed method either significantly improves clustering performance or matches that of competing methods while yielding more stable results.
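The agglomeration idea can be illustrated with a minimal sketch. The grouping rule used here, assigning each term to its dominant NMF component and summing each group's term columns, is our assumption for illustration; the paper's exact procedure may differ in detail.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy term-document data: 20 documents x 100 terms (nonnegative, as TF-IDF would be)
rng = np.random.default_rng(0)
X = rng.random((20, 100))
k = 5  # number of term groups (NMF components)

# Factorize the term side: rows of W give each term's weight per component
nmf = NMF(n_components=k, init="nndsvd", max_iter=500)
W = nmf.fit_transform(X.T)          # shape (100, k): terms x components
groups = W.argmax(axis=1)           # assign each term to its dominant component

# Agglomerate: new feature j is the sum of the term columns assigned to group j
X_new = np.stack([X[:, groups == j].sum(axis=1) for j in range(k)], axis=1)
print(X_new.shape)                  # (20, 5): a much smaller, denser feature space
```

The resulting matrix is dense and low-dimensional, which is the property that makes the downstream clustering easier.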
Acknowledgements
We would like to thank the anonymous reviewers for their helpful feedback and comments.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix: Parameter tuning
In this section, we present the parameters used in our experiments. K-means and spherical K-means required no parameters other than the number of clusters, which in all experiments was set to the number of classes in the corresponding supervised problem. The parameters for the remaining methods are presented in the subsections below.
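The spherical K-means setup can be sketched as follows. Here it is approximated by L2-normalizing the document vectors and running standard K-means on the unit sphere; true spherical K-means also renormalizes the centroids at each step, so this is a simplification, and the data and class count are toy values.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Toy data: 60 documents with 8 features; the labeled problem has 3 classes
rng = np.random.default_rng(1)
X = rng.random((60, 8))
n_classes = 3

# Project documents onto the unit sphere, so Euclidean K-means
# behaves like clustering by cosine similarity
X_unit = normalize(X, norm="l2")
km = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(X_unit)
print(len(set(km.labels_)))  # 3: number of clusters equals number of classes
```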
1.1 GAKM
The following parameters were used for all datasets; we observed that they fit each dataset well, and varying them did not yield significant improvement in the results (Tables 10, 11).
1.2 SCPSO
The following parameters were likewise used for all datasets; again, varying them did not yield significant improvement in the results.
1.3 LSAKM
See Table 12.
1.4 NMF-FR
For the proposed method, the number of components used in NMF and LSA, as well as whether to apply LSA at all, were determined mainly by trial and error, just as for LSAKM. The parameters are presented below (Table 13).
Note that setting the number of LSA components to 1 is equivalent to not applying LSA after NMF, as mentioned in Sect. 4. The number of neighbors in our initialization method was set to 5 for all datasets.
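The "LSA components = 1 means skip LSA" convention can be sketched as a small pipeline. This is a simplified stand-in (NMF is applied directly to the document matrix rather than through the full term-agglomeration step), and the function and parameter names are ours, not the paper's.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

def nmf_fr_features(X, k_nmf, k_lsa):
    """Reduce documents with NMF, then optionally refine with LSA."""
    W = NMF(n_components=k_nmf, init="nndsvd", max_iter=500).fit_transform(X)
    if k_lsa <= 1:            # an LSA component count of 1 means: skip LSA
        return W
    return TruncatedSVD(n_components=k_lsa).fit_transform(W)

rng = np.random.default_rng(2)
X = rng.random((30, 50))
print(nmf_fr_features(X, 10, 1).shape)   # (30, 10): LSA skipped
print(nmf_fr_features(X, 10, 4).shape)   # (30, 4): LSA applied after NMF
```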
Parameter stress test
We also present the results of a stress test over the two parameters (the numbers of NMF and LSA components) across all datasets used in the experiments (see Figs. 11, 12, 13, 14). Clustering accuracy is the comparison metric; the three-dimensional plots (NMF K, LSA K, accuracy) for all datasets are shown below.
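The stress test amounts to a grid sweep over the two parameters, recording accuracy for each pair. The sketch below uses a placeholder scoring function in place of the full pipeline (feature construction, spherical K-means, and evaluation against the labels); the grid values and `accuracy_for` are illustrative, not the paper's.

```python
import itertools

def accuracy_for(k_nmf, k_lsa):
    """Placeholder: stand-in for running the full clustering pipeline."""
    return 0.5 + 0.01 * k_nmf - 0.005 * k_lsa

# Sweep every (NMF K, LSA K) pair and keep the accuracy for each
grid = list(itertools.product([10, 20, 40], [1, 5, 10]))
results = {(kn, kl): accuracy_for(kn, kl) for kn, kl in grid}

best = max(results, key=results.get)
print(best)  # (40, 1): the parameter pair with the highest placeholder score
```

In the paper, the recorded accuracies over such a grid are what the three-dimensional plots visualize.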
About this article
Cite this article
Hassani, A., Iranmanesh, A. & Mansouri, N. Text mining using nonnegative matrix factorization and latent semantic analysis. Neural Comput & Applic 33, 13745–13766 (2021). https://doi.org/10.1007/s00521-021-06014-6