Deep text clustering using stacked AutoEncoder

Multimedia Tools and Applications

Abstract

Text data is a form of unstructured information that is easy for humans to process but hard for computers to understand. Text mining techniques, which effectively discover meaningful information from text, have received a great deal of attention in recent years. The aim of this study is to evaluate and analyze the comments and suggestions collected by the Barez Iran Company; the Barez dataset is unlabeled, and extracting useful information from large unlabeled textual data manually is very difficult and time-consuming. Therefore, in this paper we analyze suggestions written in Persian using BERTopic modeling for cluster analysis of the dataset. In BERTopic, each document belongs to a topic with a probability distribution. Seven latent topics are found, covering a broad range of issues such as installation, manufacture, correction, and device. We then propose a novel deep text clustering method, based on a hybrid of a stacked autoencoder and k-means clustering, to organize text documents into meaningful groups and mine information from the Barez data in an unsupervised manner. Our clustering pipeline has three main steps: 1) text representation with ParsBERT, a pre-trained BERT model for Persian language understanding; 2) text feature extraction with a new stacked autoencoder architecture that reduces the dimensionality of the data and provides robust features for clustering; 3) clustering of the data with k-means. We employ the Barez dataset to verify the effectiveness of our approach; the Silhouette Score is used to evaluate the resulting clusters, with a best value of 0.60 for a 3-cluster grouping. Experimental evaluations demonstrate that the proposed algorithm clearly outperforms other clustering methods.
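
To make the three-step pipeline concrete, the sketch below illustrates one possible implementation; it is not the authors' code. The ParsBERT checkpoint name, the mean-pooling of token embeddings, the autoencoder layer sizes, and the toy documents are all assumptions made for illustration.

# Illustrative sketch of the pipeline in the abstract: ParsBERT embeddings,
# a stacked autoencoder for dimensionality reduction, and k-means clustering
# evaluated with the Silhouette Score. Checkpoint name, layer sizes, and the
# toy documents are assumptions, not details taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModel
from tensorflow import keras
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 1: text representation with a publicly available ParsBERT model (assumed checkpoint).
name = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

def embed(texts):
    # Mean-pool the last hidden states into one 768-d vector per document.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state        # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

docs = [  # toy stand-ins for the Persian suggestions in the Barez dataset
    "Please improve the installation manual.",
    "The device overheats during long runs.",
    "Manufacturing tolerances should be tightened.",
    "Correction of the labelling process is needed.",
    "Installation takes too long on site.",
    "The device firmware needs an update.",
]
X = embed(docs)

# Step 2: stacked autoencoder that compresses the 768-d BERT embeddings
# (layer sizes are illustrative only).
encoder = keras.Sequential([
    keras.layers.Input(shape=(768,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="relu"),        # low-dimensional code
])
decoder = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(768, activation="linear"),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, batch_size=8, verbose=0)
Z = encoder.predict(X, verbose=0)                     # compressed features for clustering

# Step 3: k-means on the compressed features, scored with the Silhouette coefficient.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print("silhouette:", silhouette_score(Z, labels))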




Author information


Corresponding author

Correspondence to Soodeh Hosseini.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hosseini, S., Varzaneh, Z.A. Deep text clustering using stacked AutoEncoder. Multimed Tools Appl 81, 10861–10881 (2022). https://doi.org/10.1007/s11042-022-12155-0

