Deep text clustering using stacked AutoEncoder

Multimedia Tools and Applications

Abstract

Text data is a form of unstructured information that is easy for humans to process but hard for computers to understand. Text mining techniques, which effectively discover meaningful information from text, have received a great deal of attention in recent years. The aim of this study is to evaluate and analyze the comments and suggestions collected by the Barez Iran Company; the Barez dataset is unlabeled, and extracting useful information from large unlabeled textual data manually is very difficult and time-consuming. Therefore, in this paper we analyze suggestions written in Persian using BERTopic modeling for cluster analysis of the dataset. In BERTopic, each document belongs to a topic with a probability distribution. Seven latent topics are found, covering a broad range of issues such as installation, manufacture, correction, and device. We then propose a novel deep text clustering method, based on a hybrid of a stacked autoencoder and k-means clustering, to organize text documents into meaningful groups and mine information from the Barez data in an unsupervised manner. Our clustering pipeline has three main steps: 1) text representation with ParsBERT, a pre-trained BERT model for Persian language understanding; 2) text feature extraction with a new stacked autoencoder architecture that reduces the dimensionality of the data and provides robust features for clustering; 3) clustering of the data with k-means. We employ the Barez dataset to verify the effectiveness of our approach; the Silhouette Score is used to evaluate the resulting clusters, with a best value of 0.60 for a 3-cluster grouping. Experimental evaluations demonstrate that the proposed algorithm clearly outperforms other clustering methods.
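
To make the three-step pipeline concrete, the sketch below illustrates one possible implementation; it is not the authors' code. The ParsBERT checkpoint name, the mean-pooling of token embeddings, the autoencoder layer sizes, and the toy documents are all assumptions made for illustration.

# Illustrative sketch of the pipeline in the abstract: ParsBERT embeddings,
# a stacked autoencoder for dimensionality reduction, and k-means clustering
# evaluated with the Silhouette Score. Checkpoint name, layer sizes, and the
# toy documents are assumptions, not details taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModel
from tensorflow import keras
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 1: text representation with a publicly available ParsBERT model (assumed checkpoint).
name = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

def embed(texts):
    # Mean-pool the last hidden states into one 768-d vector per document.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state        # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

docs = [  # toy stand-ins for the Persian suggestions in the Barez dataset
    "Please improve the installation manual.",
    "The device overheats during long runs.",
    "Manufacturing tolerances should be tightened.",
    "Correction of the labelling process is needed.",
    "Installation takes too long on site.",
    "The device firmware needs an update.",
]
X = embed(docs)

# Step 2: stacked autoencoder that compresses the 768-d BERT embeddings
# (layer sizes are illustrative only).
encoder = keras.Sequential([
    keras.layers.Input(shape=(768,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="relu"),        # low-dimensional code
])
decoder = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(768, activation="linear"),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, batch_size=8, verbose=0)
Z = encoder.predict(X, verbose=0)                     # compressed features for clustering

# Step 3: k-means on the compressed features, scored with the Silhouette coefficient.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print("silhouette:", silhouette_score(Z, labels))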




Author information


Corresponding author

Correspondence to Soodeh Hosseini.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hosseini, S., Varzaneh, Z.A. Deep text clustering using stacked AutoEncoder. Multimed Tools Appl 81, 10861–10881 (2022). https://doi.org/10.1007/s11042-022-12155-0

