Abstract
This paper aims to classify large unstructured documents into different topics without requiring huge computational resources or a priori knowledge. The concept of granularity is employed to extract contextual information from the documents by generating granules of words (GoWs) hierarchically. The proposed granularity-based word grouping (GBWG) algorithm groups the words at different layers in a computationally efficient way, using a co-occurrence measure between the words of different granules. The GBWG algorithm terminates when no new GoW is generated at any layer of the hierarchical structure. Multiple GoWs are thus obtained, each containing contextually related words, and together they represent different topics. However, the GoWs may share common words, which creates ambiguity in topic identification. The Louvain graph clustering algorithm is therefore employed to automatically identify topics containing unique words, using mutual information as an association measure between the words (nodes) of each GoW. A test document is classified into a particular topic based on the probability that its unique words belong to the different topics. The performance of the proposed method has been compared with other unsupervised, semi-supervised, and supervised topic modeling algorithms. Experiments show that the proposed method is comparable to or better than state-of-the-art topic modeling algorithms, a result that is further verified statistically with the Wilcoxon rank-sum test.
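The final classification step can be illustrated with a minimal sketch. The abstract does not give the exact probability formula, so the code below assumes a simple one: the score of a topic is the fraction of the document's unique words that appear in that topic's word set. The function name `classify`, the topic names, and the toy word sets are hypothetical and are not taken from the paper.

```python
def classify(document_tokens, topics):
    """Assign a document to the topic that covers most of its unique words.

    ASSUMPTION: the score of a topic is the share of the document's unique
    words found in that topic's word set. The paper may use a different
    probability estimate.
    """
    unique_words = set(document_tokens)
    if not unique_words:
        raise ValueError("document has no words")
    # Fraction of the document's unique words that belong to each topic.
    scores = {
        name: len(unique_words & words) / len(unique_words)
        for name, words in topics.items()
    }
    # Pick the topic with the highest score (first one wins on ties).
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical topics with disjoint (unique) word sets, as produced in
# principle by the graph clustering stage.
topics = {
    "medicine": {"patient", "dose", "clinical", "therapy"},
    "aviation": {"wing", "airflow", "lift", "drag"},
}
doc = "the patient received a low dose during therapy".split()
label, scores = classify(doc, topics)
print(label, scores)
```

Because the topic word sets are disjoint in this sketch, each word of the document contributes to at most one topic, which mirrors the role of the unique-word topics described above.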
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Ganguli, I., Sil, J. & Sengupta, N. Nonparametric method of topic identification using granularity concept and graph-based modeling. Neural Comput & Applic 35, 1055–1075 (2023). https://doi.org/10.1007/s00521-020-05662-4
DOI: https://doi.org/10.1007/s00521-020-05662-4