Nonparametric method of topic identification using granularity concept and graph-based modeling

Ganguli, Isha; Sil, Jaya; Sengupta, Nandita

doi:10.1007/s00521-020-05662-4

Nonparametric method of topic identification using granularity concept and graph-based modeling

S.I. : 2019 India Intl. Congress on Computational Intelligence
Published: 13 January 2021

Volume 35, pages 1055–1075, (2023)
Cite this article

Neural Computing and Applications Aims and scope Submit manuscript

358 Accesses
3 Citations
Explore all metrics

Abstract

This paper aims to classify the large unstructured documents into different topics without involving huge computational resources and a priori knowledge. The concept of granularity is employed here to extract contextual information from the documents by generating granules of words (GoWs), hierarchically. The proposed granularity-based word grouping (GBWG) algorithm in a computationally efficient way group the words at different layers by using co-occurrence measure between the words of different granules. The GBWG algorithm terminates when no new GoW is generated at any layer of the hierarchical structure. Thus multiple GoWs are obtained, each of which contains contextually related words, representing different topics. However, the GoWs may contain common words and creating ambiguity in topic identification. Louvain graph clustering algorithm has been employed to automatically identify the topics, containing unique words by using mutual information as an association measure between the words (nodes) of each GoW. A test document is classified into a particular topic based on the probability of its unique words belong to different topics. The performance of the proposed method has been compared with other unsupervised, semi-supervised, and supervised topic modeling algorithms. Experimentally, it has been shown that the proposed method is comparable or better than the state-of-the-art topic modeling algorithms which further statistically verified with the Wilcoxon Rank-sum Test.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Group topic model: organizing topics into groups

Article 10 September 2014

Topic Representation using Semantic-Based Patterns

LDA+: An Extended LDA Model for Topic Hierarchy and Discovery

References

Almeida H, Guedes D, Meira W, Zaki MJ (2011) Is there a best quality metric for graph clusters? In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 44–59
Bafna P, Shirwaikar S, Pramod D (2019) Task recommender system using semantic clustering to identify the right personnel. VINE J Inf Knowl Manag Syst 2:181–199
Google Scholar
Blagojević M, Micić Ž (2013) A web-based intelligent report e-learning system using data mining techniques. Comput Electr Eng 39(2):465–474
Article Google Scholar
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
MATH Google Scholar
Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008
Article MATH Google Scholar
Cai D, He X, Han J (2007) SRDA: an efficient algorithm for large-scale discriminant analysis. IEEE Trans Knowl Data Eng 20(1):1–12
Google Scholar
Chen S-Y, Hung Y-C, Hung Y-H, Chien-Hsun W (2016) Application of a recurrent wavelet fuzzy-neural network in the positioning control of a magnetic-bearing mechanism. Comput Electr Eng 54:147–158
Article Google Scholar
classic4 dataset. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Article Google Scholar
Dieng AB, Wang C, Gao J, Paisley JW (2016) Topicrnn: a recurrent neural network with long-range semantic dependency. CoRR. arXiv:1611.01702
Dörpinghaus J, Schaaf S, Jacobs M (2018) Soft document clustering using a novel graph covering approach. BioData Min 11(1):1–20
Article Google Scholar
Duan T, Lou Q, Srihari SN, Xie X (2019) Sequential embedding induced text clustering, a non-parametric bayesian approach. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 68–80
Duan T, Pinto JP, Xie X (2019) Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures. Bioinformatics 35(6):953–961
Article Google Scholar
Egghe L (2008) The measures precision, recall, fallout and miss as a function of the number of retrieved documents and their mutual interrelations. Inf Process Manag 44(2):856–876
Article Google Scholar
Evaluation of clustering (2017). https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Fang YC, Parthasarathy S, Schwartz F (2001) Using clustering to boost text classification. In: ICDM workshop on text mining (TextDM’01). Citeseer
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874
Article MathSciNet Google Scholar
Fei J, Rui T, Song X, Zhou Y, Zhang S (2018) More discriminative convolutional neural network with inter-class constraint for classification. Comput Electr Eng 68:484–489
Article Google Scholar
Feldman R, Sanger J (2006) Text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, New York
Book Google Scholar
Fernández J, Antón Vargas JA, Villuendas-Rey Y, Cabrera-Venegas JF, Chávez Y, Argüelles-Cruz AJ (2016) Clustering techniques for document classification. Res Comput Sci 118:115–125
Article Google Scholar
Gallagher RJ, Reing K, Kale D, Steeg GV (2017) Anchored correlation explanation: Topic modeling with minimal domain knowledge. Trans Assoc Comput Linguist 5:529–542
Article Google Scholar
Gomez JC, Moens M-F (2012) PCA document reconstruction for email classification. Comput Stat Data Anal 56(3):741–751
Article MathSciNet Google Scholar
Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of 23rd international conference on machine learning (ICML’06). ACM Press, pp 377–384
Hingmire S, Chougule S, Palshikar GK, Chakraborti S (2013) Document classification by topic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp 877–880
Hirsch L, Di Nuovo A (2017) Document clustering with evolved search queries. In: 2017 IEEE congress on evolutionary computation (CEC). IEEE, pp 1239–1246
Huang R, Guan Yu, Wang Z, Zhang J, Shi L (2012) Dirichlet process mixture model for document clustering with feature partition. IEEE Trans Knowl Data Eng 25(8):1748–1759
Article Google Scholar
Indurkhya N, Damerau FJ (2010) Handbook of natural language processing. Chapman and Hall/CRC, Boca Raton
Book Google Scholar
Jagarlamudi J, Daumé III H, Udupa R (2012) Incorporating lexical priors into topic models. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, EACL ’12, pp 204–213, Stroudsburg, PA, USA. Association for Computational Linguistics
Jain VK, Kumar S, Fernandes SL (2017) Extraction of emotions from multilingual text using intelligent text processing and computational linguistics. J Comput Sci 21:316–326
Article Google Scholar
Jan B, Farman H, Khan M, Imran M, Islam I, Ahmad A, Ali S, Jeon G (2017) Deep learning in big data analytics: a comparative study. Comput Electr Eng 12
Jelodar H, Wang Y, Yuan C, Feng X, Jiang X, Li Y, Zhao L (2019) Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimed Tools Appl 78(11):15169–15211
Article Google Scholar
Karaa WBA, Ashour AS, Sassi DB, Roy P, Kausar N, Dey N (2016) Medline text mining: an enhancement genetic algorithm based approach for document clustering. In Applications of intelligent optimization in biology and medicine. Springer, pp 267–287
Karypis MSG, Kumar V, Steinbach M (2000) A comparison of document clustering techniques. In: KDD workshop on text mining
Kim S-W, Gil J-M (2019) Research paper classification systems based on TF-IDF and LDA schemes. Hum Centric Comput Inf Sci 9(1):30
Article Google Scholar
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1746–1751. Association for Computational Linguistics
Kong J, Scott A, Goerg GM (2016) Improving semantic topic clustering for search queries with word co-occurrence and bigraph co-clustering. Google Inc, Mountain View
Google Scholar
Korshunova I, Xiong H, Fedoryszak M, Theis L (2019) Discriminative topic modeling with logistic LDA. In: Advances in neural information processing systems, pp 6770–6780
Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Twenty-ninth AAAI conference on artificial intelligence
Liu L, Liu K, Cong Z, Zhao J, Ji Y, He J (2018) Long length document classification by local convolutional feature aggregation. Algorithms 11(8):109
Article MathSciNet MATH Google Scholar
Liu Y, Niculescu-Mizil A, Gryc W (2009) Topic-link LDA: joint models of topic and author community. In: Proceedings of the 26th annual international conference on machine learning, ICML ’09. ACM, New York, NY, USA, pp 665–672
Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the Dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning, pp 545–552
Meng Y, Huang J, Wang G, Wang Z, Zhang C, Zhang Y, Han J (2020) Discriminative topic mining via category-name guided text embedding. In: Proceedings of the web conference 2020, pp 2121–2132
Meng Y, Zhang Y, Huang J, Zhang Y, Zhang C, Han J (2020) Hierarchical topic mining via joint spherical tree and text embedding. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1908–1917
Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E (2015) Deep learning applications and challenges in big data analytics. J Big Data 2(1):1
Article Google Scholar
Neal RM (2000) Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat 9(2):249–265
MathSciNet Google Scholar
Pasquali AR (2016) Automatic coherence evaluation applied to topic models
Pavlopoulos GA, Promponas VJ, Ouzounis CA, Iliopoulos I (2014) Biological information extraction and co-occurrence analysis. In: Biomedical literature mining, pp 77–92. Springer
Petz G, Karpowicz M, Fürschuß H, Auinger A, Stříteský V, Holzinger A (2013) Opinion mining on the web 2.0—characteristics of user generated content and their impacts. In: Holzinger A, Pasi G (eds) Human-computer interaction and knowledge discovery in complex, unstructured, big data. Springer, Berlin, pp 35–46
Chapter Google Scholar
Popel M, Mareček D (2010) Perplexity of n-gram and dependency language models. In: Sojka P, Horák A, Kopeček I, Pala K (eds) Text, speech and dialogue. Springer, Berlin, pp 173–180
Chapter Google Scholar
Porteous I, Newman D, Ihler A, Asuncion A, Smyth P, Welling M (2008) Fast collapsed gibbs sampling for latent dirichlet allocation. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08. ACM, New York, USA, pp 569–577
Power R, Chen J, Karthik T, Subramanian L (2010) Document classification for focused topics. In: 2010 AAAI spring symposium series
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 conference on empirical methods in natural language processing: volume 1, EMNLP ’09. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 248–256
Rangrej A, Kulkarni S, Tendulkar AV (2011) Comparative study of clustering techniques for short text documents. In: Proceedings of the 20th international conference companion on World wide web, pp 111–112
Rapečka A, Dzemyda G (2015) A new recommendation model for the user clustering-based recommendation system. Inf Technol Control 44(1):54–63
Google Scholar
Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on Web search and data mining, pp 399–408
Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64
Article MATH Google Scholar
Siivola V, Pellom BL (2005) Growing an n-gram language model. In: Proceedings of 9th European conference on speech communication and technology, pp 1309–1312
Solka JL et al (2008) Text data mining: theory and methods. Stat Surv 2:94–112
Article MathSciNet MATH Google Scholar
Sontag D, Roy D (2011) Complexity of inference in latent dirichlet allocation. In: Advances in neural information processing systems, pp 1008–1016
Stanchev L (2016) Semantic document clustering using a similarity graph. In: 2016 IEEE tenth international conference on semantic computing (ICSC). IEEE, pp 1–8
Stevens K, Kegelmeyer P, Andrzejewski D, Buttler D (2012) Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp 952–961
Sun X (2014) Textual document clustering using topic models. In: 2014 10th International conference on semantics, knowledge and grids. IEEE, pp 1–4
Suo Q, Ma F, Canino G, Gao J, Zhang A, Veltri P, Agostino G (2017) A multi-task framework for monitoring health conditions via attention-based recurrent neural networks. In: AMIA annual symposium proceedings, vol 2017, p 1665. American Medical Informatics Association
Tang P, Wang H (2017) Richer feature for image classification with super and sub kernels based on deep convolutional neural network. Comput Electr Eng 62:499–510
Article Google Scholar
Theodosiou T, Darzentas N, Angelis L, Ouzounis CA (2008) Pured-MCL: a graph-based pubmed document clustering methodology. Bioinformatics 24(17):1935–1941
Article Google Scholar
Tian F, Gao B, He D, Liu T-Y (2016) Sentence level recurrent topic model: letting topics speak for themselves. arXiv preprint arXiv:1604.02038
Tong Z, Zhang H (2016) A text mining research based on LDA topic modelling. In: Proceedings of the sixth international conference on computer science, engineering and information technology (CCSEIT), pp 21–22
Teh YW, Jordan M, Beal MJ, Blei DM (2006) Hierarchical dirichlet processes. J Am Stat Assoc 101:1566–1581
Article MathSciNet MATH Google Scholar
Wilcoxon F, Katti SK, Wilcox RA (1970) Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Sel Tables Math Stat 1:171–259
MATH Google Scholar
Wu HC, Luk RWP, Wong KF, Kwok KL (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13:1–13:37
Article Google Scholar
Xie P, Xing EP (2013) Integrating document clustering and topic modeling. arXiv preprint arXiv:1309.6874
Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1480–1489
Yin J, Wang J (2016) A model-based approach for text clustering with outlier detection. In: 2016 IEEE 32nd international conference on data engineering (ICDE). IEEE, pp 625–636
Yu G, Huang R, Wang Z (2010) Document clustering via dirichlet process mixture model with feature selection. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 763–772

Download references

Author information

Authors and Affiliations

Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah, India
Isha Ganguli & Jaya Sil
Department of Information Technology, University College of Bahrain, Janabiyah, Bahrain
Nandita Sengupta

Authors

Isha Ganguli
View author publications
You can also search for this author in PubMed Google Scholar
Jaya Sil
View author publications
You can also search for this author in PubMed Google Scholar
Nandita Sengupta
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Isha Ganguli.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ganguli, I., Sil, J. & Sengupta, N. Nonparametric method of topic identification using granularity concept and graph-based modeling. Neural Comput & Applic 35, 1055–1075 (2023). https://doi.org/10.1007/s00521-020-05662-4

Download citation

Received: 16 June 2020
Accepted: 27 December 2020
Published: 13 January 2021
Issue Date: January 2023
DOI: https://doi.org/10.1007/s00521-020-05662-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Nonparametric method of topic identification using granularity concept and graph-based modeling

Abstract

Access this article

Similar content being viewed by others

Group topic model: organizing topics into groups

Topic Representation using Semantic-Based Patterns

LDA+: An Extended LDA Model for Topic Hierarchy and Discovery

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Nonparametric method of topic identification using granularity concept and graph-based modeling

Abstract

Access this article

Similar content being viewed by others

Group topic model: organizing topics into groups

Topic Representation using Semantic-Based Patterns

LDA+: An Extended LDA Model for Topic Hierarchy and Discovery

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation