
Nonparametric method of topic identification using granularity concept and graph-based modeling

  • S.I.: 2019 India Intl. Congress on Computational Intelligence
  • Published in: Neural Computing and Applications

Abstract

This paper aims to classify large unstructured documents into different topics without requiring huge computational resources or a priori knowledge. The concept of granularity is employed to extract contextual information from the documents by hierarchically generating granules of words (GoWs). The proposed granularity-based word grouping (GBWG) algorithm groups words at different layers in a computationally efficient way, using a co-occurrence measure between the words of different granules. The GBWG algorithm terminates when no new GoW is generated at any layer of the hierarchical structure. Multiple GoWs are thus obtained, each containing contextually related words and representing a different topic. However, the GoWs may share common words, creating ambiguity in topic identification. The Louvain graph clustering algorithm is therefore employed to automatically identify topics containing unique words, using mutual information as the association measure between the words (nodes) of each GoW. A test document is classified into a particular topic based on the probability of its unique words belonging to the different topics. The performance of the proposed method has been compared with other unsupervised, semi-supervised, and supervised topic modeling algorithms. Experimentally, the proposed method is shown to be comparable to or better than state-of-the-art topic modeling algorithms, a result further verified statistically with the Wilcoxon rank-sum test.
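The pipeline described above (co-occurrence-based word grouping, merging of the resulting granules into topics, and classification of a test document by its unique-word overlap) can be sketched in miniature. The snippet below is an illustrative toy, not the authors' GBWG implementation: the `threshold` parameter, the greedy merge, and the overlap score are simplified stand-ins for the paper's co-occurrence measure, hierarchical layering, and Louvain/mutual-information clustering step.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(docs, threshold=2):
    """Count document-level co-occurrence of word pairs and keep the
    pairs seen at least `threshold` times (toy layer-1 granules)."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc.lower().split())), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= threshold}

def merge_granules(pairs):
    """Greedily merge overlapping pairs into larger word granules,
    a crude stand-in for hierarchical merging plus Louvain clustering."""
    granules = []
    for pair in pairs:
        for g in granules:
            if g & set(pair):      # shares a word: absorb into this granule
                g |= set(pair)
                break
        else:
            granules.append(set(pair))
    return granules

def classify(doc, topics):
    """Assign the document to the topic whose words overlap most with
    the document, normalised by topic size."""
    words = set(doc.lower().split())
    return max(topics, key=lambda t: len(words & t) / len(t))

docs = [
    "goal striker football match",
    "striker football goal win",
    "court judge trial verdict",
    "judge trial court law",
]
granules = merge_granules(cooccurrence_pairs(docs))
print(granules)  # two topic granules: a football group and a court group
print(classify("the judge read the trial verdict", granules))
```

In the actual method, the greedy merge is replaced by hierarchical GoW generation followed by Louvain clustering with mutual information as the edge weight, which also resolves words shared between granules.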



Author information


Corresponding author

Correspondence to Isha Ganguli.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Ganguli, I., Sil, J. & Sengupta, N. Nonparametric method of topic identification using granularity concept and graph-based modeling. Neural Comput & Applic 35, 1055–1075 (2023). https://doi.org/10.1007/s00521-020-05662-4

