A Novel Graph-Based Ensemble Token Classification Model for Keyword Extraction

  • Research Article-Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

Keyword extraction is a fundamental problem in natural language processing. Many graph-based models in the literature address it by constructing a graph of word co-occurrences from the input text and computing graph-based features such as Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, Degree, PageRank, Clustering Coefficient, Eccentricity, Structural Hole, and Coreness. In this paper, we propose a novel graph-based token classification model built on these commonly used features. We used extra trees, lasso, genetic algorithm, and wrapper methods to select the most informative subset of features. The token classification module of the model uses the Random Forest ensemble classification algorithm. Performance was evaluated on the commonly used Inspec, Semeval-2017, and 500N-KPCrowd datasets, as well as on the newly collected TRDizinEn and DergiParkEn datasets. The model achieved its highest \({F_1}\)-scores of 0.641, 0.694, 0.707, and 0.766 on Semeval-2017, 500N-KPCrowd, DergiParkEn, and TRDizinEn, respectively.
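To make the idea of graph-based token features concrete, the following is a minimal sketch, not the authors' implementation. It builds a word co-occurrence graph with a sliding window and derives two of the features named in the abstract (Degree and PageRank) for each token. The tokenization rule, the window size, the damping factor, and all function names (`build_cooccurrence_graph`, `degree`, `pagerank`, `token_features`) are assumptions chosen for illustration. Feature selection and the Random Forest classifier are omitted.

```python
import re
from collections import defaultdict


def build_cooccurrence_graph(text, window=2):
    """Undirected graph of words (hypothetical helper).

    Two distinct words are linked when they occur within `window`
    tokens of each other; window=2 links adjacent words only.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        graph[word]  # touch the key so every word becomes a node
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def degree(graph):
    """Number of distinct neighbours of each word."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def pagerank(graph, damping=0.85, iterations=50):
    """PageRank by plain power iteration (assumed parameters)."""
    n = len(graph)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Rank held by isolated nodes is redistributed uniformly.
        dangling = sum(rank[v] for v, nbrs in graph.items() if not nbrs)
        new_rank = {}
        for node, nbrs in graph.items():
            incoming = sum(rank[v] / len(graph[v]) for v in nbrs)
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


def token_features(text, window=2):
    """One feature dictionary per distinct word in the text."""
    graph = build_cooccurrence_graph(text, window)
    deg, pr = degree(graph), pagerank(graph)
    return {node: {"degree": deg[node], "pagerank": pr[node]} for node in graph}


features = token_features(
    "graph models build a word graph and graph features rank each word"
)
```

In a token classification setting, each such per-word feature vector would be paired with a keyword/non-keyword label and passed to a classifier. The remaining features listed in the abstract would be added as further columns in the same way.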

Acknowledgements

We thank TUBITAK Ulakbim for providing the TRDizinEn dataset for this study. The DergiParkEn dataset is publicly available at https://github.com/humakilicunlu/DergiParkEn.

Author information

Corresponding author

Correspondence to Hüma Kılıç.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kılıç, H., Çetin, A. A Novel Graph-Based Ensemble Token Classification Model for Keyword Extraction. Arab J Sci Eng 48, 10673–10680 (2023). https://doi.org/10.1007/s13369-023-07721-z

  • DOI: https://doi.org/10.1007/s13369-023-07721-z
