Abstract
Keyword extraction is a fundamental problem in natural language processing. Many graph-based models in the literature address it by constructing a graph of word co-occurrences from the input text. These models rely on graph-based features such as Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, Degree, PageRank, Clustering Coefficient, Eccentricity, Structural Hole, and Coreness. In this paper, we propose a novel graph-based token classification model built on these commonly used features. We used extra trees, lasso, genetic algorithm, and wrapper methods to select the most informative subset of features. The token classification module of the model uses the Random Forest ensemble classification algorithm. Performance was evaluated on the commonly used Inspec, Semeval-2017, and 500N-KPCrowd datasets, as well as on the newly collected TRDizinEn and DergiParkEn datasets. The model achieved its highest \({F_1}\)-scores of 0.641, 0.694, 0.707, and 0.766 on Semeval-2017, 500N-KPCrowd, DergiParkEn, and TRDizinEn, respectively.
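To make the feature side of this pipeline concrete, the following is a minimal sketch of how a word co-occurrence graph and a few of the listed features (Degree, Clustering Coefficient, Coreness) might be computed per token. It is an illustration only, not the implementation used in the paper: the function names, the window size of 2, and the choice of these three features are assumptions made for the example, and the feature selection and Random Forest stages are not shown.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Undirected graph linking distinct words that occur within `window` tokens.

    The window size is an assumption for this sketch, not a value from the paper.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        graph[word]  # make sure every word is a node, even if isolated
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in graph[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def coreness(graph):
    """Core number of each node, by repeatedly peeling the lowest-degree node."""
    degrees = {node: len(nbrs) for node, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degrees.get)
        k = max(k, degrees[node])
        core[node] = k
        remaining.remove(node)
        for nbr in graph[node]:
            if nbr in remaining:
                degrees[nbr] -= 1
    return core


def token_features(tokens, window=2):
    """Map each distinct word to a small dictionary of graph-based features."""
    graph = build_cooccurrence_graph(tokens, window)
    core = coreness(graph)
    return {
        word: {
            "degree": len(graph[word]),
            "clustering": clustering_coefficient(graph, word),
            "coreness": core[word],
        }
        for word in graph
    }


# Toy input; a real pipeline would tokenize and normalize a full document.
tokens = ["keyword", "extraction", "graph", "keyword", "graph", "model"]
features = token_features(tokens)
for word, values in features.items():
    print(word, values)
```

In a full system along the lines described above, each token's feature vector would be passed, after feature selection, to a classifier that labels the token as keyword or non-keyword.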
Acknowledgements
We thank TUBITAK Ulakbim for providing the TRDizinEn dataset for this study. The DergiParkEn dataset is publicly available at https://github.com/humakilicunlu/DergiParkEn.
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
About this article
Cite this article
Kılıç, H., Çetin, A. A Novel Graph-Based Ensemble Token Classification Model for Keyword Extraction. Arab J Sci Eng 48, 10673–10680 (2023). https://doi.org/10.1007/s13369-023-07721-z
DOI: https://doi.org/10.1007/s13369-023-07721-z