Keyword extraction as sequence labeling with classification algorithms

  • Original Article
Neural Computing and Applications

Abstract

Keyword extraction is one of the main problems in clustering and linking textual content. Several machine learning approaches have been proposed in the literature for keyword and keyphrase extraction. However, state-of-the-art results still fall short of expectations. In this paper, we propose a novel hybrid keyword extraction model, HybridKEM. The proposed model treats keyword extraction as a sequence labeling task. Naive Bayes (NB), Polynomial Regression (PR), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Random Forest (RF) classification algorithms were trained separately in the Token Classification module of the model. Token classification was performed using text, graphic, embedding, and set features. The performance of the model was evaluated on the Inspec, SemEval-2017, and 500N-KPCrowd datasets, which are widely used in the literature, and on two newly collected datasets, TRDizinEn and DergiParkEn. The model achieved an average \(F_1\)-score of 0.664 across all datasets. The highest \(F_1\)-score (0.74) was obtained on the TRDizinEn dataset.
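
To make the sequence labeling framing concrete, the following is a minimal sketch of how a tokenized text and its gold keyphrases might be turned into per-token labels, and how an \(F_1\)-score could then be computed over those labels. The BIO tagging scheme, the token-level scoring, and the helper names `bio_labels` and `f1_score` are assumptions made for illustration; the abstract does not state which labeling scheme or scoring granularity HybridKEM uses.

```python
def bio_labels(tokens, keyphrases):
    """Tag each token as B (begins a keyphrase), I (inside one) or O (outside).

    Assumption: a simple BIO scheme with case-insensitive exact matching.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase in keyphrases:
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        # Slide a window over the tokens and mark every exact match.
        for start in range(len(tokens) - n + 1):
            if lowered[start:start + n] == words:
                labels[start] = "B"
                for k in range(start + 1, start + n):
                    labels[k] = "I"
    return labels


def f1_score(gold, predicted):
    """Token-level F1: any non-O label counts as a positive prediction."""
    pairs = list(zip(gold, predicted))
    tp = sum(g != "O" and p != "O" for g, p in pairs)
    fp = sum(g == "O" and p != "O" for g, p in pairs)
    fn = sum(g != "O" and p == "O" for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


tokens = "Keyword extraction is treated as a sequence labeling task".split()
gold = bio_labels(tokens, ["keyword extraction", "sequence labeling"])

# Hypothetical classifier output that misses the token "labeling".
predicted = ["B", "I", "O", "O", "O", "O", "B", "O", "O"]

print(gold)
print(round(f1_score(gold, predicted), 3))
```

In this toy case the prediction has perfect precision but recovers only three of the four keyword tokens, so the score is 6/7. In a full pipeline, a classifier such as NB, SVM, MLP, or RF would produce the `predicted` labels from per-token feature vectors.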

Acknowledgements

We thank TUBITAK Ulakbim for providing the TRDizinEn dataset for this study. We make the DergiParkEn dataset publicly available at http://github.com/humakilicunlu/DergiParkEn.

Author information

Corresponding author

Correspondence to Hüma Kılıç Ünlü.

Ethics declarations

Conflict of Interest

No potential conflict of interest was reported by the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kılıç Ünlü, H., Çetin, A. Keyword extraction as sequence labeling with classification algorithms. Neural Comput & Applic 35, 3413–3422 (2023). https://doi.org/10.1007/s00521-022-07906-x

  • DOI: https://doi.org/10.1007/s00521-022-07906-x
