Automatic keyphrase extraction using word embeddings

Abstract

Unsupervised random-walk keyphrase extraction models rely mainly on the global structural information of the word graph, in which nodes represent candidate words and edges capture the co-occurrence information between candidate words. However, using word embedding methods to integrate multiple kinds of useful information into the random-walk model, so that keyphrases can be extracted more accurately, remains relatively unexplored. In this paper, we propose a random-walk-based ranking method that extracts keyphrases from text documents using word embeddings. Specifically, we first design a heterogeneous text graph embedding model that integrates the local context information of the word graph (i.e., the local word collocation patterns) with some crucial features of the candidate words and edges of the word graph. Then, a novel random-walk-based ranking model is designed to score candidate words by leveraging the learned word embeddings. Finally, a new and generic similarity-based phrase scoring model using word embeddings is proposed to score phrases, and the top-scoring phrases are selected as keyphrases. Experimental results show that the proposed method consistently outperforms eight state-of-the-art unsupervised methods on three real datasets for keyphrase extraction.
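As a rough illustration of the general pipeline the abstract outlines (word graph, random-walk word scoring, phrase scoring), the following minimal sketch may help. It is not the method proposed in the paper: the heterogeneous text graph embedding model is replaced by toy hand-written vectors, the edge weighting (co-occurrence count times cosine similarity) is an assumption made only for this example, phrases are scored by simply summing word scores, and all function names (build_graph, rank_words, score_phrase) are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(tokens, embeddings, window=2):
    """Undirected word graph; weight = co-occurrence count * embedding similarity."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                counts[frozenset((w, tokens[j]))] += 1.0
    graph = defaultdict(dict)
    for pair, count in counts.items():
        a, b = tuple(pair)
        # Assumption: clip negative similarities so edge weights stay non-negative.
        weight = count * max(cosine(embeddings[a], embeddings[b]), 0.0)
        if weight > 0:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Weighted PageRank-style random walk by power iteration."""
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        scores = {
            n: (1 - damping) / len(nodes)
            + damping * sum(scores[m] * w / out_weight[m] for m, w in graph[n].items())
            for n in nodes
        }
    return scores


def score_phrase(phrase, scores):
    """Assumption: a phrase's score is the sum of its words' scores."""
    return sum(scores.get(word, 0.0) for word in phrase.split())


# Toy candidate words (stop words assumed already filtered) and toy 2-d embeddings.
tokens = ["word", "graph", "embedding", "random", "walk", "graph", "embedding", "keyphrase"]
embeddings = {
    "word": [1.0, 0.2], "graph": [0.9, 0.4], "embedding": [0.8, 0.5],
    "random": [0.2, 1.0], "walk": [0.3, 0.9], "keyphrase": [0.6, 0.6],
}
graph = build_graph(tokens, embeddings)
scores = rank_words(graph)
for phrase in ["graph embedding", "random walk"]:
    print(phrase, round(score_phrase(phrase, scores), 4))
```

Summing word scores tends to favor longer phrases, which is one reason a similarity-based phrase scoring model such as the one the abstract mentions may be preferable in practice.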

Notes

  1. https://nlp.stanford.edu/projects/glove/.

  2. http://people.cs.ksu.edu/~ccaragea/keyphrases.html.

  3. http://www.nltk.org/.

  4. http://tartarus.org/martin/PorterStemmer/.

  5. https://github.com/facebookresearch/fastText.

Acknowledgements

This work was partially supported by grants from the National Natural Science Foundation of China (Nos. U1333109, 61632011, 61573231, U1533104), the Department of Industrial and Systems Engineering, Hong Kong Polytechnic University (Project code H-ZG3K), and the Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2018004).

Author information

Corresponding author

Correspondence to Yuxiang Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by B. B. Gupta.

About this article

Cite this article

Zhang, Y., Liu, H., Wang, S. et al. Automatic keyphrase extraction using word embeddings. Soft Comput 24, 5593–5608 (2020). https://doi.org/10.1007/s00500-019-03963-y

  • DOI: https://doi.org/10.1007/s00500-019-03963-y

Keywords

  • Keyphrase extraction
  • Random-walk-based keyphrase extraction model
  • Word embedding
  • Phrase scoring model