Keyword Extraction from Parallel Abstracts of Scientific Publications

  • Slobodan Beliga
  • Olivera Kitanović
  • Ranka Stanković
  • Sanda Martinčić-Ipšić
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10546)

Abstract

In this paper, we study keyword extraction from parallel abstracts of scientific publications in the Serbian and English languages. The keywords are extracted by the selectivity-based keyword extraction (SBKE) method, which is based on the structural and statistical properties of text represented as a complex network. The constructed parallel corpus of scientific abstracts with annotated keywords allows a better comparison of the method's performance across languages, since the experimental environment and data are controlled. Measured by F1 score, the keyword extraction results are 49.57% for English and 46.73% for Serbian if we disregard keywords that are not present in the abstracts. When we evaluate against the whole keyword set, the F1 scores are 40.08% and 45.71%, respectively. This work shows that SBKE can be easily ported to a new language, domain, and type of text in the sense of its structure. Still, there are drawbacks: the method can extract only the words that appear in the text.
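
To make the two evaluation settings concrete, the following is a minimal sketch, not the authors' implementation. It assumes selectivity is node strength divided by node degree, as commonly defined for SBKE, and it builds an undirected network from adjacent words only. The function names, the tie-breaking rule, and the toy data are hypothetical and are not taken from the paper's corpus.

```python
from collections import defaultdict


def selectivity_scores(tokens):
    """Score each word by selectivity = strength / degree in a co-occurrence network."""
    # Edge weight = how often two different words appear next to each other
    # (assumption: undirected links between adjacent words only).
    weights = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            weights[frozenset((left, right))] += 1

    strength = defaultdict(int)  # sum of weights on a node's links
    degree = defaultdict(int)    # number of distinct neighbours
    for edge, weight in weights.items():
        for node in edge:
            strength[node] += weight
            degree[node] += 1
    return {node: strength[node] / degree[node] for node in degree}


def f1_score(extracted, gold):
    """Harmonic mean of precision and recall over two keyword sets."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(extracted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up data for illustration only.
tokens = "keyword extraction network keyword extraction graph keyword extraction".split()
gold = {"keyword", "extraction", "ontology"}  # "ontology" never occurs in the text

scores = selectivity_scores(tokens)
# Take the two highest-scoring words; ties are broken alphabetically (assumption).
top = sorted(scores, key=lambda w: (-scores[w], w))[:2]

f1_all = f1_score(top, gold)                      # against the whole keyword set
f1_present = f1_score(top, gold & set(tokens))    # only keywords present in the text
print(top, round(f1_all, 2), round(f1_present, 2))
```

In this toy case the score against the whole keyword set is lower than the score against the keywords present in the text, because an annotated keyword that never occurs in the abstract can never be extracted and so always counts against recall.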

Keywords

  • Graph-based keyword extraction
  • Bilingual keyword extraction
  • SBKE method
  • Parallel abstracts

Notes

Acknowledgments

The authors would like to acknowledge networking support by the ICT COST Action IC1302 KEYSTONE – Semantic keyword-based search on structured data sources (www.keystone-cost.eu). The authors would also like to thank the University of Rijeka for the support under the LangNet project (13.13.2.2.07).

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Slobodan Beliga (1)
  • Olivera Kitanović (2)
  • Ranka Stanković (2)
  • Sanda Martinčić-Ipšić (1)

  1. Department of Informatics, University of Rijeka, Rijeka, Croatia
  2. Faculty of Mining and Geology, University of Belgrade, Belgrade, Serbia
