Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools

Łopuszyński, Michał; Bolikowski, Łukasz

doi:10.1007/978-3-319-08425-1_3

Part of the book series: Communications in Computer and Information Science (CCIS, volume 416)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. The first source of labels is Wikipedia; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the arXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
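
As an illustration only, dictionary-based tagging of the kind described above might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the `max_len` parameter, and the toy label list are all hypothetical, and the label vocabulary stands in for either Wikipedia article titles or noun phrases harvested from the corpus.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag_document(text, labels, max_len=3):
    """Return the labels (multi-word phrases allowed) that occur in the text.

    `labels` is a hypothetical vocabulary, e.g. Wikipedia article titles or
    corpus noun phrases; it is an assumption, not data from the paper.
    """
    # Index each label by its token sequence so lookups ignore case and punctuation.
    vocab = {tuple(tokenize(label)): label for label in labels}
    tokens = tokenize(text)
    found = set()
    # Slide windows of 1..max_len tokens over the text and look each one up.
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            label = vocab.get(tuple(tokens[i:i + n]))
            if label is not None:
                found.add(label)
    return sorted(found)


# Toy example with made-up labels and a made-up abstract.
labels = ["Machine learning", "Topic model", "Digital library"]
abstract = "We apply machine learning and a topic model to arXiv abstracts."
tags = tag_document(abstract, labels)
print(tags)
```

The sketch matches exact token sequences only; a practical system would likely also normalize word forms (for example by stemming) before the lookup.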

Acknowledgement

This research was carried out with the support of the “HPC Infrastructure for Grand Challenges of Science and Engineering (POWIEW)” Project, co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme.

Author information

Authors and Affiliations

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Pawińskiego 5a, 02-106, Warsaw, Poland
Michał Łopuszyński & Łukasz Bolikowski

Corresponding author

Correspondence to Michał Łopuszyński.

Editor information

Editors and Affiliations

Uniwersytet Warszawski, Warszawa, Poland
Łukasz Bolikowski
Consiglio Nazionale delle Ricerche Istituto di Scienza e Tecnologie dell’Informazione, Pisa, Italy
Vittore Casarosa
University of Sheffield, Sheffield, United Kingdom
Paula Goodale
National Documentation Centre, Athens, Greece
Nikos Houssos
Consiglio Nazionale delle Ricerche Istituto di Scienza e Tecnologie dell’Informazione, Pisa, Italy
Paolo Manghi
Bielefeld University, Bielefeld, Germany
Jochen Schirrwagen

Cite this paper

Łopuszyński, M., Bolikowski, Ł. (2014). Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-319-08425-1_3

DOI: https://doi.org/10.1007/978-3-319-08425-1_3
Published: 06 July 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08424-4
Online ISBN: 978-3-319-08425-1
eBook Packages: Computer Science, Computer Science (R0)
