Abstract
In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. The first label set is drawn from Wikipedia; the second is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the arXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
Acknowledgement
This research was carried out with the support of the “HPC Infrastructure for Grand Challenges of Science and Engineering (POWIEW)” Project, co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Łopuszyński, M., Bolikowski, Ł. (2014). Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-319-08425-1_3
DOI: https://doi.org/10.1007/978-3-319-08425-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08424-4
Online ISBN: 978-3-319-08425-1
eBook Packages: Computer Science, Computer Science (R0)