Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools

Comparison on the ArXiv Dataset
  • Michał Łopuszyński
  • Łukasz Bolikowski
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 416)


In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. The first label set is drawn from Wikipedia; the second is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
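As a rough illustration of dictionary-based tagging, the sketch below matches word n-grams of an abstract against a fixed set of labels. It is a minimal sketch under stated assumptions, not the pipeline used in the paper: the function names `tokenize` and `tag_with_dictionary`, the three-entry label list, and the `max_len` parameter are all hypothetical, and no stemming or filtering is applied. The same routine could in principle be fed either Wikipedia-derived labels or noun phrases extracted from the corpus.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tag_with_dictionary(text, labels, max_len=3):
    """Return the labels from `labels` that occur in `text` as word n-grams."""
    # Normalize each label to a tuple of tokens so that matching
    # ignores case and punctuation.
    normalized = {tuple(tokenize(label)): label for label in labels}
    tokens = tokenize(text)
    found = set()
    # Slide windows of 1..max_len tokens over the text and look each one up.
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            label = normalized.get(tuple(tokens[i:i + n]))
            if label is not None:
                found.add(label)
    return sorted(found)

# Hypothetical label dictionary; a real one would hold many more entries.
labels = ["Machine learning", "Topic model", "Natural language processing"]
abstract = "We apply natural language processing and machine learning to abstracts."
tags = tag_with_dictionary(abstract, labels)
print(tags)
```

Because matching is exact on tokens, a plural such as "topic models" would not match the label "Topic model" here; handling such variants would need an extra normalization step such as stemming.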


Tagging document collections; Natural language processing; Wikipedia



This research was carried out with the support of the “HPC Infrastructure for Grand Challenges of Science and Engineering (POWIEW)” Project, co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
