Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools

Comparison on the ArXiv Dataset

  • Conference paper
Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops (TPDL 2013)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 416)

Abstract

In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. The first set of labels is drawn from Wikipedia; the second is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
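As a rough illustration of dictionary-based tagging, the sketch below matches a small label vocabulary against the word n-grams of a text. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the `max_len` parameter, the sample labels (a stand-in for Wikipedia article titles), and the exact-match rule are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag_document(text, labels, max_len=3):
    """Return the labels that occur in `text` as word n-grams of up to `max_len` words."""
    # Index labels by their token tuple so matches respect word boundaries.
    index = {tuple(tokenize(label)): label for label in labels}
    tokens = tokenize(text)
    found = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            label = index.get(tuple(tokens[i:i + n]))
            if label is not None and label not in found:
                found.append(label)
    return found


# Hypothetical label vocabulary standing in for Wikipedia article titles.
LABELS = ["Topic model", "Machine learning", "Digital library"]

abstract = (
    "Tags can serve as features in machine learning tasks "
    "such as topic model estimation."
)
tags = tag_document(abstract, LABELS)
print(tags)
```

The same function could be run with a vocabulary of noun phrases extracted from the corpus instead of encyclopedia titles, which is the kind of comparison the abstract describes. A practical system would likely also normalize word forms (for example by stemming) before matching.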

Acknowledgement

This research was carried out with the support of the “HPC Infrastructure for Grand Challenges of Science and Engineering (POWIEW)” Project, co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme.

Corresponding author

Correspondence to Michał Łopuszyński.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Łopuszyński, M., Bolikowski, Ł. (2014). Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-319-08425-1_3

  • DOI: https://doi.org/10.1007/978-3-319-08425-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08424-4

  • Online ISBN: 978-3-319-08425-1

  • eBook Packages: Computer Science (R0)
