Text-Based Annotation of Scientific Images Using Wikimedia Categories

  • Frieda Josi
  • Christian Wartena
  • Jean Charbonnier
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 903)


The reuse of scientific raw data is a key demand of Open Science. In the NOA project we foster the reuse of scientific images by collecting them and uploading them to Wikimedia Commons. In this paper we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or for uploading images to Wikimedia Commons. The annotation consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of manually annotated open access images.
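
The two phases named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tf-idf weighting, the exact-name matching, and all function names, captions, and category names below are assumptions chosen only to show how keyword extraction and category mapping fit together.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(caption, background_docs, top_n=3):
    """Phase 1 (assumed): rank caption terms by a smoothed tf-idf score."""
    tf = Counter(tokenize(caption))
    doc_sets = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in doc_sets if term in doc)
        # Smoothed idf, so terms unseen in the background stay finite.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = count * idf
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_n]


def map_to_categories(keywords, category_names):
    """Phase 2 (assumed): keep categories whose lowercased name equals a keyword."""
    index = {name.lower(): name for name in category_names}
    return [index[keyword] for keyword in keywords if keyword in index]


# Hypothetical toy data, invented for illustration only.
caption = "Turbine blade with cracks near the turbine hub"
background = [
    "the image shows a bridge with a crack",
    "a plot of the temperature near the wall",
    "the photo shows a wheel with spokes",
]
categories = ["Turbine", "Blade", "Bridge"]

keywords = extract_keywords(caption, background)
print(keywords)
print(map_to_categories(keywords, categories))
```

In this toy setting, frequent function words such as "the" and "with" receive low scores because they also occur in the background texts, and only keywords with a matching category name survive the second phase. A realistic mapping step would need to handle multi-word terms, inflection, and ambiguity, which this sketch ignores.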


Scientific image search, Text annotation, Wikipedia categories



The presented work was developed within the NOA Project - Automatic Harvesting, Indexing and Provision of Open Access Figures from the Fields of Engineering and Technology Using the Infrastructure of Wikimedia Commons and Wikidata - funded by the DFG under grant number 315976924. NOA is a cooperative project of the Hochschule Hannover and the Technische Informationsbibliothek Hannover. We would like to thank the NOA project team.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Applied Sciences and Arts Hanover, Hanover, Germany
