
A Modality Converting Approach for Image Annotation to Overcome the Inconsistent Labels in Training Data


Part of the Lecture Notes in Computer Science book series (LNIP, volume 12662)


The automatic image annotation (AIA) task, in which a system assigns descriptive keywords to an input image, has long been studied as a shared task and remains important because annotation keywords give users efficient access to ever-growing image collections. However, the performance of current AIA systems remains low. One difficulty is the inconsistency of annotation keywords in the training data, which arises naturally in manual annotation and affects many supervised methods. For example, the keyword for an image of a person may be “tourist” or “woman” depending on the scene. This inconsistency makes it hard to annotate images to which several such similar keywords could apply. To address this difficulty, we propose a modality converting method that transforms an input image into an encyclopedic text describing the keywords assigned to the image. With modality conversion, similar keywords can share features derived from their texts. In the proposed method, we pair images with the Wikipedia articles whose titles are their annotation keywords. We train a modality converter from images to Wikipedia texts on the paired data using a neural network. The method then classifies the converted text into annotation keywords, in the same way as text classification. Experimental results show that our method, based on the converted text, achieves relatively high performance compared with existing methods.
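The pipeline described in the abstract (pair images with keyword articles, convert an image into text space, then classify the converted text) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the article snippets, the three-dimensional image features, and the function names are hypothetical, and a nearest-neighbour average stands in for the neural network converter that the paper trains. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles, keyed by annotation keyword.
ARTICLES = {
    "tourist": "person who travels to visit places for pleasure",
    "woman": "adult female person",
    "beach": "sand shore along the sea",
}

# Shared vocabulary over all article texts.
VOCAB = sorted({w for text in ARTICLES.values() for w in text.split()})


def text_vector(text):
    """Bag-of-words vector of a text over VOCAB."""
    counts = Counter(text.split())
    return [float(counts[w]) for w in VOCAB]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy paired training data: (hypothetical image features, keyword).
# The "tourist" and "woman" images look alike, mimicking inconsistent labels.
TRAIN = [
    ([0.9, 0.1, 0.0], "tourist"),
    ([0.8, 0.2, 0.1], "woman"),
    ([0.0, 0.1, 0.9], "beach"),
]


def convert(image, k=2):
    """Map an image to text space by averaging the article vectors of its
    k nearest training images (a simple stand-in for a learned converter)."""
    nearest = sorted(TRAIN, key=lambda pair: math.dist(image, pair[0]))[:k]
    vectors = [text_vector(ARTICLES[keyword]) for _, keyword in nearest]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def annotate(image, top_n=2):
    """Rank keywords by similarity between the converted text vector
    and each keyword's article vector."""
    converted = convert(image)
    scores = {kw: cosine(converted, text_vector(t)) for kw, t in ARTICLES.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    print(annotate([0.85, 0.15, 0.05]))
```

In this toy setup, an image of a person lying between the "tourist" and "woman" examples is converted to a text vector that overlaps both articles (they share the word "person"), so both keywords rank above "beach" instead of competing as unrelated labels.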


  • Automatic image annotation
  • Modality conversion
  • Image to text
  • Text classification
  • Neural network
  • Wikipedia








Corresponding author

Correspondence to Tokinori Suzuki.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Suzuki, T., Ikeda, D. (2021). A Modality Converting Approach for Image Annotation to Overcome the Inconsistent Labels in Training Data. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12662. Springer, Cham.



  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68789-2

  • Online ISBN: 978-3-030-68790-8

  • eBook Packages: Computer Science, Computer Science (R0)