Abstract
The automatic image annotation (AIA) task, in which a system assigns descriptive keywords to an input image, has long been studied as a shared task and remains important because annotation keywords enable users to access ever-growing image collections efficiently. However, the performance of current AIA systems remains low. For many supervised methods, one difficulty of AIA stems from the inconsistency of annotation keywords in the training data, which arises naturally in manual annotation. For example, the keyword assigned to an image of a person may be "tourist" or "woman" depending on the scene. This inconsistency makes it difficult to annotate images that plausibly admit such similar keywords. To address this difficulty, we propose a modality-converting method that transforms an input image into an encyclopedic text corresponding to the keywords assigned to the image. Through this modality conversion, similar keywords can share features derived from the texts. In the proposed method, we pair images with Wikipedia articles whose titles are the annotation keywords, and we train an image-to-Wikipedia-text modality converter with a neural network on the paired data. The method then classifies the converted text into annotation keywords, analogously to text classification. Experimental results show that our method, based on the converted text, achieves relatively high performance compared with existing methods.
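To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the idea as summarized in the abstract: encode an image as a feature vector, convert it into the embedding space of Wikipedia article texts, then classify the converted representation into annotation keywords. This is an illustrative assumption, not the authors' implementation; the module names (`ModalityConverter`, `KeywordClassifier`), dimensions, and loss choices are all hypothetical.

```python
# Illustrative sketch of image -> Wikipedia-text -> keyword classification.
# All names, dimensions, and training losses here are assumptions.
import torch
import torch.nn as nn

class ModalityConverter(nn.Module):
    """Maps image features (e.g., from a pretrained CNN) into the
    embedding space of Wikipedia article texts."""
    def __init__(self, img_dim=2048, txt_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(),
            nn.Linear(1024, txt_dim),
        )

    def forward(self, img_feat):
        return self.net(img_feat)

class KeywordClassifier(nn.Module):
    """Classifies a text-space vector into annotation keywords
    (multi-label: one logit per keyword)."""
    def __init__(self, txt_dim=300, n_keywords=100):
        super().__init__()
        self.fc = nn.Linear(txt_dim, n_keywords)

    def forward(self, txt_feat):
        return self.fc(txt_feat)

converter = ModalityConverter()
classifier = KeywordClassifier()

# Toy batch: image features paired with embeddings of the Wikipedia
# articles whose titles are the annotation keywords.
img_feats = torch.randn(8, 2048)
wiki_embeds = torch.randn(8, 300)
labels = torch.randint(0, 2, (8, 100)).float()

# Stage 1: train the converter to reproduce the paired article embedding.
conv_loss = nn.functional.mse_loss(converter(img_feats), wiki_embeds)

# Stage 2: classify the converted "text" representation into keywords.
logits = classifier(converter(img_feats))
cls_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)

(conv_loss + cls_loss).backward()
```

In this sketch the converter is trained against paired Wikipedia-text embeddings so that images whose keywords have similar encyclopedic descriptions land near each other in text space, which is one plausible way the conversion could let similar keywords share text-derived features.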
Keywords
- Automatic image annotation
- Modality conversion
- Image to text
- Text classification
- Neural network
- Wikipedia