Image Tagging by Joint Deep Visual-Semantic Propagation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10735)


Image tagging has attracted much research interest because of its wide range of applications. Many existing methods achieve impressive results, but they have two main limitations: (1) they focus only on tagging images and ignore the influence of tags on visual feature modeling; (2) they model tag correlations without considering the visual content of the image. In this paper, we propose a joint visual-semantic propagation model (JVSP) to address these two issues. First, we use joint visual-semantic modeling to obtain integrated features that accurately reflect the relationship between tags and image regions. Second, we introduce a visual-guided LSTM to capture the co-occurrence relations among tags. Third, we design a diversity loss that encourages the model to focus on different regions. Experimental results on three challenging datasets demonstrate that the proposed method yields significant performance gains over existing methods.
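The abstract does not give the form of the diversity loss, so the following is only an illustrative sketch of one way such a term could work, not the formulation used in the paper. It assumes that each tagging step produces an attention distribution over image regions, and it penalizes the average overlap between the distributions of different steps. The function name, the pairwise dot-product overlap measure, and the toy inputs are all assumptions made for this example.

```python
from itertools import combinations
from typing import Sequence


def attention_overlap_penalty(attention: Sequence[Sequence[float]]) -> float:
    """Mean pairwise dot product between per-step attention distributions.

    attention[t][r] is the (hypothetical) weight that tagging step t places
    on image region r; each row is assumed to be non-negative and to sum to 1.
    The value is 0 when all steps attend to disjoint regions and grows as
    different steps attend to the same regions.
    """
    pairs = list(combinations(range(len(attention)), 2))
    if not pairs:
        return 0.0  # a single step has nothing to overlap with
    total = 0.0
    for i, j in pairs:
        total += sum(a * b for a, b in zip(attention[i], attention[j]))
    return total / len(pairs)


# Toy input: three tagging steps over four image regions.
disjoint = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
collapsed = [[0.5, 0.5, 0.0, 0.0]] * 3

print(attention_overlap_penalty(disjoint))   # 0.0
print(attention_overlap_penalty(collapsed))  # 0.5
```

Under these assumptions, adding such a penalty to the tagging loss would push successive steps toward different regions; how the weighting against the main loss is chosen is left open here.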


Keywords: Image tagging, CNN-LSTM, Visual-semantic



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. University of Hong Kong, Hong Kong, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. National University of Defense Technology, Changsha, China
