Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval

  • Yuhang Lu
  • Jing YuEmail author
  • Yanbing Liu
  • Jianlong Tan
  • Li Guo
  • Weifeng Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11061)


Cross-modal retrieval provides a flexible way to find semantically relevant information across different modalities given a query of one modality. The main challenge is to measure the similarity between different modalities of data. Generally, different modalities contain unequal amount of information when describing the same semantics. For example, textual descriptions often contain more background information that cannot be conveyed by images and vice versa. Existing works mostly map the global data features from different modalities to a common semantic space to measure their similarity, which ignore their imbalanced and complementary relationships. In this paper, we propose stacked co-attention networks (SCANet) to progressively learn the mutually attended features of different modalities and leverage these fine-grained correlations to enhance cross-modal retrieval performance. SCANet adopts a dual-path end-to-end framework to jointly learn the multimodal representations, stacked co-attention, and similarity metric. Experiment results on three widely-used benchmark datasets verify that SCANet outperforms state-of-the-art methods, with 19% improvements on MAP in average for the best case.


Stacked co-attention network Graph convolution Fine-grained cross-modal correlation 



This work is supported by the National Key Research and Development Program (Grant No. 2017YFC0820700) and the Fundamental Theory and Cutting Edge Technology Research Program of Institute of Information Engineering, CAS (Grant No. Y7Z0351101)


  1. 1.
    Blei, D.M., Jordan, M.I.: Modeling annotated data. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127–134. ACM (2003)Google Scholar
  2. 2.
    Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: NIPS, pp. 3837–3845 (2016)Google Scholar
  3. 3.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)Google Scholar
  4. 4.
    Kan, M., Shan, S., Zhang, H., Lao, S., Chen, X.: Multi-view discriminant analysis. TPAMI 38(1), 188–194 (2016)CrossRefGoogle Scholar
  5. 5.
    Kang, C., Xiang, S., Liao, S., Xu, C., Pan, C.: Learning consistent feature representation for cross-modal multimedia retrieval. TMM 17(3), 276–288 (2017)Google Scholar
  6. 6.
    Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
  7. 7.
    Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  8. 8.
    Kumar, B.G.V., Carneiro, G., Reid, I.: Learning local image descriptors with deep siamese and triplet convolutional networks by minimizing global loss functions. In: CVPR, pp. 5385–5394 (2016)Google Scholar
  9. 9.
    Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  10. 10.
    Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471 (2016)
  11. 11.
    Peng, Y., Huang, X., Qi, J.: Cross-media shared representation by hierarchical learning with multiple deep networks. In: IJCAI, pp. 3846–3853 (2016)Google Scholar
  12. 12.
    Peng, Y., Qi, J., Yuan, Y.: Modality-specific cross-modal similarity measurement with recurrent attention network. arXiv preprint arXiv:1708.04776 (2017)
  13. 13.
    Qin, Z., Yu, J., Cong, Y., Wan, T.: Topic correlation model for cross-modal multimedia information retrieval. PAA 19(4), 1007–1022 (2016)MathSciNetGoogle Scholar
  14. 14.
    Ranjan, V., Rasiwasia, N., Jawahar, C.: Multi-label cross-modal retrieval. In: ICCV, pp. 4094–4102 (2015)Google Scholar
  15. 15.
    Rasiwasia, N., et al.: A new approach to cross-modal multimedia retrieval. In: ACM-MM, pp. 251–260 (2010)Google Scholar
  16. 16.
    Sharma, A., Kumar, A., Daume, H., Jacobs, D.W.: Generalized multiview analysis: a discriminative latent space. In: CVPR, pp. 2160–2167 (2012)Google Scholar
  17. 17.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Comput. Sci. (2014)Google Scholar
  18. 18.
    Wang, K., He, R., Wang, L., Wang, W., Tan, T.: Joint feature selection and subspace learning for cross-modal retrieval. TPAMI 38(10), 2010–2023 (2016)CrossRefGoogle Scholar
  19. 19.
    Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for cross-modal matching. In: ICCV, pp. 2088–2095 (2013)Google Scholar
  20. 20.
    Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29 (2016)Google Scholar
  21. 21.
    Yu, J., et al.: Modeling text with graph convolutional network for cross-modal information retrieval. arXiv preprint arXiv:1802.00985 (2018)
  22. 22.
    Zhang, L., Ma, B., He, J., Li, G., Huang, Q., Tian, Q.: Adaptively unified semi-supervised learning for cross-modal retrieval. In: IJCAI, pp. 3406–3412 (2017)Google Scholar
  23. 23.
    Zhang, X., et al.: HashGAN: attention-aware deep adversarial hashing for cross modal retrieval. arXiv preprint arXiv:1711.09347 (2017)
  24. 24.
    Zhen, Y., Yeung, D.Y.: Co-regularized hashing for multimodal data. In: NIPS, pp. 1376–1384 (2012)Google Scholar
  25. 25.
    Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Shen, Y.D.: Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535 (2017)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yuhang Lu
    • 1
    • 2
  • Jing Yu
    • 1
    Email author
  • Yanbing Liu
    • 1
  • Jianlong Tan
    • 1
  • Li Guo
    • 1
  • Weifeng Zhang
    • 3
    • 4
  1. 1.Institute of Information EngineeringChinese Academy of SciencesBeijingChina
  2. 2.School of Cyber SecurityUniversity of Chinese Academy of SciencesBeijingChina
  3. 3.School of Computer Science and TechnologyHangzhou Dianzi UniversityHangzhouChina
  4. 4.Zhejiang Future Technology InstituteJiaxingChina

Personalised recommendations