2D-Convolution Based Feature Fusion for Cross-Modal Correlation Learning

  • Jingjing Guo
  • Jing Yu (corresponding author)
  • Yuhang Lu
  • Yue Hu
  • Yanbing Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11537)

Abstract

Cross-modal information retrieval (CMIR) enables users to search for semantically relevant data of various modalities from a given query of one modality. The predominant challenge is to alleviate the "heterogeneous gap" between different modalities. For text-image retrieval, the typical solution is to project text features and image features into a common semantic space and measure the cross-modal similarity there. However, semantically relevant data from different modalities usually contain imbalanced information, so aligning all modalities in the same space weakens modality-specific semantics and introduces unexpected noise. In this paper, we propose a novel CMIR framework based on multi-modal feature fusion. In this framework, the cross-modal similarity is measured by directly analyzing the fine-grained correlations between text features and image features, without learning a common semantic space. Specifically, we first construct a cross-modal feature matrix that fuses the original visual and textual features. Then 2D-convolutional networks are applied to reason about the inner-group relationships among features across modalities, resulting in fine-grained text-image representations. The cross-modal similarity is measured by a multi-layer perceptron on top of the fused feature representations. We conduct extensive experiments on two representative CMIR datasets, i.e., English Wikipedia and TVGraz. Experimental results indicate that our model significantly outperforms state-of-the-art methods. Moreover, the proposed cross-modal feature fusion approach is more effective for CMIR tasks than other feature fusion approaches.
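
The following PyTorch sketch illustrates the pipeline described above: fuse a text feature vector and an image feature vector into a cross-modal feature matrix, apply 2D convolutions to reason over fine-grained cross-modal interactions, and score similarity with a multi-layer perceptron. The outer-product fusion, all layer sizes, and the module name CrossModalFusionNet are illustrative assumptions rather than the authors' exact architecture.

```python
# Minimal sketch of a fusion-then-convolution similarity model.
# Assumptions (not from the paper): outer-product fusion, layer sizes,
# and the class name CrossModalFusionNet.
import torch
import torch.nn as nn


class CrossModalFusionNet(nn.Module):
    def __init__(self, text_dim=300, image_dim=512):
        super().__init__()
        # 2D-convolutional layers that reason over the fused feature matrix.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Multi-layer perceptron mapping the fused representation to a score.
        self.mlp = nn.Sequential(
            nn.Linear(32 * 8 * 8, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, text_feat, image_feat):
        # Fuse the two modalities into one 2D feature matrix per pair.
        # Here the fusion is an outer product of shape (batch, text_dim, image_dim),
        # exposing every pairwise text-image feature interaction.
        matrix = torch.einsum('bt,bi->bti', text_feat, image_feat)
        matrix = matrix.unsqueeze(1)          # add a single channel dimension
        fused = self.conv(matrix).flatten(1)  # fine-grained fused representation
        return self.mlp(fused).squeeze(-1)    # cross-modal similarity score


if __name__ == "__main__":
    # Usage with random features standing in for real text/image encodings.
    model = CrossModalFusionNet()
    text = torch.randn(4, 300)     # e.g. text encoder output
    image = torch.randn(4, 512)    # e.g. CNN image features
    print(model(text, image).shape)  # torch.Size([4])
```

The outer product is one plausible way to realize a "cross-modal feature matrix": it makes every interaction between a text dimension and an image dimension explicit, so the 2D convolutions can pick up local groups of correlated cross-modal features.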

Keywords

2D-convolutional network · Inner-group relationship · Feature fusion · Cross-modal correlation · Cross-modal information retrieval

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jingjing Guo (1, 2)
  • Jing Yu (1, 2) (corresponding author)
  • Yuhang Lu (1, 2)
  • Yue Hu (1)
  • Yanbing Liu (1)
  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
  2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China