Two-stage deep learning for supervised cross-modal retrieval

  • Jie Shao
  • Zhicheng Zhao
  • Fei Su


This paper addresses the problem of modeling internet images and their associated texts for cross-modal retrieval, i.e., text-to-image and image-to-text retrieval. Supervised cross-modal retrieval has recently attracted increasing attention. Inspired by a typical two-stage method, semantic correlation matching (SCM), we propose a novel two-stage deep learning method for supervised cross-modal retrieval. Because traditional canonical correlation analysis (CCA) is a two-view method, SCM exploits supervised semantic information only in its second stage. To make full use of semantics, we extend CCA from two views to three and perform supervised learning in both stages. In the first stage, we embed 3-view CCA into a deep architecture to learn non-linear correlations among images, texts and semantics; the loss function combines the correlation loss of every pair of views with regularization of the parameters, and we add the reconstruction loss of each view to curb over-fitting. In the second stage, we build a novel fully-convolutional network (FCN), trained under the joint supervision of a contrastive loss and a center loss, to learn more discriminative features. The proposed method is evaluated on two publicly available data sets, and the experimental results show that it is competitive with state-of-the-art methods.
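The loss structure described above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names (`corr_loss`, `stage1_loss`, `center_loss`, `contrastive_loss`) and the weights `mu` and `lam` are hypothetical, and the per-dimension correlation is a simplified stand-in for the deep 3-view CCA objective.

```python
import numpy as np

def corr_loss(a, b, eps=1e-8):
    """Negative mean per-dimension correlation between two views
    (a simplified stand-in for the deep CCA correlation objective)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + eps
    return -np.mean(num / den)

def stage1_loss(img, txt, sem, reconstructions, params, mu=1.0, lam=1e-3):
    """Stage 1: correlation loss over every pair of the three views,
    plus per-view reconstruction loss (against over-fitting) and
    L2 regularization of the network parameters."""
    pair = corr_loss(img, txt) + corr_loss(img, sem) + corr_loss(txt, sem)
    rec = sum(np.mean((x - r) ** 2) for x, r in reconstructions)
    reg = sum(np.sum(w ** 2) for w in params)
    return pair + mu * rec + lam * reg

def center_loss(feats, labels, centers):
    """Stage 2 (part 1): penalizes the distance between each feature
    and the center of its class, compacting within-class variation."""
    return 0.5 * np.mean(np.sum((feats - centers[labels]) ** 2, axis=1))

def contrastive_loss(f1, f2, same, margin=1.0):
    """Stage 2 (part 2): pulls matched pairs together and pushes
    mismatched pairs apart by at least `margin`."""
    d = np.linalg.norm(f1 - f2, axis=1)
    return np.mean(same * d ** 2
                   + (1.0 - same) * np.maximum(margin - d, 0.0) ** 2)
```

In stage 2, the FCN would be trained on the weighted sum of `contrastive_loss` and `center_loss`, so that features are both discriminative across classes and compact within each class.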


Keywords: Two-stage · 3-view · Reconstruction loss · Center loss · Contrastive loss



This work was supported by the Chinese National Natural Science Foundation (61532018, 61471049) and the Key Laboratory of Forensic Marks, Ministry of Public Security of China.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
  2. Beijing Key Laboratory of Network System and Network Culture, Beijing University of Posts and Telecommunications, Beijing, China
