World Wide Web

, Volume 22, Issue 2, pp 657–672 | Cite as

Deep adversarial metric learning for cross-modal retrieval

  • Xing Xu
  • Li He
  • Huimin LuEmail author
  • Lianli Gao
  • Yanli Ji
Part of the following topical collections:
  1. Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications


Cross-modal retrieval has become a highlighted research topic, to provide flexible retrieval experience across multimedia data such as image, video, text and audio. The core of existing cross-modal retrieval approaches is to narrow down the gap between different modalities either by finding a maximally correlated embedding space. Recently, researchers leverage Deep Neural Network (DNN) to learn nonlinear transformations for each modality to obtain transformed features in a common subspace where cross-modal matching can be performed. However, the statistical characteristics of the original features for each modality are not explicitly preserved in the learned subspace. Inspired by recent advances in adversarial learning, we propose a novel Deep Adversarial Metric Learning approach, termed DAML for cross-modal retrieval. DAML nonlinearly maps labeled data pairs of different modalities into a shared latent feature subspace, under which the intra-class variation is minimized and the inter-class variation is maximized, and the difference of each data pair captured from two modalities of the same class is minimized, respectively. In addition to maximizing the correlations between modalities, we add an additional regularization by introducing adversarial learning. In particular, we introduce a modality classifier to predict the modality of a transformed feature, which ensures that the transformed features are also statistically indistinguishable. Experiments on three popular multimodal datasets show that DAML achieves superior performance compared to several state of the art cross-modal retrieval methods.


Cross-modal retrieval Adversarial learning Metric learning 



This work is partially supported by NSFC grant No. 61602089, No. 61673088, No. 61502080; the 111 Project No. B17008; the Fundamental Research Funds for Central Universities ZYGX2016KYQD114; the LEADER of MEXT-Japan (16809746); the Telecommunications Foundation; the REDAS and SCAT.


  1. 1.
    Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: ICML, pp. 1247–1255 (2013)Google Scholar
  2. 2.
    Chu, L., Zhang, Y., Li, G., Wang, S., Zhang, W., Huang, Q.: Effective multimodality fusion framework for cross-media topic detection. IEEE Trans. Circuits Syst. Video Technol. 26(3), 556–569 (2016)CrossRefGoogle Scholar
  3. 3.
    Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.-T.: Nus-wide: A real-world Web image database from national university of singapore. In: CIVR (2009)Google Scholar
  4. 4.
    Costa Pereira, J., Coviello, E., Doyle, G., Rasiwasia, N., Lanckriet, G., Levy, R., Vasconcelos, N.: On the role of correlation and abstraction in cross-modal multimedia retrieval. TPAMI 36(3), 521–535 (2014)CrossRefGoogle Scholar
  5. 5.
    Feng, F., Wang, X., Li, R.: Cross-modal retrieval with correspondence autoencoder. In: ACM MM, pp. 7–16. ACM (2014)Google Scholar
  6. 6.
    Ganin, Y., Lempitsky, V.S.: Unsupervised domain adaptation by backpropagation. In: ICML, pp. 1180–1189 (2015)Google Scholar
  7. 7.
    Gong, Y., Ke, Q., Isard, M., Lazebnik, S.: A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV 106(2), 210–233 (2014)CrossRefGoogle Scholar
  8. 8.
    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)Google Scholar
  9. 9.
    Gu, Q., Li, Z., Han, J.: Joint feature selection and subspace learning. In: IJCAI (2011)Google Scholar
  10. 10.
    Hardoon, D., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)CrossRefzbMATHGoogle Scholar
  11. 11.
    He, L., Xu, X., Lu, H., Yang, Y., Shen, H.T.: Unsupervised cross modal retrieval through adversarial learning. In: ICME, pp. 1–6 (2017)Google Scholar
  12. 12.
    Kang, C., Liao, S., He, Y., Wang, J., Niu, W., Xiang, S., Pan, C.: Cross-modal similarity learning: a low rank bilinear formulation. In: CKIM, pp. 1251–1260 (2015)Google Scholar
  13. 13.
    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv:1312.6114 (2013)
  14. 14.
    Larsen, A.B.L., Sønderby, S.K., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. arXiv:1512.09300 (2015)
  15. 15.
    Liong, V.E., Lu, J., Tan, Y., Zhou, J.: Deep coupled metric learning for cross-modal matching. IEEE Trans. Multimedia 19(6), 1234–1244 (2017)CrossRefGoogle Scholar
  16. 16.
    Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I.: Adversarial Autoencoders. arXiv, pp. 1–10 (2015)Google Scholar
  17. 17.
    Mignon, A., Jurie, F.: CMML: a new metric learning approach for cross modal matching. In: ACCV (2012)Google Scholar
  18. 18.
    Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.: Multimodal deep learning. In: ICML, pp. 689–696 (2011)Google Scholar
  19. 19.
    Peng, Y., Huang, X., Qi, J.: Cross-media shared representation by hierarchical learning with multiple deep networks. In: IJCAI, pp. 3846C3853 (2016)Google Scholar
  20. 20.
    Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 (2015)
  21. 21.
    Rashtchian, C., Young, M., Hodosh, P., Hockenmaier, J.: Collecting image annotations using amazon’s mechanical turk. In: NAACL HLT 2010 workshop on creating speech and language data with amazon’s mechanical turk (2010)Google Scholar
  22. 22.
    Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: ACM MM, pp. 251–260. ACM (2010)Google Scholar
  23. 23.
    Sharma, A., Kumar, A., Daume, H., Jacobs, D.: Generalized multiview analysis: a discriminative latent space. In: CVPR, pp. 2160–2167. IEEE (2012)Google Scholar
  24. 24.
    Siena, S., Boddeti, V.N., Kumar, B.V.K.V.: Coupled marginal fisher analysis for low-resolution face recognition. In: ECCV workshops and demonstrations, pp. 240–249 (2012)Google Scholar
  25. 25.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, arXiv:1409.1556 (2014)
  26. 26.
    Song, J., Gao, L., Liu, L., Zhu, X., Sebe, N.: Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recogn. 75, 175–187 (2018)CrossRefGoogle Scholar
  27. 27.
    Song, J., Yang, Y., Yang, Y., Huang, Z., Shen, H.T.: Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: ACM SIGMOD, pp. 785–796 (2013)Google Scholar
  28. 28.
    Song, J., Zhang, H., Li, X., Gao, L., Wang, M., Hong, R.: Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Trans. Image Process. 25 (11), 4999–5011 (2018)MathSciNetCrossRefzbMATHGoogle Scholar
  29. 29.
    Song, Y., Wang, W., Zhang, A.: Automatic annotation and retrieval of images. World Wide Web 6(2), 209–231 (2003)CrossRefGoogle Scholar
  30. 30.
    Srivastava, N., Salakhutdinov, R.: Learning representations for multimodal data with deep belief nets. In: ICML workshop (2012)Google Scholar
  31. 31.
    Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep boltzmann machines. In: NIPS, pp. 2222–2230 (2012)Google Scholar
  32. 32.
    Tian, Q., Chen, S.: Cross-heterogeneous-database age estimation through correlation representation learning. Neurocomput. 238(C), 286–295 (2017)CrossRefGoogle Scholar
  33. 33.
    Wang, J., Zhang, T., Song, J., Sebe, N., Shen, H.T.: A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell. 26(99), 15–29 (2017)Google Scholar
  34. 34.
    Wang, K., He, R., Wang, L., Wang, W., Tan, T.: Joint feature selection and subspace learning for cross-modal retrieval. TPAMI 38(10), 2010–2023 (2011)CrossRefGoogle Scholar
  35. 35.
    Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for cross-modal matching. In: ICCV, pp. 2088–2095 (2013)Google Scholar
  36. 36.
    Wang, K., Yin, Q., Wang, W., Wu, S., Wang, L.: A Comprehensive Survey on Cross-modal Retrieval. arXiv, pp. 1–20 (2016)Google Scholar
  37. 37.
    Wang, W., Yang, X., Ooi, B.C., Zhang, D., Zhuang, Y.: Effective deep learning-based multi-modal retrieval. VLDB J 25(1), 79–101 (2016)CrossRefGoogle Scholar
  38. 38.
    Wang, X., Gao, L., Wang, P., Sun, X., Liu, X.: Two-stream 3d convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimedia PP(99), 1–1 (2017)Google Scholar
  39. 39.
    Wu, Y., Wang, S., Huang, Q.: Online asymmetric similarity learning for cross-modal retrieval. In: CVPR, pp. 3984–3993 (2017)Google Scholar
  40. 40.
    Xu, X., He, L., Shimada, A., Taniguchi, R., Lu, H.: Learning unified binary codes for cross-modal retrieval via latent semantic hashing. Neurocomputing 213, 191–203 (2016)CrossRefGoogle Scholar
  41. 41.
    Xu, X., Shen, F., Yang, Y., Shen, H.T., Li, X.: Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans. Image Process. 26(5), 2494–2507 (2017)MathSciNetCrossRefzbMATHGoogle Scholar
  42. 42.
    Xu, X., Shimada, A., Taniguchi, R., He, L.: Coupled dictionary learning and feature mapping for cross-modal retrieval. In: ICME, pp. 1–6. IEEE (2015)Google Scholar
  43. 43.
    Xu, X., Yang, Y., Shimada, A., Taniguchi, R., He, L.: Semi-supervised coupled dictionary learning for cross-modal retrieval in internet images and texts. In: ACM MM, pp. 847–850. ACM (2015)Google Scholar
  44. 44.
    Yan, F., Mikolajczyk, K.: Deep correlation for matching images and text. In: CVPR, pp. 3441–3450 (2015)Google Scholar
  45. 45.
    Yang, Y., Zha, Z.-J., Gao, Y., Zhu, X., Chua, T.-S.: Exploiting Web images for semantic video indexing via robust sample-specific loss. IEEE Trans. Multimedia 16(6), 1677–1689 (2014)CrossRefGoogle Scholar
  46. 46.
    Yang, Y., Zhang, H., Zhang, M., Shen, F., Li, X.: Visual coding in a semantic hierarchy. In: Proceedings of the 23rd ACM international conference on Multimedia, pp. 59–68, ACM (2015)Google Scholar
  47. 47.
    Zhai, X., Peng, Y., Xiao, J.: Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Trans. Circuits Syst. Video Technol. 24, 965C–978 (2014)CrossRefGoogle Scholar
  48. 48.
    Zhang, H., Gao, X., Wu, P., Xu, X.: A cross-media distance metric learning framework based on multi-view correlation mining and matching. World Wide Web 19(2), 181–197 (2016)CrossRefGoogle Scholar
  49. 49.
    Zhuang, Y., Wang, Y., Wu, F., Zhang, Y., Lu, W.: Supervised coupled dictionary learning with group structures for multi-modal retrieval. In: AAAI (2013)Google Scholar
  50. 50.
    Zou, F., Chen, Y., Song, J., Zhou, K., Yang, Y., Sebe, N.: Compact image fingerprint via multiple kernel hashing. IEEE Transaction on Multimedia 17(7), 1006–1018 (2015)CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Center for Future Media, School of Computer Science and EngineeringUniversity of Electronic Science and Technology of ChinaChengduChina
  2. 2.School of AutomationUniversity of Electronic Science and Technology of ChinaChengduChina
  3. 3.Qualcomm Technologies, Inc.San DiegoUSA
  4. 4.Kyushu Institute of TechnologyFukuokaJapan

Personalised recommendations