
Deep adversarial metric learning for cross-modal retrieval


Abstract

Cross-modal retrieval has become a prominent research topic, as it provides a flexible retrieval experience across multimedia data such as image, video, text and audio. The core of existing cross-modal retrieval approaches is to narrow the gap between different modalities, typically by finding a maximally correlated embedding space. Recently, researchers have leveraged Deep Neural Networks (DNNs) to learn nonlinear transformations for each modality, producing transformed features in a common subspace where cross-modal matching can be performed. However, the statistical characteristics of the original features of each modality are not explicitly preserved in the learned subspace. Inspired by recent advances in adversarial learning, we propose a novel Deep Adversarial Metric Learning approach, termed DAML, for cross-modal retrieval. DAML nonlinearly maps labeled data pairs of different modalities into a shared latent feature subspace in which the intra-class variation is minimized, the inter-class variation is maximized, and the difference between the two modalities of each data pair of the same class is minimized. In addition to maximizing the correlations between modalities, we add a further regularization by introducing adversarial learning. In particular, we introduce a modality classifier to predict the modality of a transformed feature, which ensures that the transformed features are also statistically indistinguishable across modalities. Experiments on three popular multimodal datasets show that DAML achieves superior performance compared to several state-of-the-art cross-modal retrieval methods.
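
To make the training signals described above concrete, the following is a minimal PyTorch sketch of the general recipe: two nonlinear projectors map image and text features into a shared subspace, a metric loss pulls matched cross-modal pairs and same-class samples together while pushing different classes apart, and a modality classifier is trained adversarially so that the projected features become statistically indistinguishable. All layer sizes, loss weights, and names (Projector, ModalityClassifier, metric_loss) are illustrative assumptions rather than the authors' implementation, and the negated cross-entropy used here is a simple stand-in for the paper's adversarial regularizer.

    # Illustrative sketch only; hyperparameters and architecture are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Projector(nn.Module):
        """Nonlinear mapping from one modality's features to the shared subspace."""
        def __init__(self, in_dim, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        def forward(self, x):
            return F.normalize(self.net(x), dim=1)

    class ModalityClassifier(nn.Module):
        """Predicts whether a latent vector came from the image or the text branch."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 2))
        def forward(self, z):
            return self.net(z)

    def metric_loss(z_img, z_txt, labels, margin=1.0):
        """Pull matched pairs and same-class samples together; push different
        classes apart by at least `margin`."""
        pair = (z_img - z_txt).pow(2).sum(1).mean()          # matched cross-modal pairs
        z = torch.cat([z_img, z_txt], 0)
        y = torch.cat([labels, labels], 0)
        d = torch.cdist(z, z)                                 # pairwise distances
        same = (y.unsqueeze(0) == y.unsqueeze(1)).float()
        intra = (d * same).sum() / same.sum().clamp(min=1)
        inter = (F.relu(margin - d) * (1 - same)).sum() / (1 - same).sum().clamp(min=1)
        return pair + intra + inter

    # Toy training step with hypothetical feature dimensions (4096-d image, 300-d text).
    img_net, txt_net, disc = Projector(4096), Projector(300), ModalityClassifier()
    opt_g = torch.optim.Adam(list(img_net.parameters()) + list(txt_net.parameters()), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

    img_feat, txt_feat = torch.randn(32, 4096), torch.randn(32, 300)
    labels = torch.randint(0, 10, (32,))
    z_img, z_txt = img_net(img_feat), txt_net(txt_feat)
    mod_target = torch.cat([torch.zeros(32, dtype=torch.long), torch.ones(32, dtype=torch.long)])

    # 1) The modality classifier learns to tell modalities apart in the latent space.
    d_loss = F.cross_entropy(disc(torch.cat([z_img, z_txt], 0).detach()), mod_target)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) The projectors minimize the metric loss while trying to fool the classifier.
    adv_loss = -F.cross_entropy(disc(torch.cat([z_img, z_txt], 0)), mod_target)
    g_loss = metric_loss(z_img, z_txt, labels) + 0.1 * adv_loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()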

Notes

  1. http://www.svcl.ucsd.edu/projects/crossmodal/

  2. http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

  3. http://vision.cs.uiuc.edu/pascal-sentences/


Acknowledgements

This work is partially supported by NSFC grants No. 61602089, No. 61673088, and No. 61502080; the 111 Project No. B17008; the Fundamental Research Funds for the Central Universities ZYGX2016KYQD114; the LEADER of MEXT-Japan (16809746); the Telecommunications Foundation; the REDAS and SCAT.

Author information


Corresponding author

Correspondence to Huimin Lu.

Additional information

This article belongs to the Topical Collection: Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications

Guest Editors: Jingkuan Song, Shuqiang Jiang, Elisa Ricci, and Zi Huang

About this article


Cite this article

Xu, X., He, L., Lu, H. et al. Deep adversarial metric learning for cross-modal retrieval. World Wide Web 22, 657–672 (2019). https://doi.org/10.1007/s11280-018-0541-x
