Cross-media similarity metric learning with unified deep networks

Qi, Jinwei; Huang, Xin; Peng, Yuxin

doi:10.1007/s11042-017-4726-6

Published: 06 May 2017

Multimedia Tools and Applications, Volume 76, pages 25109–25127 (2017)

Jinwei Qi¹,
Xin Huang¹ &
Yuxin Peng¹

Abstract

Cross-media retrieval, an active research topic in the multimedia area, aims to capture the complex correlations among multiple media types. Learning a better shared representation and distance metric for multimedia data is important for improving cross-media retrieval. Motivated by the strong ability of deep neural networks to learn feature representations and comparison functions, we propose the Unified Network for Cross-media Similarity Metric (UNCSM), which associates cross-media shared representation learning with distance metric learning in a unified framework. First, we design a two-pathway deep network pretrained with a contrastive loss, and employ a double triplet similarity loss for fine-tuning to learn the shared representation for each media type by modeling the relative semantic similarity. Second, a metric network is designed to effectively calculate the cross-media similarity of the shared representations by modeling pairwise similar and dissimilar constraints. Compared with existing methods, which mostly ignore the dissimilar constraints and apply a simple distance metric such as Euclidean distance as a separate step, our UNCSM approach unifies representation learning and the distance metric to preserve relative similarity and to embrace more complex similarity functions, further improving cross-media retrieval accuracy. Experimental results show that UNCSM outperforms 8 state-of-the-art methods on 4 widely used cross-media datasets.
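The abstract names a contrastive loss for pretraining and a double triplet similarity loss for fine-tuning, but does not give their formulas. As a rough illustration only, the sketch below uses the standard textbook forms of both losses on plain Python lists. The function names, the margin value, and the reading of "double triplet" as one triplet anchored on the image plus one anchored on the text are all assumptions made for this example, not the authors' definitions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Standard contrastive loss on one pair (assumed form).

    Similar pairs are pulled together (squared distance);
    dissimilar pairs are pushed apart until they exceed the margin.
    """
    d = euclidean(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss (assumed form).

    Zero once the negative is farther from the anchor than the
    positive by at least the margin.
    """
    return max(0.0, margin + euclidean(anchor, positive) - euclidean(anchor, negative))


def double_triplet_loss(img, txt_pos, txt_neg, txt, img_pos, img_neg, margin=1.0):
    """Hypothetical reading of 'double triplet': one triplet with an
    image anchor and text samples, one with a text anchor and image samples."""
    return (triplet_loss(img, txt_pos, txt_neg, margin)
            + triplet_loss(txt, img_pos, img_neg, margin))


if __name__ == "__main__":
    image = [0.0, 0.0]
    matching_text = [0.0, 0.0]
    unrelated_text = [3.0, 4.0]
    print(contrastive_loss(image, unrelated_text, similar=False))
    print(triplet_loss(image, matching_text, unrelated_text))
```

In this toy setup the vectors stand in for shared representations produced by the two network pathways. A learned metric network, as described in the abstract, would replace the fixed `euclidean` function with a trainable similarity function.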

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61371128 and 61532005, and by the National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102.

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Jinwei Qi, Xin Huang & Yuxin Peng

Corresponding author

Correspondence to Yuxin Peng.

About this article

Cite this article

Qi, J., Huang, X. & Peng, Y. Cross-media similarity metric learning with unified deep networks. Multimed Tools Appl 76, 25109–25127 (2017). https://doi.org/10.1007/s11042-017-4726-6

Received: 15 September 2016
Revised: 04 March 2017
Accepted: 17 April 2017
Published: 06 May 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s11042-017-4726-6
