Abstract
As an active research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning a better shared representation and distance metric for multimedia data is important for improving cross-media retrieval. Motivated by the strong ability of deep neural networks to learn feature representations and comparison functions, we propose the Unified Network for Cross-media Similarity Metric (UNCSM), which combines cross-media shared representation learning and distance metric learning in a unified framework. First, we design a two-pathway deep network that is pretrained with a contrastive loss and then fine-tuned with a double triplet similarity loss, so that the shared representation for each media type is learned by modeling relative semantic similarity. Second, we design a metric network that computes the cross-media similarity of the shared representations by modeling pairwise similar and dissimilar constraints. Existing methods mostly ignore the dissimilar constraints and apply a simple distance metric, such as Euclidean distance, as a separate step. In contrast, our UNCSM approach unifies representation learning and distance metric learning, which preserves relative similarity and accommodates more complex similarity functions, further improving cross-media retrieval accuracy. Experimental results show that our UNCSM approach outperforms 8 state-of-the-art methods on 4 widely used cross-media datasets.
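The two loss terms named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas used in UNCSM, so the code below assumes the common textbook forms of the contrastive and triplet losses; the function names, the margin value, and the use of squared Euclidean distance are illustrative assumptions, not the authors' implementation. In particular, the "double" triplet loss is assumed here to be the sum of one triplet anchored on an image and one anchored on a text.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(x, y, similar, margin=1.0):
    """Textbook contrastive loss for one cross-media pair (assumed form).

    Similar pairs are pulled together; dissimilar pairs are pushed
    apart until their distance reaches the margin.
    """
    if similar:
        return sq_dist(x, y)
    return max(0.0, margin - math.sqrt(sq_dist(x, y))) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet loss (assumed form): the positive should be
    closer to the anchor than the negative by at least the margin."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))


def double_triplet_loss(img, txt_pos, txt_neg, txt, img_pos, img_neg, margin=1.0):
    """Hypothetical 'double' triplet loss: one triplet anchored on an
    image, one anchored on a text, summed."""
    return (triplet_loss(img, txt_pos, txt_neg, margin)
            + triplet_loss(txt, img_pos, img_neg, margin))
```

This sketch compares representations with a fixed Euclidean distance for simplicity. According to the abstract, UNCSM instead uses a learned metric network to compute similarity, which is not reproduced here.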
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 61371128 and 61532005, and by the National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102.
Cite this article
Qi, J., Huang, X. & Peng, Y. Cross-media similarity metric learning with unified deep networks. Multimed Tools Appl 76, 25109–25127 (2017). https://doi.org/10.1007/s11042-017-4726-6