
Cross-media similarity metric learning with unified deep networks

Multimedia Tools and Applications

Abstract

As a prominent research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning a better shared representation and distance metric for multimedia data is important for improving cross-media retrieval. Motivated by the strong ability of deep neural networks to learn feature representations and comparison functions, we propose the Unified Network for Cross-media Similarity Metric (UNCSM), which associates cross-media shared representation learning with distance metric learning in a unified framework. First, we design a two-pathway deep network pretrained with a contrastive loss, and employ a double triplet similarity loss for fine-tuning to learn the shared representation for each media type by modeling the relative semantic similarity. Second, we design a metric network that effectively calculates the cross-media similarity of the shared representations by modeling pairwise similar and dissimilar constraints. Compared with existing methods, which mostly ignore the dissimilar constraints and apply only a simple distance metric such as Euclidean distance as a separate step, UNCSM unifies representation learning and distance metric learning to preserve the relative similarity and to support more complex similarity functions, further improving cross-media retrieval accuracy. Experimental results show that UNCSM outperforms 8 state-of-the-art methods on 4 widely used cross-media datasets.
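
To make the two loss terms named in the abstract concrete, the following is a minimal NumPy sketch of a generic contrastive loss and a generic triplet loss applied in both retrieval directions. The abstract does not give the exact formulations used in UNCSM, so the function names, the squared Euclidean distances, the margin values, and the reading of "double triplet" as one triplet term per direction (image to text, and text to image) are all assumptions made for illustration.

```python
import numpy as np


def contrastive_loss(x, y, same, margin=1.0):
    """Generic contrastive loss over paired embeddings (illustrative only).

    x, y : (n, d) arrays of embeddings from the two pathways.
    same : (n,) array, 1 for a similar pair and 0 for a dissimilar pair.
    """
    same = np.asarray(same, dtype=float)
    dist = np.linalg.norm(x - y, axis=1)
    # Similar pairs are pulled together; dissimilar pairs are pushed
    # apart until they are at least `margin` away from each other.
    pull = same * dist ** 2
    push = (1.0 - same) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push) / 2.0)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge triplet loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, margin + d_pos - d_neg)))


def double_triplet_loss(img, txt_pos, txt_neg, txt, img_pos, img_neg, margin=1.0):
    """Assumed reading of 'double triplet': one term per retrieval direction."""
    image_to_text = triplet_loss(img, txt_pos, txt_neg, margin)
    text_to_image = triplet_loss(txt, img_pos, img_neg, margin)
    return image_to_text + text_to_image
```

Both terms share one idea: a semantically matching cross-media pair should sit closer in the shared space than a non-matching one. The triplet form expresses this as a relative constraint, which is what the abstract means by preserving relative similarity.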

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61371128 and 61532005, and by the National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102.

Author information

Corresponding author

Correspondence to Yuxin Peng.

About this article

Cite this article

Qi, J., Huang, X. & Peng, Y. Cross-media similarity metric learning with unified deep networks. Multimed Tools Appl 76, 25109–25127 (2017). https://doi.org/10.1007/s11042-017-4726-6

  • DOI: https://doi.org/10.1007/s11042-017-4726-6
