
Cross-modal retrieval with dual optimization

Published in: Multimedia Tools and Applications

Abstract

Cross-modal retrieval, which supports flexible retrieval across data of different modalities, has gradually attracted the attention of researchers. However, a heterogeneity gap exists between data of different modalities, so their similarity cannot be measured directly. To bridge this gap, researchers project data of different modalities into a common representation space. However, existing methods based on pair or triplet constraints ignore the rich information among samples, which degrades retrieval performance. To fully mine this information, this paper proposes a cross-modal retrieval method with dual optimization (CMRDO). First, the method optimizes the common representation space from both inter-modal and intra-modal perspectives. Second, we introduce an efficient sample construction strategy that avoids sample pairs carrying little information. Finally, the bi-directional retrieval strategy we introduce effectively captures the latent structure of the query modality. On three public datasets, the proposed CMRDO effectively improves cross-modal retrieval accuracy and shows strong generalization ability.
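The ideas sketched in the abstract can be illustrated in a minimal way. Note this is not the paper's CMRDO implementation: the projection matrices `W_img` and `W_txt` below are random stand-ins for learned networks, and `hardest_negative` is only a generic example of filtering out uninformative sample pairs. It shows the common representation space and the bi-directional (image-to-text and text-to-image) retrieval setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-12):
    # Unit-normalize rows so cosine similarity becomes a dot product.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Toy features: 5 paired samples with different native dimensions.
img_feats = rng.standard_normal((5, 64))   # e.g. CNN image features
txt_feats = rng.standard_normal((5, 32))   # e.g. text features

d_common = 16
W_img = rng.standard_normal((64, d_common))  # hypothetical learned projection
W_txt = rng.standard_normal((32, d_common))  # hypothetical learned projection

# Map both modalities into the shared space.
img_common = l2_normalize(img_feats @ W_img)
txt_common = l2_normalize(txt_feats @ W_txt)

def retrieve(queries, gallery, k=3):
    """Top-k gallery indices per query, ranked by cosine similarity."""
    sims = queries @ gallery.T
    return np.argsort(-sims, axis=1)[:, :k]

# Bi-directional retrieval: each modality serves as query and gallery.
img2txt = retrieve(img_common, txt_common)
txt2img = retrieve(txt_common, img_common)

def hardest_negative(sims, positive_idx):
    """Most similar non-matching gallery item per query -- a generic
    stand-in for selecting informative (hard) negative pairs."""
    sims = sims.copy()
    sims[np.arange(len(sims)), positive_idx] = -np.inf
    return np.argmax(sims, axis=1)

hard_neg = hardest_negative(img_common @ txt_common.T, np.arange(5))
print(img2txt.shape, txt2img.shape, hard_neg.shape)
```

In a real system the two projections would be trained jointly so that matching image/text pairs land close together in the common space while hard negatives are pushed apart.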



Author information


Corresponding author

Correspondence to Qingzhen Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, Q., Liu, S., Qiao, H. et al. Cross-modal retrieval with dual optimization. Multimed Tools Appl 82, 7141–7157 (2023). https://doi.org/10.1007/s11042-022-13650-0

