Abstract
Cross-modal retrieval, which enables flexible retrieval across data of different modalities, has gradually attracted the attention of researchers. However, a heterogeneity gap exists between the modalities, so their similarity cannot be measured directly. To bridge this gap, researchers project data of different modalities into a common representation space. Existing methods based on pair or triplet constraints, however, ignore the rich information among samples, which degrades retrieval performance. To fully exploit this information, this paper proposes a cross-modal retrieval method with dual optimization (CMRDO). First, the method optimizes the common representation space from both the inter-modal and the intra-modal perspective. Second, we introduce an efficient sample-construction strategy that avoids sample pairs carrying little information. Finally, the bi-directional retrieval strategy we introduce effectively captures the latent structure of the query modality. On three public datasets, the proposed CMRDO effectively improves cross-modal retrieval accuracy and shows strong generalization ability.
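The dual optimization of the common representation space (an inter-modal term aligning the two modalities, plus intra-modal terms tightening each modality around its class) can be illustrated with a minimal margin-based sketch. The function names, the squared-Euclidean distance, and the mean-positive-versus-mean-negative hinge used here are illustrative assumptions, not the paper's actual objective:

```python
import numpy as np

def pairwise_sq_dist(a, b):
    """Squared Euclidean distances between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def dual_margin_loss(img, txt, labels, margin=1.0):
    """Toy dual-optimization objective: a hinge over mean positive/negative
    distances, applied inter-modally (image vs. text) and intra-modally
    (image vs. image, text vs. text)."""
    same = labels[:, None] == labels[None, :]  # True where class labels match

    def hinge(d, pos_mask):
        pos = d[pos_mask].mean() if pos_mask.any() else 0.0
        neg = d[~pos_mask].mean() if (~pos_mask).any() else 0.0
        return max(0.0, margin + pos - neg)

    # Inter-modal term: matching image/text pairs should be closer than
    # non-matching ones in the common space.
    inter = hinge(pairwise_sq_dist(img, txt), same)
    # Intra-modal terms: samples of the same class should cluster within each
    # modality (self-pairs with distance 0 are included for simplicity).
    intra = (hinge(pairwise_sq_dist(img, img), same)
             + hinge(pairwise_sq_dist(txt, txt), same))
    return float(inter + intra)

# Usage on random common-space embeddings: 4 samples, 8 dimensions, 2 classes.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
txt = rng.standard_normal((4, 8))
labels = np.array([0, 0, 1, 1])
print(dual_margin_loss(img, txt, labels))
```

In a real system the embeddings would come from trained modality-specific networks, and the hinge would typically operate on mined hard pairs rather than all-pairs means, in line with the sample-construction strategy the abstract describes.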
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Xu, Q., Liu, S., Qiao, H. et al. Cross-modal retrieval with dual optimization. Multimed Tools Appl 82, 7141–7157 (2023). https://doi.org/10.1007/s11042-022-13650-0