Pretrained models for cross-modal retrieval: experiments and improvements

  • Original Paper
Signal, Image and Video Processing

Abstract

Cross-modal retrieval, the task of retrieving relevant data in one modality in response to a query from another, has become increasingly important with the growth of multimodal data. This paper proposes using the pretrained model CLIP as the backbone of a cross-modal retrieval system and explores several methods to enhance its performance. The proposed approach reduces the output feature dimension to 384, cutting model parameters, storage requirements, and retrieval time by 62.5%. Cross-training on the training dataset both strengthens the model's intermodal invariance and enables multimodal retrieval. Residual connections and an increased dropout ratio of 30% further improve average retrieval performance. In addition, we propose using class proxies as substitutes for missing data so that training can proceed on an incomplete (imbalanced) dataset. The approach is evaluated on four benchmark datasets, Wikipedia, NUS-WIDE, Pascal-Sentence, and XmediaNet, achieving retrieval performance improvements of 3.4%, 1.9%, 2.3%, and 5.8%, respectively. The results show that the proposed approach significantly improves cross-modal retrieval, outperforming state-of-the-art methods on these benchmarks while reducing model parameters, retrieval time, and database storage space.
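To make the architectural ideas in the abstract concrete, the sketch below shows one way such a retrieval head could look: frozen CLIP features projected into a shared 384-dimensional space with a residual connection and 30% dropout, followed by cosine-similarity ranking. This is a minimal illustration only; the input dimension (512), layer names, the use of PyTorch, and the `retrieve` helper are assumptions for exposition and are not the authors' implementation.

```python
# Hypothetical sketch of a 384-d projection head over CLIP features,
# with a residual connection and 30% dropout as described in the abstract.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps a CLIP feature (e.g., 512-d) into a shared 384-d retrieval space."""

    def __init__(self, in_dim: int = 512, out_dim: int = 384, dropout: float = 0.3):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)   # dimension reduction to 384
        self.fc = nn.Linear(out_dim, out_dim)    # refinement layer
        self.dropout = nn.Dropout(dropout)       # 30% dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)
        h = h + self.dropout(F.relu(self.fc(h)))  # residual connection
        return F.normalize(h, dim=-1)             # unit norm for cosine retrieval


def retrieve(query_emb: torch.Tensor, db_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Rank database items of the other modality by cosine similarity."""
    scores = query_emb @ db_emb.t()               # embeddings are already normalized
    return scores.topk(k, dim=-1).indices         # indices of the top-k matches


if __name__ == "__main__":
    head = ProjectionHead()
    img_feats = torch.randn(8, 512)               # stand-in for CLIP image features
    txt_feats = torch.randn(100, 512)             # stand-in for CLIP text features
    print(retrieve(head(img_feats), head(txt_feats)).shape)  # torch.Size([8, 5])
```

Because both modalities are mapped into the same 384-dimensional space, the same head can serve image-to-text and text-to-image queries, which is consistent with the reported savings in parameters, storage, and retrieval time.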

Data availability

All datasets used in this research are publicly available on the internet.

Acknowledgements

This research was supported by the Prototype Research Grant Scheme (PRGS) of the Ministry of Higher Education Malaysia [PRGS/1/2021/ICT02/USM/02/1], the Hubert Curien Partnership (PHC-Hibiscus) Research Grant Scheme of the Ministry of Europe and Foreign Affairs and the Ministry of Higher Education Malaysia [MyPAIR/1/2020/ICT02/USM//1], and the General Research Project of the Department of Education of Zhejiang Province of China [Y202147706].

Funding

Ministry of Higher Education, Malaysia, PRGS/1/2021/ICT02/USM/02/1. Ministry of Europe and Foreign Affairs, and Ministry of Higher Education Malaysia, MyPAIR/1/2020/ICT02/USM//1. Department of Education of Zhejiang Province of China, Y202147706.

Author information

Contributions

Kun Zhou conceived the research idea, conducted the experiments, and analyzed the data. Fadratul Hafinaz Hassan and K.H. Gan contributed to the experimental design, data analysis, and interpretation of the results. All authors co-wrote the manuscript and approved the final version for submission.

Corresponding author

Correspondence to Fadratul Hafinaz Hassan.

Ethics declarations

Conflict of interest

Not applicable.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, K., Hassan, F.H. & Gan, K.H. Pretrained models for cross-modal retrieval: experiments and improvements. SIViP 18, 4915–4923 (2024). https://doi.org/10.1007/s11760-024-03126-z
