Abstract
This study proposes a cross-modal image retrieval method that adapts a pre-trained cross-modal model to a specific database through parameter-efficient tuning. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of carefully annotated image-text pairs, which may be unavailable for specific databases. Fine-tuning a pre-trained model is one way to reduce this dependency on the amount and quality of training data and to improve retrieval accuracy on a specific personal image database; however, it is parameter-inefficient because a separate copy of the model must be trained and stored for each database. We therefore propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional trainable vectors. The textual and visual prompts are concatenated with the input texts and images, respectively. By optimizing only the prompts to bring paired texts and images close in the common embedding space, the proposed method improves retrieval accuracy while updating only a small number of parameters. The experimental results demonstrate that the proposed method is effective for improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
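The core idea of the abstract, freezing a pre-trained two-tower model and optimizing only small textual and visual prompt vectors with a contrastive objective, can be sketched as follows. This is a minimal illustration under assumed shapes and stand-in encoders (`PromptTunedRetriever`, the toy Transformer backbones, and all dimensions are hypothetical), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptTunedRetriever(nn.Module):
    """Frozen text/image encoders adapted via two trainable prompt tensors."""
    def __init__(self, dim=64, n_text_prompts=4, n_vis_prompts=4):
        super().__init__()
        # Stand-ins for the pre-trained encoders (frozen below).
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
        for p in self.parameters():
            p.requires_grad = False  # freeze the whole backbone
        # The only trainable parameters: textual and visual prompts.
        self.text_prompt = nn.Parameter(torch.randn(n_text_prompts, dim) * 0.02)
        self.vis_prompt = nn.Parameter(torch.randn(n_vis_prompts, dim) * 0.02)

    def encode(self, encoder, prompt, seq):
        # Concatenate the prompt with the input embedding sequence.
        b = seq.size(0)
        seq = torch.cat([prompt.unsqueeze(0).expand(b, -1, -1), seq], dim=1)
        return F.normalize(encoder(seq).mean(dim=1), dim=-1)

    def forward(self, text_emb, img_emb):
        t = self.encode(self.text_encoder, self.text_prompt, text_emb)
        v = self.encode(self.image_encoder, self.vis_prompt, img_emb)
        return t @ v.t()  # cosine-similarity matrix (texts x images)

model = PromptTunedRetriever()
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-2)

# One contrastive step on toy paired data: matched pairs are on the diagonal.
text_emb = torch.randn(8, 10, 64)   # 8 captions, 10 token embeddings each
img_emb = torch.randn(8, 49, 64)    # 8 images, 49 patch embeddings each
logits = model(text_emb, img_emb) / 0.07
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.t(), labels)) / 2
loss.backward()                     # gradients flow only into the prompts
opt.step()
```

Only the two prompt tensors receive gradients, so the storage cost per database is a few hundred floats rather than a full model copy, which is the parameter-efficiency argument of the abstract.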
Data availability
The datasets generated during and/or analyzed during the current study are available in the MSCOCO repository, https://cocodataset.org/.
Acknowledgements
This work was supported by the JSPS KAKENHI Grant Numbers JP21H03456 and JP23K11141.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
This research involved no human or animal subjects.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Yanagi, R., Togo, R. et al. Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts. Int J Multimed Info Retr 13, 14 (2024). https://doi.org/10.1007/s13735-024-00322-y