Abstract
This study proposes a cross-modal image retrieval method that adapts a pre-trained cross-modal model to a specific database through parameter-efficient tuning. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of carefully annotated image-text pairs, which may be unavailable for specific databases. Fine-tuning a pre-trained model is one way to reduce this dependency on the amount and quality of training data and to improve retrieval accuracy on a specific personal image database; however, it is parameter-inefficient because a separate copy of the model must be trained and stored for each database. We therefore propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional trainable vectors. The textual and visual prompts are concatenated with the input texts and images, respectively. By optimizing only the prompts to bring paired texts and images close in the common embedding space, the proposed method improves retrieval accuracy while updating only a small number of parameters. The experimental results demonstrate that the proposed method is effective for improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
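The core idea of the abstract, freezing a pre-trained two-tower model and optimizing only small textual and visual prompt vectors with a contrastive objective, can be sketched as follows. This is a minimal illustration under assumed shapes and stand-in encoders (`PromptTunedRetriever`, the toy Transformer backbones, and all dimensions are hypothetical), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptTunedRetriever(nn.Module):
    """Frozen text/image encoders adapted via two trainable prompt tensors."""
    def __init__(self, dim=64, n_text_prompts=4, n_vis_prompts=4):
        super().__init__()
        # Stand-ins for the pre-trained encoders (frozen below).
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
        for p in self.parameters():
            p.requires_grad = False  # freeze the whole backbone
        # The only trainable parameters: textual and visual prompts.
        self.text_prompt = nn.Parameter(torch.randn(n_text_prompts, dim) * 0.02)
        self.vis_prompt = nn.Parameter(torch.randn(n_vis_prompts, dim) * 0.02)

    def encode(self, encoder, prompt, seq):
        # Concatenate the prompt with the input embedding sequence.
        b = seq.size(0)
        seq = torch.cat([prompt.unsqueeze(0).expand(b, -1, -1), seq], dim=1)
        return F.normalize(encoder(seq).mean(dim=1), dim=-1)

    def forward(self, text_emb, img_emb):
        t = self.encode(self.text_encoder, self.text_prompt, text_emb)
        v = self.encode(self.image_encoder, self.vis_prompt, img_emb)
        return t @ v.t()  # cosine-similarity matrix (texts x images)

model = PromptTunedRetriever()
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-2)

# One contrastive step on toy paired data: matched pairs are on the diagonal.
text_emb = torch.randn(8, 10, 64)   # 8 captions, 10 token embeddings each
img_emb = torch.randn(8, 49, 64)    # 8 images, 49 patch embeddings each
logits = model(text_emb, img_emb) / 0.07
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.t(), labels)) / 2
loss.backward()                     # gradients flow only into the prompts
opt.step()
```

Only the two prompt tensors receive gradients, so the storage cost per database is a few hundred floats rather than a full model copy, which is the parameter-efficiency argument of the abstract.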
Data availability
The datasets generated during and/or analyzed during the current study are available in the MSCOCO repository, https://cocodataset.org/.
Acknowledgements
This work was supported by the JSPS KAKENHI Grant Numbers JP21H03456 and JP23K11141.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
This research involved no human or animal subjects.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Yanagi, R., Togo, R. et al. Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts. Int J Multimed Info Retr 13, 14 (2024). https://doi.org/10.1007/s13735-024-00322-y