
Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts

  • Regular Paper
  • Published:
International Journal of Multimedia Information Retrieval

Abstract

A novel cross-modal image retrieval method that parameter-efficiently tunes a pre-trained cross-modal model is proposed in this study. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of manually annotated image-text pairs, which may be unavailable for specific databases. To reduce the dependency on the amount and quality of training data, fine-tuning a pre-trained model is one approach to improving retrieval accuracy on specific personal image databases. However, this approach is parameter inefficient because separate models must be trained and retained for different databases. Thus, we propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are then concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close in the common embedding space, the proposed method improves retrieval accuracy while updating only a few parameters. Experimental results demonstrate that the proposed method effectively improves retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
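
To make the mechanism concrete, the following PyTorch sketch illustrates this prompt-learning setup under stated assumptions: a frozen pre-trained text/image encoder pair, two trainable prompt tensors prepended to the token and patch embedding sequences, and a symmetric contrastive loss over the common embedding space. The class name `PromptTuner`, all dimensions, and the identity encoders used in the toy usage are hypothetical stand-ins, not the authors' implementation.

```python
# A minimal PyTorch sketch of the prompt-tuning idea summarized in the abstract:
# a frozen pre-trained text/image encoder pair, trainable textual and visual
# prompt vectors concatenated with the input token/patch embeddings, and a
# symmetric contrastive loss that pulls paired texts and images together in the
# common embedding space. All names, dimensions, and the identity encoders are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptTuner(nn.Module):
    def __init__(self, text_encoder, image_encoder, embed_dim=512,
                 n_text_prompts=8, n_visual_prompts=8):
        super().__init__()
        self.text_encoder = text_encoder    # frozen pre-trained text encoder
        self.image_encoder = image_encoder  # frozen pre-trained image encoder
        for p in self.parameters():         # freeze everything defined so far
            p.requires_grad_(False)
        # Only these prompt vectors are updated (parameter-efficient tuning).
        self.text_prompt = nn.Parameter(0.02 * torch.randn(n_text_prompts, embed_dim))
        self.visual_prompt = nn.Parameter(0.02 * torch.randn(n_visual_prompts, embed_dim))

    def forward(self, text_tokens, image_patches):
        b = text_tokens.size(0)
        # Prepend the trainable prompts to the token / patch embedding sequences.
        t = torch.cat([self.text_prompt.expand(b, -1, -1), text_tokens], dim=1)
        v = torch.cat([self.visual_prompt.expand(b, -1, -1), image_patches], dim=1)
        # Mean-pool the encoder outputs into single embeddings and normalize.
        t_emb = F.normalize(self.text_encoder(t).mean(dim=1), dim=-1)
        v_emb = F.normalize(self.image_encoder(v).mean(dim=1), dim=-1)
        return t_emb, v_emb


def contrastive_loss(t_emb, v_emb, temperature=0.07):
    # Symmetric InfoNCE: matching text-image pairs on the diagonal are positives.
    logits = t_emb @ v_emb.t() / temperature
    labels = torch.arange(t_emb.size(0), device=t_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Toy usage with identity modules standing in for frozen pre-trained encoders.
model = PromptTuner(nn.Identity(), nn.Identity(), embed_dim=512)
optimizer = torch.optim.AdamW([model.text_prompt, model.visual_prompt], lr=1e-3)
text_tokens = torch.randn(4, 16, 512)    # (batch, tokens, embed_dim)
image_patches = torch.randn(4, 49, 512)  # (batch, patches, embed_dim)
loss = contrastive_loss(*model(text_tokens, image_patches))
loss.backward()
optimizer.step()
```

The point of the sketch is that the optimizer only receives the two prompt tensors, so adapting to a new database means storing a small set of prompt parameters rather than a separately fine-tuned copy of the whole model, which is what the parameter-efficiency claim in the abstract refers to.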


Data availability

The datasets generated during and/or analyzed during the current study are available in the MSCOCO repository, https://cocodataset.org/.


Acknowledgements

This work was supported by the JSPS KAKENHI Grant Numbers JP21H03456 and JP23K11141.

Author information


Corresponding author

Correspondence to Miki Haseyama.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Ethical approval

This research did not involve human or animal subjects.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, H., Yanagi, R., Togo, R. et al. Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts. Int J Multimed Info Retr 13, 14 (2024). https://doi.org/10.1007/s13735-024-00322-y


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13735-024-00322-y
