
Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

  • Conference paper
  • Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14425)


Abstract

Traditional image-to-text retrieval models learn joint representations by aligning multimodal features, but they typically capture only weak correlations between image and text data, which can introduce noise during modality alignment. To solve this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. Subsequently, we seamlessly integrate the causal nodes extracted from the multimodal causal graph as learnable prompts within the CLIP model, giving rise to the novel Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP effectively strengthens the correlation between causal variables in the image and text modalities, leading to improved alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
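
The abstract's core mechanism (causal nodes injected into CLIP as learnable prompts, trained with a contrastive image-text objective) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, dimensions, number of prompts, and the choice of pooling are all assumptions made for illustration.

```python
# Minimal sketch: causal-node prompts prepended to a CLIP-style text encoder.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalPromptTextEncoder(nn.Module):
    def __init__(self, vocab_size=49408, dim=512, num_causal_prompts=8,
                 num_layers=4, num_heads=8, max_len=77):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Parameter(0.01 * torch.randn(max_len + num_causal_prompts, dim))
        # One learnable prompt vector per causal node taken from the
        # multimodal causal graph (the node count here is assumed).
        self.causal_prompts = nn.Parameter(0.01 * torch.randn(num_causal_prompts, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, token_ids):                      # token_ids: (B, L)
        b = token_ids.size(0)
        tokens = self.token_embed(token_ids)           # (B, L, D)
        prompts = self.causal_prompts.unsqueeze(0).expand(b, -1, -1)  # (B, P, D)
        x = torch.cat([prompts, tokens], dim=1)        # prepend causal prompts
        x = x + self.pos_embed[: x.size(1)]
        x = self.encoder(x)
        # Illustrative pooling: use the first (prompt) position as the text feature.
        return F.normalize(self.proj(x[:, 0]), dim=-1)

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs, as in CLIP."""
    logits = img_feats @ txt_feats.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch, any CLIP-compatible image encoder producing L2-normalized features of the same dimension would serve as the image tower; training would minimize `clip_contrastive_loss` over matched image-text batches so that the causal prompts are learned jointly with the alignment objective.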

D. Cao is the corresponding author.


Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 62076210, No. 81973752), the Natural Science Foundation of Xiamen city (No. 3502Z20227188) and the Open Project Program of The Key Laboratory of Cognitive Computing and Intelligent Information Processing of Fujian Education Institutions, Wuyi University (No. KLCCIIP2020203).

Author information

Corresponding author

Correspondence to Donglin Cao.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Feng, W., Lin, D., Cao, D. (2024). Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_17

  • DOI: https://doi.org/10.1007/978-981-99-8429-9_17

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8428-2

  • Online ISBN: 978-981-99-8429-9

  • eBook Packages: Computer Science, Computer Science (R0)
