Abstract
There is a surge of interest in cross-modal representation learning, mainly concerning images and text. Image-text matching is one major challenge among cross-modal tasks. Traditional methods encode the features of each modality separately along multiple paths and project them into a shared latent space. Recently, the development of pre-trained models has inspired learning cross-modal features jointly, boosting performance through large-scale data. However, traditional methods become less effective when both modalities use pre-trained uni-modal encoders, while methods that encode features jointly face an unacceptable computational cost during inference, making them less valuable for real-time applications. In this paper, we first explore the pros and cons of these methods, and then propose an enhanced separate encoding framework that uses an extra encoding process to project the multi-layer features of pre-trained encoders into a similar latent space. Experiments show that our framework outperforms current methods that do not use large-scale image-text pairs on both the Flickr30K and MS-COCO datasets, while keeping inference cost minimal.
K. Wen and L. Li—Equal contribution.
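To make the separate-encoding idea from the abstract concrete, below is a minimal sketch, not the authors' implementation: a small head that mixes multi-layer hidden states from a frozen pre-trained encoder and projects them into a shared embedding space. The class name `MultiLayerProjector`, the ELMo-style learned layer weighting, and the mean pooling are illustrative assumptions; the paper's actual fusion module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerProjector(nn.Module):
    """Hypothetical extra encoding step: fuse hidden states from several
    layers of a frozen pre-trained encoder and project them into a shared
    latent space. Assumed design, for illustration only."""
    def __init__(self, num_layers: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        # Learned softmax weights over encoder layers (ELMo-style mixing).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, hidden_dim)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * layer_states).sum(dim=0)   # (batch, seq_len, hidden_dim)
        pooled = fused.mean(dim=1)              # pool over tokens / regions
        return F.normalize(self.proj(pooled), dim=-1)

# Example with fake BERT-sized hidden states (12 layers, dim 768).
projector = MultiLayerProjector(num_layers=12, hidden_dim=768, embed_dim=512)
txt_emb = projector(torch.randn(12, 4, 20, 768))   # (4, 512)
img_emb = projector(torch.randn(12, 4, 36, 768))   # (4, 512), own projector in practice

# Matching reduces to a cheap cosine-similarity matrix at inference time.
scores = txt_emb @ img_emb.T
```

Because each modality is encoded independently, scoring N images against M texts needs only N + M encoder passes plus this similarity matrix, rather than the N × M joint forward passes a cross-attention model would require; this is the inference-cost advantage the abstract refers to.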
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant 61771145.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Wen, K., Li, L., Gu, X. (2021). Enhancing Separate Encoding with Multi-layer Feature Alignment for Image-Text Matching. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol 12891. Springer, Cham. https://doi.org/10.1007/978-3-030-86362-3_33
DOI: https://doi.org/10.1007/978-3-030-86362-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86361-6
Online ISBN: 978-3-030-86362-3
eBook Packages: Computer Science, Computer Science (R0)