Abstract
As multi-modal data proliferates, users are no longer satisfied with a single mode of data retrieval for accessing information. Deep hashing retrieval algorithms have attracted considerable attention for their efficient storage and fast query speed. Existing unsupervised hashing methods generally have two limitations: (1) they fail to adequately capture the latent semantic relevance and coexistent information across the different modalities, so the learned features and hash codes cannot effectively bridge the heterogeneous and semantic gaps in multi-modal data; (2) they typically construct a similarity matrix to guide hash code learning, which suffers from inaccurate similarity estimates and thus yields sub-optimal retrieval performance. To address these issues, we propose a novel CLIP-based fusion-modal reconstructing hashing method for large-scale unsupervised cross-modal retrieval. First, we use CLIP to encode cross-modal features of the image and text modalities, and learn a common hash-code representation space with modality-specific autoencoders. Second, we propose an efficient fusion approach that constructs a semantically complementary affinity matrix, maximizing the potential semantic relevance between instances of different modalities. Furthermore, to retain the intrinsic semantic similarity of all similar pairs in the learned hash codes, we design a similarity-reconstruction objective based on semantic complementation to learn high-quality hash code representations. Extensive experiments on four multi-modal benchmark datasets (WIKI, MIRFLICKR, NUS-WIDE, and MS COCO) show that the proposed method achieves state-of-the-art image-text retrieval performance compared with several representative unsupervised cross-modal hashing methods.
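The two central ideas of the abstract, fusing per-modality similarity matrices into one complementary affinity matrix and reconstructing that affinity from the learned hash codes, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the convex weighting `alpha`, the random feature matrices standing in for CLIP image/text embeddings, and the plain mean-squared-error loss are all simplifying assumptions made here for illustration.

```python
import numpy as np

def cosine_sim(X):
    # Row-normalize, then the inner product gives pairwise cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def fused_affinity(F_img, F_txt, alpha=0.5):
    """Fuse image and text cosine-similarity matrices into a single
    affinity matrix (a hypothetical convex weighting, not the paper's scheme)."""
    return alpha * cosine_sim(F_img) + (1 - alpha) * cosine_sim(F_txt)

def reconstruction_loss(B, S):
    """MSE between the scaled inner product of {-1,+1} hash codes
    (which lies in [-1, 1]) and the target affinity matrix S."""
    k = B.shape[1]
    S_hat = (B @ B.T) / k
    return np.mean((S_hat - S) ** 2)

rng = np.random.default_rng(0)
F_img = rng.normal(size=(8, 512))      # stand-ins for CLIP image features
F_txt = rng.normal(size=(8, 512))      # stand-ins for CLIP text features
S = fused_affinity(F_img, F_txt)       # semantically fused affinity target
B = np.sign(rng.normal(size=(8, 64)))  # random 64-bit hash codes
loss = reconstruction_loss(B, S)       # objective a hashing network would minimize
```

In a full method, `B` would be the relaxed output of the modality-specific autoencoders and `loss` would be minimized by gradient descent; here the random codes merely show the shape of the objective.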
Data availability statement
The data that support the findings of this study are available at https://github.com/AwakerLee/CFRH.
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (No. 62003065), the Science and Technology Project of the Chongqing Education Commission of China (KJQN201900520), and the Chongqing Normal University Fund (No. 22XLB003).
Author information
Authors and Affiliations
Contributions
ML and YL conceived the idea of the study; MG and LM analyzed the data and interpreted the results; YL and ML wrote the paper; all authors discussed the results and revised the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mingyong, L., Yewen, L., Mingyuan, G. et al. CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval. Int J Multimed Info Retr 12, 2 (2023). https://doi.org/10.1007/s13735-023-00268-7