
Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval

  • Part of the collection: Recent Advances in AI-Powered Multimedia Visual Computing and Multimodal Signal Processing for Metaverse Era

Multimedia Tools and Applications

Abstract

As multimedia technologies advance, the processing of untagged image-text data has become central to cross-modal retrieval. However, current methods often neglect three critical issues when learning hash codes: (1) incomplete feature representations limit the capture of diverse latent semantics; (2) binary codes produced under a quantization loss lack overall constraints and global interaction; (3) prioritizing retrieval performance overlooks modality robustness, leading to significant disparities between multi-modal retrieval directions. To address these challenges, we introduce Hierarchical Modal Interaction Balance hashing (HMIB), an unsupervised cross-modal hashing algorithm. We leverage deep feature encoders built on pre-trained models such as CLIP and VGG to capture latent semantic associations learned from natural language and image classification. A hierarchical interactive modal similarity generator introduces constraints throughout the learning process and corrects ambiguous edge semantic data, enhancing robustness and producing high-quality hash codes. Extensive experiments on three widely used datasets show that HMIB maintains strong retrieval performance while minimizing cross-modal retrieval disparities.
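To make the pipeline described above concrete, the following is a minimal sketch, not the authors' HMIB implementation, of unsupervised cross-modal hashing guided by a fused intra-modal similarity matrix and trained with a quantization penalty. All names, dimensions, and loss weights (e.g. HashHead, code_len, the 0.1 weight) are illustrative assumptions, and random tensors stand in for features that would in practice come from pre-trained CLIP/VGG encoders.

```python
# Minimal sketch (assumed structure, not the paper's exact method):
# unsupervised cross-modal hashing guided by a fused image-text similarity matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, img_dim, txt_dim, code_len = 32, 512, 512, 64

# Placeholder features; in practice these come from pre-trained CLIP/VGG encoders.
img_feat = F.normalize(torch.randn(batch, img_dim), dim=1)
txt_feat = F.normalize(torch.randn(batch, txt_dim), dim=1)

# Intra-modal cosine similarities, fused into a joint guidance matrix.
sim_img = img_feat @ img_feat.t()
sim_txt = txt_feat @ txt_feat.t()
sim_fused = 0.5 * (sim_img + sim_txt)  # illustrative equal-weight fusion

class HashHead(nn.Module):
    """Maps continuous features to relaxed hash codes in (-1, 1)."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_len), nn.Tanh())
    def forward(self, x):
        return self.net(x)

img_head, txt_head = HashHead(img_dim, code_len), HashHead(txt_dim, code_len)
opt = torch.optim.Adam(
    list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4)

for step in range(100):
    h_img, h_txt = img_head(img_feat), txt_head(txt_feat)
    # Cross-modal code similarity should reconstruct the fused guidance matrix.
    code_sim = (h_img @ h_txt.t()) / code_len
    sim_loss = F.mse_loss(code_sim, sim_fused)
    # Quantization loss pushes relaxed codes toward binary {-1, +1}.
    quant_loss = ((h_img.abs() - 1) ** 2).mean() + ((h_txt.abs() - 1) ** 2).mean()
    loss = sim_loss + 0.1 * quant_loss  # 0.1 is an illustrative weight
    opt.zero_grad(); loss.backward(); opt.step()

# Final binary codes used for retrieval.
with torch.no_grad():
    b_img = torch.sign(img_head(img_feat))
    b_txt = torch.sign(txt_head(txt_feat))
```

In this kind of setup, retrieval then ranks items of the opposite modality by Hamming distance between the binary codes; the fused similarity matrix acts as the unsupervised supervisory signal in place of labels.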


Data Availability Statement

Data from this study will be released at a later date.


Acknowledgements

This work was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), and the Open Foundation of Yunnan Key Laboratory of Software Engineering (Grant No. 2023SE204).

Author information


Corresponding author

Correspondence to Mingyong Li.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, J., Lin, Z., Jiang, X. et al. Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19371-w

