Abstract
As multimedia technologies advance, processing unlabeled image-text data has become central to cross-modal retrieval. However, current methods often neglect three critical issues when learning hash codes: (1) incomplete feature representations limit the ability to capture diverse latent semantics; (2) binary codes learned through a quantisation loss lack overall constraints and global interaction; and (3) prioritizing retrieval performance overlooks modality robustness, leading to significant disparities between retrieval directions. To address these challenges, we introduce HMIB, an unsupervised cross-modal hashing algorithm. We build deep feature encoders on pre-trained models such as CLIP and VGG, capturing latent semantic associations across the natural language and image classification domains. A hierarchical interactive modal similarity generator imposes constraints throughout the learning process and corrects ambiguous edge semantic data, enhancing robustness and generating high-quality hash codes. Extensive experiments on three widely used datasets show that HMIB maintains high retrieval performance while minimizing cross-modal retrieval disparities.
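To make the terms "hash codes" and "quantisation loss" concrete, the following is a minimal sketch of generic sign-based hashing and Hamming-distance ranking. It is not the HMIB method: the function names, the squared-error form of the loss, and the toy vectors are all assumptions chosen for illustration.

```python
# Illustrative sketch only: generic binary hashing, not the HMIB algorithm.
# All names and the squared-error loss form are assumptions.

def binarize(features):
    """Map real-valued features to a {-1, +1} hash code via the sign function."""
    return [1 if x >= 0 else -1 for x in features]

def quantization_loss(features):
    """Squared distance between real-valued features and their binary code."""
    code = binarize(features)
    return sum((x - b) ** 2 for x, b in zip(features, code))

def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )

# Toy cross-modal query: a text code ranked against three image codes.
text_code = binarize([0.7, 0.1, 0.4, 0.9])
image_codes = [
    binarize([-0.3, -0.8, -0.2, -0.6]),
    binarize([0.5, 0.2, 0.6, -0.1]),
    binarize([0.9, 0.3, 0.8, 0.4]),
]
ranking = rank_by_hamming(text_code, image_codes)
```

A loss of this kind penalizes each sample's code independently, which is one way to read the abstract's remark that quantisation-based codes lack global interaction.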
Data Availability Statement
Data from this study will be released at a later date.
Acknowledgements
This work was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), and the Open Foundation of Yunnan Key Laboratory of Software Engineering (Grant No. 2023SE204).
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Zhang, J., Lin, Z., Jiang, X. et al. Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19371-w
DOI: https://doi.org/10.1007/s11042-024-19371-w