
Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval

  • Part of the collection: Recent Advances in AI-Powered Multimedia Visual Computing and Multimodal Signal Processing for Metaverse Era

Multimedia Tools and Applications

Abstract

As multimedia technologies advance, the processing of untagged image-text data has become central to cross-modal retrieval. However, current methods often neglect three critical issues when learning hash codes: (1) incomplete feature representations limit the capture of diverse latent semantics; (2) binary codes produced under a quantization loss lack overall constraints and global interaction; (3) prioritizing retrieval performance overlooks modality robustness, leading to significant disparities between multi-modal retrieval directions. To address these challenges, we introduce Hierarchical Modal Interaction Balance hashing (HMIB), an unsupervised cross-modal hashing algorithm. We leverage deep feature encoders built on pre-trained models such as CLIP and VGG to capture latent semantic associations learned from natural language and image classification. A hierarchical interactive modal similarity generator introduces constraints throughout the learning process and corrects ambiguous edge semantic data, enhancing robustness and producing high-quality hash codes. Extensive experiments on three widely used datasets show that HMIB maintains strong retrieval performance while minimizing cross-modal retrieval disparities.
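To make the pipeline described above concrete, the following is a minimal sketch, not the authors' HMIB implementation, of unsupervised cross-modal hashing guided by a fused intra-modal similarity matrix and trained with a quantization penalty. All names, dimensions, and loss weights (e.g. HashHead, code_len, the 0.1 weight) are illustrative assumptions, and random tensors stand in for features that would in practice come from pre-trained CLIP/VGG encoders.

```python
# Minimal sketch (assumed structure, not the paper's exact method):
# unsupervised cross-modal hashing guided by a fused image-text similarity matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, img_dim, txt_dim, code_len = 32, 512, 512, 64

# Placeholder features; in practice these come from pre-trained CLIP/VGG encoders.
img_feat = F.normalize(torch.randn(batch, img_dim), dim=1)
txt_feat = F.normalize(torch.randn(batch, txt_dim), dim=1)

# Intra-modal cosine similarities, fused into a joint guidance matrix.
sim_img = img_feat @ img_feat.t()
sim_txt = txt_feat @ txt_feat.t()
sim_fused = 0.5 * (sim_img + sim_txt)  # illustrative equal-weight fusion

class HashHead(nn.Module):
    """Maps continuous features to relaxed hash codes in (-1, 1)."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_len), nn.Tanh())
    def forward(self, x):
        return self.net(x)

img_head, txt_head = HashHead(img_dim, code_len), HashHead(txt_dim, code_len)
opt = torch.optim.Adam(
    list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4)

for step in range(100):
    h_img, h_txt = img_head(img_feat), txt_head(txt_feat)
    # Cross-modal code similarity should reconstruct the fused guidance matrix.
    code_sim = (h_img @ h_txt.t()) / code_len
    sim_loss = F.mse_loss(code_sim, sim_fused)
    # Quantization loss pushes relaxed codes toward binary {-1, +1}.
    quant_loss = ((h_img.abs() - 1) ** 2).mean() + ((h_txt.abs() - 1) ** 2).mean()
    loss = sim_loss + 0.1 * quant_loss  # 0.1 is an illustrative weight
    opt.zero_grad(); loss.backward(); opt.step()

# Final binary codes used for retrieval.
with torch.no_grad():
    b_img = torch.sign(img_head(img_feat))
    b_txt = torch.sign(txt_head(txt_feat))
```

In this kind of setup, retrieval then ranks items of the opposite modality by Hamming distance between the binary codes; the fused similarity matrix acts as the unsupervised supervisory signal in place of labels.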


Data Availability Statement

Data from this study will be released at a later date.


Acknowledgements

This work was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), and the Open Foundation of Yunnan Key Laboratory of Software Engineering (Grant No. 2023SE204).

Author information


Corresponding author

Correspondence to Mingyong Li.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, J., Lin, Z., Jiang, X. et al. Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19371-w

