Abstract
In this paper, we propose a fine-grained image classification (FGIC) network that enhances feature relationships across multiple stages. Since scene text was introduced into FGIC and retrieval, the basic architecture of local, global, and textual feature encoders plus a classifier has proven effective. Our method retains these components and extends them into a five-module architecture. Specifically, positional encoding is incorporated into both the local and textual feature encoders so that the complementary spatial information they carry can contribute to feature representation. Within these two encoders, intra-modal semantic relation reasoning is introduced for FGIC through a proposed General Feature Relation Enhancement (GFRE) module. GFRE is a feature reasoning module applicable to any two inputs, whether from the same modality or from different modalities, and it adopts graph attention, which represents and infers relationships among graph-structured data. Moreover, a recent multi-modal reasoning module is improved by a proposed Multi-Head Multi-Modal Joint Semantic Reasoning module, which combines cross-modal GFREs through multi-head fusion. Experimental results on multiple datasets verify the effectiveness of the proposed method.
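The graph-attention mechanism the GFRE module builds on can be sketched as follows. This is a minimal, illustrative single-head layer in the style of Veličković et al.'s graph attention networks, not the authors' implementation; the function name `graph_attention` and the dense-adjacency interface are assumptions for the sketch.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention(H, A, W, a):
    """One single-head graph-attention layer (illustrative sketch).
    H: (N, F) node features; A: (N, N) adjacency with self-loops (1 = edge);
    W: (F, F2) shared linear map; a: (2*F2,) attention vector."""
    Z = H @ W                       # project node features: (N, F2)
    N = Z.shape[0]
    out = np.zeros_like(Z)
    for i in range(N):
        nbrs = np.nonzero(A[i])[0]
        # e_ij = LeakyReLU(a^T [z_i || z_j]) for each neighbour j
        e = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                      for j in nbrs])
        alpha = softmax(e)          # attention coefficients over neighbours
        out[i] = alpha @ Z[nbrs]    # attention-weighted aggregation
    return out
```

A multi-head variant, as in the proposed joint reasoning module, would run several such layers with independent `W` and `a` and fuse (e.g. concatenate or average) their outputs.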
Acknowledgements
This work was supported by the National Nature Science Foundation of China under Grant 61872143.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Li, W., Zhu, H., Yang, S. et al. GA-SRN: graph attention based text-image semantic reasoning network for fine-grained image classification and retrieval. Neural Comput & Applic 34, 21387–21401 (2022). https://doi.org/10.1007/s00521-022-07617-3