Abstract
In this paper, we propose a fine-grained image classification (FGIC) network that enhances feature relationships across multiple stages. Since scene text was introduced into FGIC and retrieval, the basic architecture of local, global, and textual feature encoders plus a classifier has proven effective. Our method retains these components and extends them into a five-module architecture. Specifically, positional encoding is incorporated into both the local and textual feature encoders so that the complementary spatial information they carry can contribute to feature representation. Within these two encoders, intra-modal semantic relation reasoning is introduced for FGIC through a proposed General Feature Relation Enhancement (GFRE) module. GFRE is a feature reasoning module applicable to any two inputs, whether from the same modality or from different modalities, and it adopts graph attention, which represents and infers relationships among graph-structured data. Moreover, a recent multi-modal reasoning module is improved by a proposed Multi-Head Multi-Modal Joint Semantic Reasoning module, which combines cross-modal GFREs through multi-head fusion. Experimental results on multiple datasets verify the effectiveness of the proposed method.
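The graph-attention mechanism the GFRE module builds on can be sketched as follows. This is a minimal, illustrative single-head layer in the style of Veličković et al.'s graph attention networks, not the authors' implementation; the function name `graph_attention` and the dense-adjacency interface are assumptions for the sketch.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention(H, A, W, a):
    """One single-head graph-attention layer (illustrative sketch).
    H: (N, F) node features; A: (N, N) adjacency with self-loops (1 = edge);
    W: (F, F2) shared linear map; a: (2*F2,) attention vector."""
    Z = H @ W                       # project node features: (N, F2)
    N = Z.shape[0]
    out = np.zeros_like(Z)
    for i in range(N):
        nbrs = np.nonzero(A[i])[0]
        # e_ij = LeakyReLU(a^T [z_i || z_j]) for each neighbour j
        e = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                      for j in nbrs])
        alpha = softmax(e)          # attention coefficients over neighbours
        out[i] = alpha @ Z[nbrs]    # attention-weighted aggregation
    return out
```

A multi-head variant, as in the proposed joint reasoning module, would run several such layers with independent `W` and `a` and fuse (e.g. concatenate or average) their outputs.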
Acknowledgements
This work was supported by the National Nature Science Foundation of China under Grant 61872143.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Li, W., Zhu, H., Yang, S. et al. GA-SRN: graph attention based text-image semantic reasoning network for fine-grained image classification and retrieval. Neural Comput & Applic 34, 21387–21401 (2022). https://doi.org/10.1007/s00521-022-07617-3