Siamese transformer with hierarchical concept embedding for fine-grained image recognition

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Distinguishing the subtle differences among fine-grained images from subordinate concepts of a concept hierarchy is a challenging task. In this paper, we propose a Siamese transformer with hierarchical concept embedding (STrHCE), which contains two transformer subnetworks sharing all configurations, each equipped with hierarchical semantic information at a different concept level for fine-grained image embedding. In particular, one subnetwork operates on coarse-scale patches and learns the discriminative regions with the aid of the transformer's innate multi-head self-attention mechanism. The other subnetwork operates on finer-scale patches, adaptively sampled from the discriminative regions, to capture subtle yet discriminative visual cues and eliminate redundant information. STrHCE connects the two subnetworks through a score margin adjustor, which enforces that the most discriminative regions generate more confident predictions. Extensive experiments on four commonly used benchmark datasets, namely CUB-200-2011, FGVC-Aircraft, Stanford Dogs, and NABirds, empirically demonstrate the superiority of STrHCE over state-of-the-art baselines.
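
To make the pipeline concrete, the following is a minimal PyTorch sketch of the two-branch design summarized above. It is an illustration under stated assumptions, not the authors' implementation: a toy ViT encoder is shared, Siamese-style, by both branches; the coarse branch's CLS-to-patch self-attention selects a discriminative region; the fine branch re-encodes an upsampled crop of that region; and a hinge-style margin term stands in for the score margin adjustor. All names and hyperparameters here (TinyViT, crop_top_region, the 3x3-patch crop window, the 0.05 margin) are hypothetical, and the hierarchical concept embedding is omitted.

```python
# Minimal sketch: a toy ViT encoder shared by a coarse and a fine branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViT(nn.Module):
    """Toy ViT encoder that also returns the last layer's CLS-to-patch attention."""

    def __init__(self, img=224, patch=16, dim=192, heads=3, depth=4, classes=200):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                           nn.GELU(), nn.Linear(4 * dim, dim)) for _ in range(depth)])
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2)               # B x N x D
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], dim=1) + self.pos
        cls_attn = None
        for ln, attn, mlp in zip(self.norms, self.attns, self.mlps):
            h = ln(t)
            a, w = attn(h, h, h, need_weights=True)                # w: B x (N+1) x (N+1)
            t = t + a
            t = t + mlp(t)
            cls_attn = w[:, 0, 1:]                                 # CLS -> patch weights
        return self.head(t[:, 0]), cls_attn


def crop_top_region(x, cls_attn, patch=16, out=224):
    """Crop a 3x3-patch window around the most-attended patch and upsample it."""
    B, _, H, W = x.shape
    grid = H // patch
    idx = cls_attn.argmax(dim=1)
    crops = []
    for b in range(B):
        r = max(0, (idx[b].item() // grid) * patch - patch)
        c = max(0, (idx[b].item() % grid) * patch - patch)
        crop = x[b:b + 1, :, r:min(H, r + 3 * patch), c:min(W, c + 3 * patch)]
        crops.append(F.interpolate(crop, size=(out, out), mode="bilinear",
                                   align_corners=False))
    return torch.cat(crops, dim=0)


# One shared encoder ("Siamese") scores the full image and the sampled region;
# a hinge term asks the fine branch to be at least `margin` more confident.
encoder = TinyViT()
x, y = torch.randn(2, 3, 224, 224), torch.tensor([3, 7])
logits_coarse, cls_attn = encoder(x)
logits_fine, _ = encoder(crop_top_region(x, cls_attn))
p_coarse = F.softmax(logits_coarse, dim=1).gather(1, y[:, None]).squeeze(1)
p_fine = F.softmax(logits_fine, dim=1).gather(1, y[:, None]).squeeze(1)
margin = 0.05                                                      # hypothetical value
loss = (F.cross_entropy(logits_coarse, y) + F.cross_entropy(logits_fine, y)
        + F.relu(p_coarse - p_fine + margin).mean())
loss.backward()
```

Sharing a single encoder instance is what makes the two subnetworks "share all configurations"; in the actual method, the paper's adaptive patch sampler and hierarchy-level supervision would replace the naive argmax crop and single classification head used in this sketch.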

Acknowledgements

This work was partly supported by the National Key Research and Development Program of China (Grant No. 2020AAA0106800), the Beijing Natural Science Foundation (Grant Nos. Z180006, L211016), the National Natural Science Foundation of China (Grant No. 62176020), the CAAI-Huawei MindSpore Open Fund, and the Chinese Academy of Sciences (Grant No. OEIP-O-202004).

Author information

Corresponding author

Correspondence to Liping Jing.

About this article

Cite this article

Lyu, Y., Jing, L., Wang, J. et al. Siamese transformer with hierarchical concept embedding for fine-grained image recognition. Sci. China Inf. Sci. 66, 132107 (2023). https://doi.org/10.1007/s11432-022-3586-y
