Label graph learning for multi-label image recognition with cross-modal fusion

Xie, Yanzhao; Wang, Yangtao; Liu, Yu; Zhou, Ke

doi:10.1007/s11042-022-12397-y

Published: 23 March 2022

Multimedia Tools and Applications, Volume 81, pages 25363–25381 (2022)

Yanzhao Xie¹, Yangtao Wang² (ORCID: orcid.org/0000-0003-4605-9270), Yu Liu¹ & Ke Zhou¹

Abstract

Learning the correlations between labels has become popular in multi-label image recognition. Existing approaches construct a label graph to learn label dependencies, but they converge slowly when fusing image features with label embeddings, and they also limit the performance improvement on multi-label images. To overcome this challenge, we propose a label graph learning model (termed LGLM) for multi-label image recognition, which integrates a multi-modal fusion component to efficiently fuse cross-modal embeddings. First, LGLM uses a convolutional neural network to learn the features of each image. Second, LGLM constructs a label graph from the word vector of each object and then adopts a graph convolutional network to learn the label correlations and generate label co-occurrence embeddings. Finally, the multi-modal fusion component efficiently fuses the image features and label co-occurrence embeddings to produce an end-to-end image recognition model. We conduct extensive experiments on MS-COCO and FLICKR25K, and the experimental results demonstrate the superiority of LGLM over state-of-the-art image recognition methods. The code of LGLM has been released on GitHub: https://github.com/lzHZWZ/LGLM.
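
The three stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released LGLM code. Several choices here are assumptions made only for illustration: the label graph is built by thresholding cosine similarity between word vectors, the graph convolution uses the common symmetric normalization with self-loops, the fusion step is a simple projected Hadamard product, and the image feature and all weights are random placeholders standing in for a trained CNN and learned parameters. All function and variable names are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)            # >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, h, w):
    """One graph convolution step: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ h @ w, 0.0)


def fuse(image_feat, label_emb, w_img, w_lab):
    """Assumed fusion: project both modalities to a shared space,
    multiply element-wise, and sum to get one score per label."""
    img = image_feat @ w_img           # shape (k,)
    lab = label_emb @ w_lab            # shape (num_labels, k)
    return (lab * img).sum(axis=1)     # shape (num_labels,)


rng = np.random.default_rng(0)
num_labels, word_dim, feat_dim, hidden, k = 4, 6, 8, 5, 3

# Placeholder word vectors, one row per label.
word_vecs = rng.normal(size=(num_labels, word_dim))

# Assumed graph construction: connect labels with positive cosine similarity.
unit = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
adj = (unit @ unit.T > 0.0).astype(float)
np.fill_diagonal(adj, 0.0)
a_norm = normalize_adjacency(adj)

# Label co-occurrence embeddings from one GCN layer (random weights).
label_emb = gcn_layer(a_norm, word_vecs, 0.1 * rng.normal(size=(word_dim, hidden)))

# Placeholder for the CNN feature of a single image.
image_feat = rng.normal(size=feat_dim)

scores = fuse(
    image_feat,
    label_emb,
    0.1 * rng.normal(size=(feat_dim, k)),
    0.1 * rng.normal(size=(hidden, k)),
)
probs = 1.0 / (1.0 + np.exp(-scores))  # independent per-label probabilities
```

In a trained model of this kind, the weights would be learned jointly, typically with a per-label binary loss, so that `probs` gives one independent probability per label.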

Acknowledgements

This work was supported by the Innovation Group Project of the National Natural Science Foundation of China (No. 61821003) and the National Natural Science Foundation of China (No. 61902135).

Author information

Authors and Affiliations

Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, China
Yanzhao Xie, Yu Liu & Ke Zhou
School of Computer Science and Cyber Engineering, Guangzhou University, 230 Wai Huan Xi Road, Guangzhou, China
Yangtao Wang

Corresponding author

Correspondence to Yangtao Wang.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xie, Y., Wang, Y., Liu, Y. et al. Label graph learning for multi-label image recognition with cross-modal fusion. Multimed Tools Appl 81, 25363–25381 (2022). https://doi.org/10.1007/s11042-022-12397-y

Received: 06 April 2021
Revised: 04 January 2022
Accepted: 25 January 2022
Published: 23 March 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s11042-022-12397-y
