Abstract
It has become popular to learn the correlation between labels in most existing multi-label image recognition tasks. Existing approaches begin to construct a label graph to learn the label dependencies but they suffer from a low convergence efficiency when fusing image features and label embeddings, and also limit the performance improvement on multi-label images. To overcome this challenge, we propose a l abel g raph l earning m odel (termed as LGLM) for multi-label image recognition, which integrates a multi-modal fusion component to efficiently fuse cross-modal embeddings. First, LGLM uses convolution neural network to learn the feature for each image. Second, LGLM first constructs a label graph according to the word vector of each object and then adopts graph convolution network to learn the label correlations to generate label co-occurrence embeddings. Finally, the multi-modal fusion component efficiently fuses image features and label co-occurrence embeddings to generate an end-to-end image recognition model. We conduct extensive experiments on MS-COCO and FLICKR25K and the experimental results demonstrate the superiority of LGLM compared with the state-of-the-art image recognition methods. The code of LGLM has been released on GitHub: https://github.com/lzHZWZ/LGLM.
Similar content being viewed by others
References
Chen S-F, Chen Y-C, Yeh C-K, Wang Y-CF (2018) Order-free RNN with visual attention for multi-label classification, proceedings of the thirty-second AAAI conference on artificial intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 6714–6721 AAAI Press
Chen T, Xu M, Hui X, Wu H, Lin L (2019) Learning semantic-specific graph representation for multi-label image recognition, 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), october 27 - november 2, 522–531. IEEE
Chen Z-M, Wei X-S, Wang P, Guo Y (2019) Multi-Label image recognition with graph convolutional networks, IEEE Conference on computer vision and pattern recognition, CVPR, Long beach, CA, USA, June 16-20, 5177–5186. IEEE Computer Vision Foundation
Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering, advances in neural information processing systems 29: annual conference on neural information processing systems 2016, December 5-10, Barcelona, Spain, 3837–3845
Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding, proceedings of the 2016 conference on empirical methods in natural language processing, EMNLP 2016, austin, texas, USA, November 1-4, 457–468. The Association for Computational Linguistics
Ge W, Yang S, Yizhou Y (2018) Multi-Evidence Filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning, 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 1277–1286. IEEE Computer Society
Ghamrawi N, McCallum A (2005) Collective multi-label classification, Proceedings of the 2005 ACM CIKM International conference on information and knowledge management, Bremen, Germany, October 31 - November 5, 195–200. ACM
Gong Y, Jia Y, Leung T, Toshev A, Ioffe S (2014) Deep convolutional ranking for multilabel image annotation, 2nd International conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14-16. Conference Track Proceedings
Guo Y, Suicheng G (2011) Multi-label classification using conditional dependency networks, IJCAI 2011, Proceedings of the 22nd International joint conference on artificial intelligence, Barcelona, Catalonia, Spain, July 16-22, 1300-1305. IJCAI/AAAI
Guo H, Zheng K, Fan X, Hongkai Y, Wang S (2019) Visual attention consistency under image transforms for multi-label image classification, IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 729–739. IEEE Computer Society
Huang F, Zhang X, Zhao Z, Jie X, Li Z (2019) Image-text sentiment analysis via deep multimodal attentive fusion. Knowl Based Syst 167:26–37
Huang F, Zhang X, Jie X, Zhao Z, Li Z (2021) Multimodal learning of social image representation by exploiting social relations. IEEE Trans Cybern 51(3):1506–1518
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition, 2016 IEEE conference on computer vision and pattern recognition, CVPR las vegas, NV, USA, June 27-30, 770–778. IEEE Computer Society
He T, Jin X (2019) Image emotion distribution learning with graph convolutional networks, Proceedings of the 2019 on International Conference on Multimedia Retrieval, ICMR 2019, Ottawa, ON, Canada, June 10-13, 392–390. ACM
Huiskes MJ, Lew MS (2008) The MIR flickr retrieval evaluation, Proceedings of the 1st ACM SIGMM International conference on multimedia information retrieval, MIR 2008, Vancouver, British Columbia, Canada, October 30-31, 39–43. ACM
Inoue N, Simo-Serra E, Yamasaki T, Ishikawa H (2017) Multi-label fashion image classification with minimal human supervision, 2017 IEEE International conference on computer vision workshops, ICCV Workshops. Venice, italy, october 22-29, 2261–2267. IEEE Computer Society
Johnson J, Gupta A, Li F-F (2018) image generation from scene graphs, 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 1219–1228. IEEE Computer Society
Kim J-H, On KW, Lim W, Kim J, Ha J-W, Zhang B-T (2017) Hadamard product for low-rank bilinear pooling, 5th International Conference on Learning Representations, ICLR 2017, Toulon, france, april 24-26, conference track proceedings. OpenReview.net
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks, 5th International conference on learning representations, ICLR 2017, Toulon, france, april 24-26, conference track proceedings. OpenReview.net
Lee C-W, Fang W, Yeh C-K, Wang Y-CF (2018) Multi-label zero-shot learning with structured knowledge graphs, 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 1576–1585. IEEE Computer Society
Li J, Huang C, Loy CC, Tang X (2016) Human attribute recognition by deep hierarchical contexts, computer vision - ECCV 2016 - 14th European Conference, Amsterdam, The netherlands, october 11-14, proceedings, Part VI 684–700. Springer
Li Q, Peng X, Qiao Y, Peng Q (2020) Learning label correlations for multi-label image recognition with graph networks. Pattern Recognit Lett 138:378–384
Lin T-Y, Maire M, Belongie SJ, Hays J, Perona P, Ramanan D, Dollȧr P, Lawrence Zitnick C (2014) Microsoft coco: common objects in context, computer vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part v, 740–755. Springer
Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input, advances in neural information processing systems 27: annual conference on neural information processing systems 2014, december 8-13, montreal, quebec, Canada, 1682–1690
Molchanov P, Tyree S, Karras T, Aila T, Kautz J (2017) Pruning Convolutional Neural Networks for Resource Efficient Inference, 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference track proceedings. OpenReview.net
Monti F, Boscaini D, Masci J, Rodolȧ E, Svoboda J, Bronstein MM (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs, 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 5425–5434. IEEE Computer Society
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation, Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP 2014, October 25-29, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 1532–1543 ACL
Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition, IEEE conference on computer vision and pattern recognition, CVPR Workshops 2014, Columbus, OH, USA, June 23-28, 512–519 IEEE Computer Society
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein MS, Berg AC, Li F-F (2015) ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis 115(3):211–252
Shao J, Kang K, Loy CC, Wang X (2015) Deeply learned attributes for crowded scene understanding, IEEE conference on computer vision and pattern recognition, CVPR Boston, MA, USA, June 7-12, 4657–4666. IEEE Computer Society
Wei Y, Xia W, Lin M, Huang J, Ni B, Dong J, Zhao Y, Yan S (2016) HCP: A flexible CNN framework for Multi-Label image classification. IEEE Trans Pattern Anal Mach Intell 38(9):1901–1907
Wang J, Yi Y, Mao J, Huang Z, Huang C, Wei X (2016) CNN-RNN: a unified framework for multi-label image classification, 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2285–2294 IEEE Computer Society
Wang Z, Chen T, Li G, Xu R, Lin L (2017) Multi-label image recognition by recurrently discovering attentional regions, IEEE International conference on computer vision, ICCV, Venice, italy, october 22-29, 464–474. IEEE Computer Society
Ye J, He J, Peng X, Wu W, Qiao Y (2020) Attention-driven dynamic graph convolutional network for multi-label image recognition, computer vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, part XXI, 649–665. Springer
Zhu F, Li H, Ouyang W, Nenghai Y, Wang X (2017) Learning spatial regularization with image-level supervisions for multi-label image classification, 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2027–2036. IEEE Computer Society
Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015) Simple baseline for visual question answering. arXiv:1512.02167
Zhou Y, Jun Y, Xiang C, Fan J, Tao D (2018) Beyond bilinear: generalized multimodal factorized High-Order pooling for visual question answering. IEEE Trans Neural Networks Learn Syst 29(12):5947–5959
Acknowledgements
Thanks for the support of the Innovation Group Project of the National Natural Science Foundation of China No.61821003 and the National Natural Science Foundation of China No.61902135.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Xie, Y., Wang, Y., Liu, Y. et al. Label graph learning for multi-label image recognition with cross-modal fusion. Multimed Tools Appl 81, 25363–25381 (2022). https://doi.org/10.1007/s11042-022-12397-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-12397-y