Human-object interaction detection based on cascade multi-scale transformer

Xia, Limin; Ding, Xiaoyue

doi:10.1007/s10489-024-05324-1

Published: 16 February 2024

Applied Intelligence, Volume 54, pages 2831–2850 (2024)

Limin Xia & Xiaoyue Ding

Abstract

Human-object interaction (HOI) detection is an advanced computer vision task that detects the relationships between humans and surrounding objects. Existing methods achieve impressive results on this task but have certain limitations. We analyze in detail the advantages and disadvantages of the different paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder. These components are responsible for extracting contextual features, localizing the human-object pairs, and classifying the specific interactions within each pair, respectively. CMST decouples the tasks of object detection and interaction classification while still maintaining an end-to-end detection pipeline. Furthermore, we aim to address the high computational complexity and slow convergence associated with the transformer architecture. To this end, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. These attention mechanisms introduce multi-scale features, making our model well suited to complex scenes involving instances of varying scales. We demonstrate the effectiveness of our approach on widely used benchmarks, where it achieves improved results. The experimental results demonstrate that CMST has great potential for real-time applications and complex scene detection.
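The cascade described in the abstract (shared encoder, then human-object pair decoder, then interaction decoder) can be illustrated with a minimal data-flow sketch. This is not the authors' implementation: the full text is not available here, so every name (`cascade_forward`, `cross_attention`), every dimension, the random weights, and the sigmoid output heads are assumptions made for illustration. In particular, plain dense dot-product attention stands in for the paper's multi-scale human-object pair attention and multi-scale interaction attention, whose exact form the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention of queries over memory.

    Assumption: a dense stand-in for the paper's multi-scale attentions.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ memory.T / np.sqrt(d))
    return weights @ memory


def cascade_forward(feature_maps, num_queries=4, num_verbs=5, seed=0):
    """Hypothetical cascade: shared encoder -> pair decoder -> interaction decoder.

    feature_maps: list of arrays shaped (H_i, W_i, d), one per scale.
    Weights are random; only the data flow is meaningful.
    """
    rng = np.random.default_rng(seed)
    d = feature_maps[0].shape[-1]

    # Shared encoder stand-in: flatten all scales into one token sequence
    # and apply one residual self-attention step to mix context.
    tokens = np.concatenate([f.reshape(-1, d) for f in feature_maps], axis=0)
    memory = tokens + cross_attention(tokens, tokens)

    # Stage 1, pair decoder: queries attend to the shared memory and are
    # mapped to a human box and an object box (4 + 4 normalized coordinates).
    pair_queries = rng.standard_normal((num_queries, d))
    pair_embed = pair_queries + cross_attention(pair_queries, memory)
    boxes = sigmoid(pair_embed @ (0.1 * rng.standard_normal((d, 8))))

    # Stage 2, interaction decoder: starts from the stage-1 pair embeddings
    # (the cascade) and predicts multi-label interaction scores per pair.
    inter_embed = pair_embed + cross_attention(pair_embed, memory)
    verb_scores = sigmoid(inter_embed @ (0.1 * rng.standard_normal((d, num_verbs))))

    return boxes, verb_scores


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    maps = [rng.standard_normal((s, s, 16)) for s in (8, 4, 2)]  # three scales
    boxes, verbs = cascade_forward(maps)
    print(boxes.shape, verbs.shape)
```

The point of the sketch is the decoupling: localization is produced by the first decoder, classification by the second, and both read the same encoder memory, so the whole pipeline remains a single forward pass.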

Availability of data and materials

The datasets analyzed during the current study are available in the following public domain resources: http://websites.umich.edu/ywchao/hico/ and https://github.com/s-gupta/v-coco.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 51678075) and the Science and Technology Project of Hunan (Grant No. 2017GK2271).

Author information

Limin Xia and Xiaoyue Ding contributed equally to this work.

Authors and Affiliations

School of Automation, Central South University, Changsha, 410083, China
Limin Xia & Xiaoyue Ding

Contributions

All authors have contributed equally.

Corresponding author

Correspondence to Xiaoyue Ding.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xia, L., Ding, X. Human-object interaction detection based on cascade multi-scale transformer. Appl Intell 54, 2831–2850 (2024). https://doi.org/10.1007/s10489-024-05324-1

Accepted: 06 February 2024
Published: 16 February 2024
Issue Date: February 2024
DOI: https://doi.org/10.1007/s10489-024-05324-1
