Abstract
Human-object interaction (HOI) detection is a computer vision task that detects the relationships between humans and surrounding objects. Existing methods achieve impressive results but have certain limitations. We analyze the advantages and disadvantages of the different paradigms in detail and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder, which are responsible for extracting contextual features, localizing human-object pairs, and classifying the interactions within each pair, respectively. CMST decouples object detection from interaction classification while still maintaining an end-to-end detection pipeline. Furthermore, to address the high computational complexity and slow convergence associated with the transformer architecture, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. These attention mechanisms introduce multi-scale features, making our model well suited to complex scenes involving instances of varying scales. We demonstrate the effectiveness of our approach on widely used benchmarks, where it improves on existing methods. The experimental results indicate that CMST has great potential for real-time applications and complex scene detection.
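The cascade described in the abstract can be illustrated with a minimal data-flow sketch. It is not the authors' implementation: the function names (`shared_encoder`, `pair_decoder`, `interaction_decoder`, `cmst_forward`), the feature dimension, the three scales, and the number of queries are all hypothetical. It also assumes plain dense dot-product attention as a stand-in for the proposed multi-scale attention mechanisms, so it does not reproduce their reduced computational cost. It only shows how one shared multi-scale memory could feed two decoders in sequence, with the output of the pair decoder reused as the queries of the interaction decoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Dense scaled dot-product attention (an assumed stand-in, not the
    paper's attention): each query becomes a weighted mean of memory tokens."""
    dim = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim)
                  for tok in memory]
        weights = softmax(scores)
        out.append([sum(w * tok[i] for w, tok in zip(weights, memory))
                    for i in range(dim)])
    return out


def shared_encoder(feature_maps):
    """Stand-in encoder: flatten the token lists of all scales into one memory."""
    return [tok for fmap in feature_maps for tok in fmap]


def pair_decoder(pair_queries, memory):
    """Stage 1: queries gather context for localizing human-object pairs."""
    return attend(pair_queries, memory)


def interaction_decoder(pair_embeddings, memory):
    """Stage 2: stage-1 outputs are reused as queries for interaction features."""
    return attend(pair_embeddings, memory)


def cmst_forward(feature_maps, pair_queries):
    """Cascade: shared encoder, then pair decoder, then interaction decoder."""
    memory = shared_encoder(feature_maps)
    pair_emb = pair_decoder(pair_queries, memory)
    inter_emb = interaction_decoder(pair_emb, memory)
    return pair_emb, inter_emb


rng = random.Random(0)
DIM = 8  # hypothetical feature dimension
# Three hypothetical scales holding 16, 4 and 1 tokens each.
feature_maps = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]
                for n in (16, 4, 1)]
# Five hypothetical human-object pair queries.
pair_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

pair_emb, inter_emb = cmst_forward(feature_maps, pair_queries)
print(len(pair_emb), len(inter_emb), len(inter_emb[0]))  # prints "5 5 8"
```

In a full model, prediction heads would sit on top of each stage (boxes on `pair_emb`, interaction classes on `inter_emb`); they are omitted here because the abstract does not specify them.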
Availability of data and materials
The datasets analyzed during the current study are publicly available at http://websites.umich.edu/ywchao/hico/ and https://github.com/s-gupta/v-coco.
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 51678075) and the Science and Technology Project of Hunan (Grant No. 2017GK2271).
Author information
Contributions
All authors have contributed equally.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xia, L., Ding, X. Human-object interaction detection based on cascade multi-scale transformer. Appl Intell 54, 2831–2850 (2024). https://doi.org/10.1007/s10489-024-05324-1
DOI: https://doi.org/10.1007/s10489-024-05324-1