Human-object interaction detection based on cascade multi-scale transformer

Applied Intelligence

Abstract

Human-object interaction (HOI) detection is an advanced computer vision task that detects the relationships between humans and surrounding objects. Several methods have achieved impressive results on this task, but they have certain limitations. We analyze in detail the advantages and disadvantages of the different paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder, which are responsible for extracting contextual features, localizing human-object pairs, and classifying the specific interactions within each pair, respectively. CMST decouples the tasks of object detection and interaction classification while still maintaining an end-to-end detection pipeline. Furthermore, to address the high computational complexity and slow convergence associated with the transformer architecture, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. These attentions introduce multi-scale features, making our model well suited to complex scenes involving instances of varying scales. We demonstrate the effectiveness of our approach on widely used benchmarks, where it achieves improved results. The experimental results indicate that CMST has great potential for real-time applications and complex scene detection.
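
The cascade described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: plain single-head dot-product attention stands in for the multi-scale human-object pair attention and multi-scale interaction attention, whose details are not given here, and every function name (shared_encoder, pair_decoder, interaction_decoder, cmst_forward) is hypothetical. The sketch only shows how one encoded memory can be shared by a pair decoder whose outputs then serve as queries for an interaction decoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head dot-product attention: each query attends over all memory tokens.

    Assumption: this is a stand-in for the paper's multi-scale attentions.
    """
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * k[i] for w, k in zip(weights, memory)) for i in range(d)])
    return outputs


def shared_encoder(multi_scale_features):
    """Hypothetical encoder stub: flatten tokens from every scale into one sequence."""
    return [token for level in multi_scale_features for token in level]


def pair_decoder(pair_queries, memory):
    """Stage 1: pair queries gather context used to localize human-object pairs."""
    return cross_attention(pair_queries, memory)


def interaction_decoder(pair_embeddings, memory):
    """Stage 2: pair embeddings act as queries for interaction classification."""
    return cross_attention(pair_embeddings, memory)


def cmst_forward(multi_scale_features, pair_queries):
    """Cascade: shared memory -> pair embeddings -> interaction embeddings."""
    memory = shared_encoder(multi_scale_features)
    pair_embeddings = pair_decoder(pair_queries, memory)
    interaction_embeddings = interaction_decoder(pair_embeddings, memory)
    return pair_embeddings, interaction_embeddings


random.seed(0)
dim = 4
# Two feature levels with 6 and 2 tokens, mimicking a fine and a coarse scale.
features = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)] for n in (6, 2)]
queries = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
pairs, interactions = cmst_forward(features, queries)
print(len(pairs), len(interactions))
```

In a real detector, each pair embedding would feed box-regression and object-class heads, and each interaction embedding would feed an interaction classifier; those heads are omitted here.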

Availability of data and materials

The datasets analyzed during the current study are available in the following public domain resources: http://websites.umich.edu/ywchao/hico/. https://github.com/s-gupta/v-coco.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 51678075) and the Science and Technology Project of Hunan (Grant No. 2017GK2271).

Author information

Contributions

All authors have contributed equally.

Corresponding author

Correspondence to Xiaoyue Ding.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xia, L., Ding, X. Human-object interaction detection based on cascade multi-scale transformer. Appl Intell 54, 2831–2850 (2024). https://doi.org/10.1007/s10489-024-05324-1

  • DOI: https://doi.org/10.1007/s10489-024-05324-1
