Contextual Heterogeneous Graph Network for Human-Object Interaction Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)


Human-object interaction (HOI) detection is an important task for understanding human activity. Graph structure is appropriate to denote the HOIs in the scene. Since there is an subordination between human and object—human play subjective role and object play objective role in HOI, the relations between homogeneous entities and heterogeneous entities in the scene should also not be equally the same. However, previous graph models regard human and object as the same kind of nodes and do not consider that the messages are not equally the same between different entities. In this work, we address such a problem for HOI task by proposing a heterogeneous graph network that models humans and objects as different kinds of nodes and incorporates intra-class messages between homogeneous nodes and inter-class messages between heterogeneous nodes. In addition, a graph attention mechanism based on the intra-class context and inter-class context is exploited to improve the learning. Extensive experiments on the benchmark datasets V-COCO and HICO-DET verify the effectiveness of our method and demonstrate the importance to extract intra-class and inter-class messages which are not equally the same in HOI detection.


Human-object interaction Heterogeneous graph Neural network 



This work was supported partially by the National Key Research and Development Program of China (2018YFB1004903), NSFC (U1911401, U1811461), Guangdong Province Science and Technology Innovation Leading Talents (2016TX03X157), Guangdong NSF Project (No. 2018B030312002), Guangzhou Research Project (201902010037), and Research Projects of Zhejiang Lab (No. 2019KD0AB03).

Supplementary material

504472_1_En_15_MOESM1_ESM.pdf (64 kb)
Supplementary material 1 (pdf 64 KB)


  1. 1.
    Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR, pp. 7291–7299 (2017)Google Scholar
  2. 2.
    Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: WACV, pp. 381–389 (2018)Google Scholar
  3. 3.
    Chao, Y.W., Wang, Z., He, Y., Wang, J., Deng, J.: HICO: a benchmark for recognizing human-object interactions in images. In: ICCV, pp. 1017–1025 (2015)Google Scholar
  4. 4.
    Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-context attention for human pose estimation. In: CVPR, pp. 1831–1840 (2017)Google Scholar
  5. 5.
    Delaitre, V., Sivic, J., Laptev, I.: Learning person-object interactions for action recognition in still images. In: NIPS, pp. 1503–1511 (2011)Google Scholar
  6. 6.
    Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for static human-object interactions. In: Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 9–16 (2010)Google Scholar
  7. 7.
    Feng, W., Liu, W., Li, T., Peng, J., Qian, C., Hu, X.: Turbo learning framework for human-object interactions recognition and human pose estimation. arXiv preprint arXiv:1903.06355 (2019)
  8. 8.
    Gao, C., Zou, Y., Huang, J.B.: iCAN: instance-centric attention network for human-object interaction detection. arXiv preprint arXiv:1808.10437 (2018)
  9. 9.
    Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: ICML, pp. 1263–1272 (2017)Google Scholar
  10. 10.
    Girshick, R.: Fast R-CNN. In: ICCV, pp. 1440–1448 (2015)Google Scholar
  11. 11.
    Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. In: CVPR, pp. 8359–8367 (2018)Google Scholar
  12. 12.
    Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: using spatial and functional compatibility for recognition. PAMI 31(10), 1775–1789 (2009)CrossRefGoogle Scholar
  13. 13.
    Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)
  14. 14.
    Gupta, T., Schwing, A., Hoiem, D.: No-frills human-object interaction detection: factorization, layout encodings, and training techniques. In: ICCV, pp. 9677–9685 (2019)Google Scholar
  15. 15.
    Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS, pp. 1024–1034 (2017)Google Scholar
  16. 16.
    Hu, J.F., Zheng, W.S., Lai, J., Gong, S., Xiang, T.: Recognising human-object interaction via exemplar based modelling. In: ICCV, pp. 3144–3151 (2013)Google Scholar
  17. 17.
    Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  18. 18.
    Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123(1), 32–73 (2017)MathSciNetCrossRefGoogle Scholar
  19. 19.
    Lee, C.W., Fang, W., Yeh, C.K., Frank Wang, Y.C.: Multi-label zero-shot learning with structured knowledge graphs. In: CVPR, pp. 1576–1585 (2018)Google Scholar
  20. 20.
    Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: CVPR, pp. 3595–3603 (2019)Google Scholar
  21. 21.
    Li, Y.L., Zhou, S., Huang, X., Xu, L., Ma, Z., Fang, H.S., Wang, Y.F., Lu, C.: Transferable interactiveness prior for human-object interaction detection. arXiv preprint arXiv:1811.08264 (2018)
  22. 22.
    Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR, pp. 2117–2125 (2017)Google Scholar
  23. 23.
    Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). Scholar
  24. 24.
    Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016). Scholar
  25. 25.
    Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Detecting unseen visual relations using analogies. In: ICCV, pp. 1981–1990 (2019)Google Scholar
  26. 26.
    Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans and objects. PAMI 34(3), 601–614 (2011)CrossRefGoogle Scholar
  27. 27.
    Qi, S., Wang, W., Jia, B., Shen, J., Zhu, S.-C.: Learning human-object interactions by graph parsing neural networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 407–423. Springer, Cham (2018). Scholar
  28. 28.
    Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)Google Scholar
  29. 29.
    Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: Computational capabilities of graph neural networks. TNN 20(1), 81–102 (2008)Google Scholar
  30. 30.
    Shen, L., Yeung, S., Hoffman, J., Mori, G., Fei-Fei, L.: Scaling human-object interaction recognition through zero-shot learning. In: WACV, pp. 1568–1576 (2018)Google Scholar
  31. 31.
    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
  32. 32.
    Wan, B., Zhou, D., Liu, Y., Li, R., He, X.: Pose-aware multi-level feature network for human object interaction detection. In: ICCV, pp. 9469–9478 (2019)Google Scholar
  33. 33.
    Wang, T., et al.: Deep contextual attention for human-object interaction detection. arXiv preprint arXiv:1910.07721 (2019)
  34. 34.
    Wang, X., Ye, Y., Gupta, A.: Zero-shot recognition via semantic embeddings and knowledge graphs. In: CVPR, pp. 6857–6866 (2018)Google Scholar
  35. 35.
    Xu, B., Wong, Y., Li, J., Zhao, Q., Kankanhalli, M.S.: Learning to detect human-object interactions with knowledge. In: CVPR, June 2019Google Scholar
  36. 36.
    Yao, B., Fei-Fei, L.: Grouplet: a structured image representation for recognizing human and object interactions. In: Computer Society Conference on Computer Vision and Pattern Recognition, pp. 9–16 (2010)Google Scholar
  37. 37.
    Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: Computer Society Conference on Computer Vision and Pattern Recognition, pp. 17–24 (2010)Google Scholar
  38. 38.
    Yu, W., Zhou, J., Yu, W., Liang, X., Xiao, N.: Heterogeneous graph learning for visual commonsense reasoning. In: NIPS, pp. 2765–2775 (2019)Google Scholar
  39. 39.
    Zhou, P., Chi, M.: Relation parsing neural network for human-object interaction detection. In: ICCV, pp. 843–851 (2019)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.School of Data and Computer ScienceSun Yat-sen UniversityGuangzhouChina
  2. 2.Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of EducationGuangzhouChina
  3. 3.Peng Cheng LaboratoryShenzhenChina

Personalised recommendations