Skip to main content

Discovering Human-Object Interaction Concepts via Self-Compositional Learning

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13687))

Included in the following conference series:

Abstract

A comprehensive understanding of human-object interaction (HOI) requires detecting not only a small portion of predefined HOI concepts (or categories) but also other reasonable HOI concepts, while current approaches usually fail to explore a huge portion of unknown HOI concepts (i.e., unknown but reasonable combinations of verbs and objects). In this paper, 1) we introduce a novel and challenging task for a comprehensive HOI understanding, which is termed as HOI Concept Discovery; and 2) we devise a self-compositional learning framework (or SCL) for HOI concept discovery. Specifically, we maintain an online updated concept confidence matrix during training: 1) we assign pseudo labels for all composite HOI instances according to the concept confidence matrix for self-training; and 2) we update the concept confidence matrix using the predictions of all composite HOI instances. Therefore, the proposed method enables the learning on both known and unknown HOI concepts. We perform extensive experiments on several popular HOI datasets to demonstrate the effectiveness of the proposed method for HOI concept discovery, object affordance recognition and HOI detection. For example, the proposed self-compositional learning framework significantly improves the performance of 1) HOI concept discovery by over 10% on HICO-DET and over 3% on V-COCO, respectively; 2) object affordance recognition by over 9% mAP on MS-COCO and HICO-DET; and 3) rare-first and non-rare-first unknown HOI detection relatively over 30% and 20%, respectively. Code is publicly available at https://github.com/zhihou7/HOI-CL.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 89.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Abadi, M., et al.: Tensorflow: a system for large-scale machine learning. In: 12th Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283 (2016)

    Google Scholar 

  2. Bansal, A., Rambhatla, S.S., Shrivastava, A., Chellappa, R.: Detecting human-object interactions via functional generalization. In: AAAI (2020)

    Google Scholar 

  3. Best, J.B.: Cognitive psychology. West Publishing Co (1986)

    Google Scholar 

  4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR, pp. 6299–6308 (2017)

    Google Scholar 

  5. Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 381–389. IEEE (2018)

    Google Scholar 

  6. Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: WACV, pp. 381–389. IEEE (2018)

    Google Scholar 

  7. Chao, Y.W., Wang, Z., He, Y., Wang, J., Deng, J.: HICO: a benchmark for recognizing human-object interactions in images. In: ICCV, pp. 1017–1025 (2015)

    Google Scholar 

  8. Chen, M., Liao, Y., Liu, S., Chen, Z., Wang, F., Qian, C.: Reformulating hoi detection as adaptive set prediction. In: CVPR, pp. 9004–9013 (2021)

    Google Scholar 

  9. Coren, S.: Sensation and perception. Handbook of psychology, pp. 85–108 (2003)

    Google Scholar 

  10. Dabral, R., Shimada, S., Jain, A., Theobalt, C., Golyanik, V.: Gravity-aware monocular 3d human-object reconstruction. In: ICCV, pp. 12365–12374 (2021)

    Google Scholar 

  11. Damen, D., et al.: Scaling egocentric vision: the dataset. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 753–771. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_44

    Chapter  Google Scholar 

  12. De Comité, F., Denis, F., Gilleron, R., Letouzey, F.: Positive and unlabeled examples help learning. In: Watanabe, O., Yokomori, T. (eds.) ALT 1999. LNCS (LNAI), vol. 1720, pp. 219–230. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-46769-6_18

    Chapter  Google Scholar 

  13. Deng, S., Xu, X., Wu, C., Chen, K., Jia, K.: 3D affordanceNet: a benchmark for visual object affordance understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1778–1787 (2021)

    Google Scholar 

  14. Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 213–220 (2008)

    Google Scholar 

  15. Fang, K., Wu, T.L., Yang, D., Savarese, S., Lim, J.J.: Demo2vec: reasoning object affordances from online videos. In: CVPR (2018)

    Google Scholar 

  16. Fouhey, D.F., Delaitre, V., Gupta, A., Efros, A.A., Laptev, I., Sivic, J.: People watching: human actions as a cue for single view geometry. IJCV 110, 259–274 (2014)

    Article  Google Scholar 

  17. Gao, C., Zou, Y., Huang, J.B.: iCAN: instance-centric attention network for human-object interaction detection. In: BMVC (2018)

    Google Scholar 

  18. Gibson, J.J.: The ecological approach to visual perception (1979)

    Google Scholar 

  19. Gibson, J.J.: The ecological approach to visual perception: classic edition. Psychology Press (2014)

    Google Scholar 

  20. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: ICML, pp. 1263–1272. PMLR (2017)

    Google Scholar 

  21. Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE PAMI 31(10), 1775–1789 (2009)

    Article  Google Scholar 

  22. Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)

  23. Hassan, M., Dharmaratne, A.: Attribute based affordance detection from human-object interaction images. In: Huang, F., Sugimoto, A. (eds.) PSIVT 2015. LNCS, vol. 9555, pp. 220–232. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30285-0_18

    Chapter  Google Scholar 

  24. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  25. Hou, Z., Peng, X., Qiao, Yu., Tao, D.: Visual compositional learning for human-object interaction detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 584–600. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_35

    Chapter  Google Scholar 

  26. Hou, Z., Yu, B., Qiao, Y., Peng, X., Tao, D.: Affordance transfer learning for human-object interaction detection. In: CVPR (2021)

    Google Scholar 

  27. Hou, Z., Yu, B., Qiao, Y., Peng, X., Tao, D.: Detecting human-object interaction via fabricated compositional learning. In: CVPR (2021)

    Google Scholar 

  28. Huynh, D., Elhamifar, E.: Interaction compass: Multi-label zero-shot learning of human-object interactions via spatial relations. In: ICCV, pp. 8472–8483 (2021)

    Google Scholar 

  29. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, pp. 448–456. PMLR (2015)

    Google Scholar 

  30. Ji, J., Desai, R., Niebles, J.C.: Detecting human-object relationships in videos. In: ICCV, pp. 8106–8116 (2021)

    Google Scholar 

  31. Kato, K., Li, Y., Gupta, A.: Compositional learning for human object interaction. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 247–264. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_15

    Chapter  Google Scholar 

  32. Kim, B., Lee, J., Kang, J., Kim, E.S., Kim, H.J.: HOTR: end-to-end human-object interaction detection with transformers. In: CVPR, pp. 74–83 (2021)

    Google Scholar 

  33. Kingma, D.P., Mohamed, S., Rezende, D.J., Welling, M.: Semi-supervised learning with deep generative models. In: NIPS (2014)

    Google Scholar 

  34. Kjellström, H., Romero, J., Kragić, D.: Visual object-action recognition: inferring object affordances from human demonstration. Comput. Vis. Image Underst. 115(1), 81–90 (2011)

    Article  Google Scholar 

  35. Lee, D.H., et al.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML (2013)

    Google Scholar 

  36. Li, Y.L., Liu, X., Wu, X., Li, Y., Lu, C.: Hoi analysis: integrating and decomposing human-object interaction. In: NeuIPS 33 (2020)

    Google Scholar 

  37. Li, Y.L., et al.: Transferable interactiveness prior for human-object interaction detection. In: CVPR (2019)

    Google Scholar 

  38. Liao, Y., Liu, S., Wang, F., Chen, Y., Feng, J.: PPDM: parallel point detection and matching for real-time human-object interaction detection. In: CVPR (2020)

    Google Scholar 

  39. Lin, T.-Y.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

    Chapter  Google Scholar 

  40. Materzynska, J., Xiao, T., Herzig, R., Xu, H., Wang, X., Darrell, T.: Something-else: Compositional action recognition with spatial-temporal interaction networks. In: CVPR, pp. 1049–1059 (2020)

    Google Scholar 

  41. Nagarajan, T., Grauman, K.: Learning affordance landscapes for interaction exploration in 3D environments. Adv. Neural. Inf. Process. Syst. 33, 2005–2015 (2020)

    Google Scholar 

  42. Nawhal, M., Zhai, M., Lehrmann, A., Sigal, L., Mori, G.: Generating videos of zero-shot compositions of actions and objects. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 382–401. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_23

    Chapter  Google Scholar 

  43. Norman, D.A.: The design of everyday things. Basic Books Inc, USA (2002)

    Google Scholar 

  44. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) NeurIPS, pp. 8024–8035. Curran Associates, Inc. (2019). https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

  45. Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Detecting unseen visual relations using analogies. In: ICCV (2019)

    Google Scholar 

  46. Scott, C., Blanchard, G.: Novelty detection: Unlabeled data definitely help. In: Artificial intelligence and statistics, pp. 464–471. PMLR (2009)

    Google Scholar 

  47. Scudder, H.: Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inf. Theory 11(3), 363–371 (1965)

    Article  MathSciNet  MATH  Google Scholar 

  48. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV (2017)

    Google Scholar 

  49. Shao, S., Li, Z., Zhang, T., Peng, C., Sun, J.: Objects365: a large-scale, high-quality dataset for object detection. In: ICCV (2019)

    Google Scholar 

  50. Shen, L., Yeung, S., Hoffman, J., Mori, G., Fei-Fei, L.: Scaling human-object interaction recognition through zero-shot learning. In: WACV, pp. 1568–1576. IEEE (2018)

    Google Scholar 

  51. Springenberg, J.T.: Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390 (2015)

  52. Tamura, M., Ohashi, H., Yoshinaga, T.: QPIC: Query-based pairwise human-object interaction detection with image-wide contextual information. In: CVPR (2021)

    Google Scholar 

  53. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. In: ICLR (2017)

    Google Scholar 

  54. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 1225–1234 (2016)

    Google Scholar 

  55. Wang, S., Yap, K.H., Ding, H., Wu, J., Yuan, J., Tan, Y.P.: Discovering human interactions with large-vocabulary objects via query and multi-scale detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13475–13484 (2021)

    Google Scholar 

  56. Wang, S., Yap, K.H., Yuan, J., Tan, Y.P.: Discovering human interactions with novel objects via zero-shot learning. In: CVPR, pp. 11652–11661 (2020)

    Google Scholar 

  57. Xie, Q., Luong, M.T., Hovy, E., Le, Q.V.: Self-training with noisy student improves imageNet classification. In: CVPR, pp. 10687–10698 (2020)

    Google Scholar 

  58. Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419 (2017)

    Google Scholar 

  59. Yang, X., Song, Z., King, I., Xu, Z.: A survey on deep semi-supervised learning. arXiv preprint arXiv:2103.00550 (2021)

  60. Yao, B., Ma, J., Li, F.F.: Discovering object functionality. In: ICCV (2013)

    Google Scholar 

  61. Zhai, W., Luo, H., Zhang, J., Cao, Y., Tao, D.: One-shot object affordance detection in the wild. arXiv preprint arXiv:2108.03658 (2021)

  62. Zhang, A., et al.: Mining the benefits of two-stage and one-stage hoi detection. In: Advances in Neural Information Processing Systems, vol. 34 (2021)

    Google Scholar 

  63. Zhang, J.Y., Pepose, S., Joo, H., Ramanan, D., Malik, J., Kanazawa, A.: Perceiving 3D human-object spatial arrangements from a single image in the wild. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 34–51. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_3

    Chapter  Google Scholar 

  64. Zheng, S., Chen, S., Jin, Q.: Skeleton-based interactive graph network for human object interaction detection. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2020)

    Google Scholar 

  65. Zhong, X., Ding, C., Qu, X., Tao, D.: Polysemy deciphering network for human-object interaction detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 69–85. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_5

    Chapter  Google Scholar 

  66. Zhu, Y., Fathi, A., Fei-Fei, L.: Reasoning about object affordances in a knowledge base representation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 408–424. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_27

    Chapter  Google Scholar 

  67. Zou, C., et al.: End-to-end human object interaction detection with hoi transformer. In: CVPR, pp. 11825–11834 (2021)

    Google Scholar 

Download references

Acknowledgments

Mr. Zhi Hou and Dr. Baosheng Yu are supported by ARC FL-170100117, DP-180103424, IC-190100031, and LE-200100049.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Baosheng Yu .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2276 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Hou, Z., Yu, B., Tao, D. (2022). Discovering Human-Object Interaction Concepts via Self-Compositional Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13687. Springer, Cham. https://doi.org/10.1007/978-3-031-19812-0_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19812-0_27

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19811-3

  • Online ISBN: 978-3-031-19812-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics