Abstract
A comprehensive understanding of human-object interaction (HOI) requires detecting not only a small set of predefined HOI concepts (or categories) but also other reasonable HOI concepts, yet current approaches usually fail to explore the vast space of unknown HOI concepts (i.e., unknown but reasonable combinations of verbs and objects). In this paper, 1) we introduce a novel and challenging task for comprehensive HOI understanding, termed HOI Concept Discovery; and 2) we devise a self-compositional learning (SCL) framework for HOI concept discovery. Specifically, we maintain a concept confidence matrix that is updated online during training: 1) we assign pseudo-labels to all composite HOI instances according to the concept confidence matrix for self-training; and 2) we update the concept confidence matrix using the predictions on all composite HOI instances. The proposed method thus enables learning on both known and unknown HOI concepts. We conduct extensive experiments on several popular HOI datasets to demonstrate the effectiveness of the proposed method for HOI concept discovery, object affordance recognition, and HOI detection. For example, the proposed self-compositional learning framework significantly improves the performance of 1) HOI concept discovery, by over 10% on HICO-DET and over 3% on V-COCO; 2) object affordance recognition, by over 9% mAP on MS-COCO and HICO-DET; and 3) rare-first and non-rare-first unknown HOI detection, by relatively more than 30% and 20%, respectively. Code is publicly available at https://github.com/zhihou7/HOI-CL.
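To make the self-training loop concrete, below is a minimal sketch of how a concept confidence matrix could drive pseudo-labeling over composite HOI instances. It is written in PyTorch under our own assumptions: the names (`scl_step`, `verb_clf`, `concept_conf`), the EMA update rule, and the confidence threshold are illustrative and do not reproduce the authors' implementation (see the repository linked above for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, O, D = 117, 80, 256  # verbs, objects, feature dim (117 verbs / 80 objects as in HICO-DET)

# Concept confidence matrix over all verb-object combinations, updated online.
concept_conf = torch.full((V, O), 0.5)

def scl_step(verb_feats, obj_feats, obj_labels, verb_clf,
             momentum=0.9, thresh=0.7):
    """One self-compositional learning step (illustrative sketch only).

    verb_feats: (N, D) verb (human) features; obj_feats: (N, D) object features;
    obj_labels: (N,) object category per instance; verb_clf: maps composites to verb scores.
    """
    n = verb_feats.size(0)
    # 1) Compose HOI instances: pair every verb feature with every object feature.
    comp = torch.cat([
        verb_feats.unsqueeze(1).expand(n, n, D),
        obj_feats.unsqueeze(0).expand(n, n, D),
    ], dim=-1).reshape(n * n, 2 * D)

    verb_scores = torch.sigmoid(verb_clf(comp))        # (N*N, V) verb predictions
    comp_objs = obj_labels.repeat(n)                   # object label of each composite

    # 2) Pseudo-label composites from the current concept confidence matrix.
    pseudo = (concept_conf[:, comp_objs].t() > thresh).float()   # (N*N, V)
    loss = F.binary_cross_entropy(verb_scores, pseudo)

    # 3) Update the concept confidence matrix with the composites' mean predictions (EMA).
    with torch.no_grad():
        for o in comp_objs.unique():
            mean_v = verb_scores[comp_objs == o].mean(dim=0)
            concept_conf[:, o] = momentum * concept_conf[:, o] + (1 - momentum) * mean_v
    return loss

# Usage with random features standing in for a detector's outputs:
clf = nn.Linear(2 * D, V)
loss = scl_step(torch.randn(8, D), torch.randn(8, D),
                torch.randint(0, O, (8,)), clf)
loss.backward()
```

In this sketch, composing every verb feature with every object feature in a batch yields N×N composite instances, so even verb-object combinations never observed in the annotations receive training signal once their concept confidence crosses the threshold; this is what lets the confidence matrix gradually surface unknown but reasonable HOI concepts.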
Acknowledgments
Mr. Zhi Hou and Dr. Baosheng Yu are supported by ARC FL-170100117, DP-180103424, IC-190100031, and LE-200100049.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hou, Z., Yu, B., Tao, D. (2022). Discovering Human-Object Interaction Concepts via Self-Compositional Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13687. Springer, Cham. https://doi.org/10.1007/978-3-031-19812-0_27
DOI: https://doi.org/10.1007/978-3-031-19812-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19811-3
Online ISBN: 978-3-031-19812-0
eBook Packages: Computer Science, Computer Science (R0)