Visual Compositional Learning for Human-Object Interaction Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)

Abstract

Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and infer the relationships between them. It is challenging because the enormous number of possible combinations of object and verb types forms a long-tail distribution. We devise a deep Visual Compositional Learning (VCL) framework, a simple yet efficient framework that addresses this problem. VCL first decomposes an HOI representation into object-specific and verb-specific features, and then composes new interaction samples in the feature space by stitching the decomposed features together. The integration of decomposition and composition enables VCL to share object and verb features among different HOI samples and images, and to generate new interaction samples and new types of HOI. It thus largely alleviates the long-tail distribution problem and benefits low-shot and zero-shot HOI detection. Extensive experiments demonstrate that the proposed VCL effectively improves the generalization of HOI detection on HICO-DET and V-COCO and outperforms recent state-of-the-art methods on HICO-DET. Code is available at https://github.com/zhihou7/VCL.
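
The decompose-then-compose idea described above can be illustrated with a minimal sketch. This is not the released VCL code: it assumes that verb and object features have already been extracted as plain NumPy arrays, and the function name `compose_hoi_features` and its arguments are hypothetical names chosen for illustration. Stitching is taken here to mean simple concatenation of one verb feature with one object feature, which may differ from the authors' implementation.

```python
import numpy as np


def compose_hoi_features(verb_feats, obj_feats, verb_labels, obj_labels):
    """Stitch every verb feature with every object feature (hypothetical helper).

    verb_feats:  (N, Dv) array of verb-specific features
    obj_feats:   (M, Do) array of object-specific features
    verb_labels: (N,) integer verb class ids
    obj_labels:  (M,) integer object class ids

    Returns composed features of shape (N*M, Dv+Do) and the matching
    (verb id, object id) pairs of shape (N*M, 2).
    """
    n, m = len(verb_feats), len(obj_feats)
    # Repeat each verb row m times so it lines up with every object row.
    v = np.repeat(verb_feats, m, axis=0)
    # Tile the whole object block n times, once per verb.
    o = np.tile(obj_feats, (n, 1))
    # Assumption: "stitching" is modelled as feature concatenation.
    feats = np.concatenate([v, o], axis=1)
    pairs = np.stack([np.repeat(verb_labels, m), np.tile(obj_labels, n)], axis=1)
    return feats, pairs


# Toy usage: 2 verb features and 3 object features, possibly from different images.
verb_feats = np.array([[1.0, 2.0], [3.0, 4.0]])
obj_feats = np.array([[10.0], [20.0], [30.0]])
verb_labels = np.array([0, 1])
obj_labels = np.array([5, 6, 7])
composed, pairs = compose_hoi_features(verb_feats, obj_feats, verb_labels, obj_labels)
```

Because verbs and objects are paired exhaustively, some composed pairs may correspond to verb-object combinations that never occur together in the training set, which is the mechanism the abstract credits for helping low-shot and zero-shot HOI categories.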

Keywords

Human-object interaction · Compositional learning

Notes

Acknowledgement

This work is partially supported by the Science and Technology Service Network Initiative of the Chinese Academy of Sciences (KFJ-STS-QYZX-092), the Guangdong Special Support Program (2016TX03X276), the National Natural Science Foundation of China (U1813218, U1713208), the Shenzhen Basic Research Program (JCYJ20170818164704758, CXB201104220032A), the Joint Lab of CAS-HK, and Australian Research Council Projects (FL-170100117).

Supplementary material

Supplementary material 1: 504470_1_En_35_MOESM1_ESM.pdf (PDF, 1272 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, Australia
  2. Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, China
