Detecting Human-Object Interactions with Action Co-occurrence Priors

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)

Abstract

A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution. The lack of positive labels can lead to low classification accuracy for these classes. Towards addressing this issue, we observe that there exist natural correlations and anti-correlations among human-object interactions. In this paper, we model these correlations as action co-occurrence matrices and present techniques to learn these priors and leverage them for more effective training, especially on rare classes. The utility of our approach is demonstrated experimentally: it exceeds state-of-the-art methods on both leading HOI detection benchmark datasets, HICO-Det and V-COCO.
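To make the notion of an action co-occurrence prior concrete, the sketch below shows one way such a matrix could be estimated from ground-truth action labels, where entry C[i, j] approximates the probability that action j occurs given that action i occurs for the same human-object pair. The function name, annotation format, and toy action labels are illustrative assumptions, not the paper's implementation, which further describes how the learned priors are exploited during training.

```python
import numpy as np

def action_cooccurrence_matrix(annotations, num_actions):
    """Estimate a conditional co-occurrence prior C, where C[i, j]
    approximates p(action j occurs | action i occurs), from per-instance
    sets of ground-truth action labels (hypothetical annotation format)."""
    counts = np.zeros((num_actions, num_actions), dtype=np.float64)
    for actions in annotations:
        for i in actions:
            for j in actions:
                counts[i, j] += 1.0          # counts[i, i] tallies occurrences of action i
    occurrences = np.maximum(counts.diagonal(), 1.0)  # guard against division by zero
    return counts / occurrences[:, None]      # row i normalized by the count of action i

# Toy usage with three hypothetical actions {0: "hold", 1: "sip", 2: "ride"}:
# "hold" and "sip" co-occur, while "ride" never appears with them, so the
# prior gives a high C[0, 1] and a zero C[0, 2] (an anti-correlation).
toy_annotations = [{0, 1}, {0, 1}, {0}, {2}]
print(action_cooccurrence_matrix(toy_annotations, num_actions=3))
```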

Notes

Acknowledgements

This work was supported by the Institute for Information & Communications Technology Promotion (2017-0-01772) grant funded by the Korea government.

Supplementary material

Supplementary material 1: 504479_1_En_43_MOESM1_ESM.pdf (PDF, 6.8 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. KAIST, Daejeon, South Korea
  2. Microsoft Research, Beijing, China