
Semantic Inference Network for Human-Object Interaction Detection

  • Hongyi Liu
  • Lisha Mo
  • Huimin Ma
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11901)

Abstract

Recently, many efforts have been made to understand the scenes in images, and the interactions between humans and objects are usually of great significance to scene understanding. In this paper, we focus on the task of human-object interaction (HOI) detection, which is to detect triplets ⟨human, verb, object⟩ in challenging daily images. We propose a novel model that introduces a semantic stream and a new form of loss function. Our intuition is that the semantic information of object classes is beneficial to HOI detection; we extract this semantic information by embedding the category labels of objects with a pre-trained BERT model. On the other hand, we find that the HOI task suffers severely from the extreme imbalance between positive and negative samples, and we propose a weighted focal loss (WFL) to tackle this problem. The results show that our method achieves a gain of 5% over our baseline.
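The weighted focal loss can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it combines the standard focal loss of Lin et al. (reference 4) with an extra positive-class weight `w_pos` to counter the positive/negative imbalance the abstract describes. The value of `w_pos` and the precise weighting scheme used in the paper are assumptions for illustration only.

```python
import math

def weighted_focal_loss(p, y, gamma=2.0, w_pos=10.0):
    """Per-sample focal loss with a positive-class weight.

    p: predicted probability of the positive class, in (0, 1)
    y: ground-truth label, 1 (positive) or 0 (negative)
    gamma: focusing parameter from Lin et al.; down-weights easy samples
    w_pos: illustrative up-weighting of the scarce positive samples
    """
    p_t = p if y == 1 else 1.0 - p      # probability of the true class
    alpha = w_pos if y == 1 else 1.0    # heavier weight on positives
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With this form, a confidently correct positive (e.g. `p = 0.9, y = 1`) contributes far less loss than a hard positive (`p = 0.5, y = 1`), while the `w_pos` factor keeps the few positives from being drowned out by the many negatives.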

Keywords

Human-object interaction · Visual relationship detection · Word embedding

Notes

Acknowledgement

This research is supported by the National Key Basic Research and Development Program of China (No. 2016YFB0100900) and the National Natural Science Foundation of China (No. 61773231).

References

  1. Li, Y., Ouyang, W., Zhou, B., Wang, K., Wang, X.: Scene graph generation from objects, phrases and region captions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1261–1270 (2017)
  2. Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419 (2017)
  3. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
  4. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
  5. Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint (2015)
  6. Gao, C., Zou, Y., Huang, J.-B.: iCAN: instance-centric attention network for human-object interaction detection. In: British Machine Vision Conference (2018)
  7. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
  8. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
  9. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  10. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
  11. Girshick, R., Radosavovic, I., Dollár, P., He, K., Gkioxari, G.: Detectron (2018). https://github.com/facebookresearch/Detectron
  12. Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3076–3086 (2017)
  13. Zhuang, B., Liu, L., Shen, C., Reid, I.: Towards context-aware interaction recognition for visual relationship detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 589–598 (2017)
  14. Liang, K., Guo, Y., Chang, H., Chen, X.: Visual relationship detection with deep structural ranking. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  15. Chao, Y.-W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 381–389. IEEE (2018)
  16. Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8359–8367 (2018)
  17. Xiao, H.: Bert-as-service (2018). https://github.com/hanxiao/bert-as-service

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Electronic Engineering, Tsinghua University, Beijing, China
