
Detection of Generic Human-Object Interactions in Video Streams

  • Lilli Bruckschen
  • Sabrina Amft
  • Julian Tanke
  • Jürgen Gall
  • Maren Bennewitz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11876)

Abstract

The detection of human-object interactions is a key component of many applications, including activity recognition, human intention understanding, and the prediction of human movements. In this paper, we propose a novel framework to detect such interactions in RGB-D video streams based on spatio-temporal and pose information. Our system first detects possible human-object interactions using position and pose data of humans and objects. To counter false positive and false negative detections, it then estimates the likelihood that such an interaction is really occurring by tracking it over subsequent frames. Previous work mainly focused on detecting specific activities with interacted objects in short prerecorded video clips. In contrast, our framework is able to find arbitrary interactions with 510 different objects, exploiting the detection capabilities of R-CNNs as well as the Open Images dataset, and can be used on online video streams. Our experimental evaluation demonstrates the robustness of the approach on various published videos recorded in indoor environments, on which the system achieves precision and recall rates of 0.82. Furthermore, we show that our system can be used for online human motion prediction in robotic applications.
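The abstract's idea of countering false positives and false negatives by tracking a candidate interaction over subsequent frames can be illustrated with a simple recursive Bayesian filter over per-frame detections. This is only a sketch of the general technique, not the authors' actual model; the hit rate `p_tp` and false-alarm rate `p_fp` are assumed values chosen for illustration.

```python
def update_likelihood(prior, detected, p_tp=0.9, p_fp=0.1):
    """One Bayesian update of the belief that an interaction is ongoing.

    prior    -- current belief P(interaction)
    detected -- whether the per-frame detector fired on this frame
    p_tp     -- assumed detector hit rate P(fire | interaction)
    p_fp     -- assumed false-alarm rate P(fire | no interaction)
    """
    if detected:
        num = p_tp * prior
        den = p_tp * prior + p_fp * (1.0 - prior)
    else:
        num = (1.0 - p_tp) * prior
        den = (1.0 - p_tp) * prior + (1.0 - p_fp) * (1.0 - prior)
    return num / den


def track(frame_detections, prior=0.5):
    """Accumulate evidence over a sequence of frames; returns the
    belief after each frame."""
    beliefs = []
    for detected in frame_detections:
        prior = update_likelihood(prior, detected)
        beliefs.append(prior)
    return beliefs


# A single spurious detection barely moves the belief, while a
# consistent run of detections drives it toward 1, so thresholding
# the belief suppresses one-frame false positives and negatives.
beliefs = track([True, False, True, True, True])
```

In a real system one would report the interaction only once the belief exceeds a threshold, so that isolated detector errors in single frames do not trigger or cancel a reported interaction.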

Keywords

Intention understanding · Video understanding · Domestic robots

Notes

Acknowledgments

We would like to thank Nils Dengler, Sandra Höltervennhoff, Sophie Jenke, Saskia Rabich, Jenny Mack, Marco Pinno, Mosadeq Saljoki and Dominik Wührer for their help during our experiments.

References

  1. Bruckschen, L., Dengler, N., Bennewitz, M.: Human motion prediction based on object interactions. In: Proceedings of the European Conference on Mobile Robots (ECMR) (2019)
  2. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  3. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
  4. Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  5. Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 31(10), 1775–1789 (2009)
  6. Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  7. Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification (2017)
  8. Li, H., Ye, C., Sample, A.P.: IDSense: a human object interaction detection system based on passive UHF RFID. In: Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI) (2015)
  9. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 40(12), 2935–2947 (2018)
  10. Prest, A., Ferrari, V., Schmid, C.: Explicit modeling of human-object interactions in realistic videos. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 35(4), 835–848 (2013)
  11. Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 34(3), 601–614 (2012)
  12. Rohrbach, A., et al.: Coherent multi-sentence video description with variable level of detail. In: Jiang, X., Hornegger, J., Koch, R. (eds.) GCPR 2014. LNCS, vol. 8753, pp. 184–195. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11752-2_15
  13. Weisstein, E.W.: Gamma function. From MathWorld. http://mathworld.wolfram.com/GammaFunction.html. Accessed 24 Feb 2019
  14. Yang, C., et al.: Knowledge-based role recognition by using human-object interaction and spatio-temporal analysis. In: Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO) (2017)
  15. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Lilli Bruckschen (1)
  • Sabrina Amft (3)
  • Julian Tanke (2)
  • Jürgen Gall (2)
  • Maren Bennewitz (1)
  1. Humanoid Robots Lab, University of Bonn, Bonn, Germany
  2. Institute of Computer Science III, University of Bonn, Bonn, Germany
  3. Human-Centered Security, Leibniz University Hannover, Hanover, Germany
