Detection of Generic Human-Object Interactions in Video Streams
The detection of human-object interactions is a key component in many applications, such as activity recognition, human intention understanding, and the prediction of human movements. In this paper, we propose a novel framework to detect such interactions in RGB-D video streams based on spatio-temporal and pose information. Our system first detects possible human-object interactions using position and pose data of humans and objects. To counter false positive and false negative detections, we calculate the likelihood that such an interaction really occurs by tracking it over subsequent frames. Previous work mainly focused on detecting specific activities, together with the objects involved, in short prerecorded video clips. In contrast, our framework can find arbitrary interactions with 510 different objects by exploiting the detection capabilities of R-CNNs and the Open Images dataset, and it can be used on online video streams. Our experimental evaluation demonstrates the robustness of the approach on various published videos recorded in indoor environments. The system achieves precision and recall rates of 0.82 on these videos. Furthermore, we show that our system can be used for online human motion prediction in robotic applications.
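The two-stage idea described above (per-frame candidate detection, followed by tracking over subsequent frames) can be illustrated with a minimal sketch. It is not the method of the paper: the distance test, the sliding-window average, and all names and parameter values (`Detection`, `is_candidate`, `InteractionTracker`, `max_dist`, `window`, `threshold`) are assumptions chosen only to show how temporal evidence can suppress isolated false positives and bridge isolated missed detections.

```python
from collections import deque
from dataclasses import dataclass
import math


@dataclass
class Detection:
    """Hypothetical per-frame observation: 3D positions in metres."""
    hand: tuple    # (x, y, z) of a hand keypoint from a pose estimator
    obj: tuple     # (x, y, z) of the detected object's centre


def is_candidate(det, max_dist=0.15):
    """Assumed per-frame test: the hand is within max_dist of the object."""
    return math.dist(det.hand, det.obj) <= max_dist


class InteractionTracker:
    """Accumulates per-frame evidence for one human-object pair.

    The likelihood here is simply the fraction of recent frames in which
    the pair was a candidate (an assumption, not the paper's formula).
    """

    def __init__(self, window=10, threshold=0.6):
        self.history = deque(maxlen=window)  # old frames drop out automatically
        self.threshold = threshold

    def update(self, det):
        """Add one frame; det is None when the detector missed the pair."""
        self.history.append(det is not None and is_candidate(det))
        return self.likelihood()

    def likelihood(self):
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)

    def is_interacting(self):
        return self.likelihood() >= self.threshold


if __name__ == "__main__":
    near = Detection(hand=(0.0, 0.0, 1.0), obj=(0.05, 0.0, 1.0))
    tracker = InteractionTracker(window=5, threshold=0.6)
    # One missed detection (None) in the middle of an ongoing interaction.
    for frame in (near, near, None, near, near):
        tracker.update(frame)
    print(tracker.likelihood(), tracker.is_interacting())
```

In this sketch a single missed frame lowers the score but does not end the interaction, and a single spurious candidate cannot start one on its own. A longer window trades responsiveness for stability, which matters when the output feeds an online motion predictor.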
Keywords: Intention understanding, Video understanding, Domestic robots
We would like to thank Nils Dengler, Sandra Höltervennhoff, Sophie Jenke, Saskia Rabich, Jenny Mack, Marco Pinno, Mosadeq Saljoki and Dominik Wührer for their help during our experiments.