Abstract
Humans regularly interact with their surrounding objects, and such interactions often produce strongly correlated motions between the human and the interacting object. We thus ask: “Is it possible to infer object properties from skeletal motion alone, even without seeing the interacting object itself?” In this paper, we present a fine-grained action recognition method that learns to infer such latent object properties from human interaction motion alone. This inference allows us to disentangle the motion from the object property and to transfer object properties to a given motion. We collected a large number of videos and 3D skeletal motions of performing actors using an inertial motion capture device. By analyzing similar actions and learning the subtle differences between them, we reveal latent properties of the interacting objects. In particular, we learned to identify the interacting object, estimate its weight, and assess its spillability. Our results clearly demonstrate that motions and interacting objects are highly correlated, and that the related latent object properties can be inferred from 3D skeleton sequences alone, leading to new synthesis possibilities for motions involving human interaction. Our dataset is available at http://vcc.szu.edu.cn/research/2020/IT.html.
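The core idea of the abstract can be illustrated with a minimal sketch: extract simple kinematic statistics from a 3D skeleton sequence and classify a latent object property (here, a coarse “weight” class) from motion alone. The feature choice, the nearest-centroid classifier, and the centroid values below are illustrative assumptions for exposition, not the authors' actual learned model.

```python
# Hypothetical sketch: inferring a latent object property (coarse weight class)
# from a 3D skeleton sequence alone. Features and classifier are assumptions,
# not the paper's network.
from statistics import mean

def kinematic_features(seq):
    """seq: list of frames; each frame is a list of (x, y, z) joint positions.
    Returns simple motion statistics: (mean, peak) per-frame joint speed."""
    speeds = []
    for prev, cur in zip(seq, seq[1:]):
        frame_speed = mean(
            sum((c - p) ** 2 for c, p in zip(cj, pj)) ** 0.5
            for pj, cj in zip(prev, cur)
        )
        speeds.append(frame_speed)
    return (mean(speeds), max(speeds))

def nearest_centroid(features, centroids):
    """Classify by Euclidean distance to per-class feature centroids."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(features, centroids[label]))

# Hypothetical usage: a fast arm trajectory suggests a light object.
lift = [[(0.0, 0.0, 0.0)], [(0.2, 0.1, 0.0)], [(0.4, 0.2, 0.0)]]
centroids = {"light": (0.25, 0.3), "heavy": (0.04, 0.06)}  # assumed, from training
print(nearest_centroid(kinematic_features(lift), centroids))  # prints: light
```

In the paper itself, hand-crafted statistics like these are replaced by a learned representation over the full skeleton sequence, but the pipeline shape is the same: motion in, latent object property out.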
Acknowledgements
We sincerely thank the reviewers for their valuable comments. This work was supported in part by Shenzhen Innovation Program (JCYJ20180305125709986), National Natural Science Foundation of China (61861130365, 61761146002), GD Science and Technology Program (2020A0505100064, 2015A030312015), and DEGP Key Project (2018KZDXM058).
Author information
Qian Zheng received her doctoral degree in computer science from Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, in 2015. She is an assistant professor in the College of Computer Science and Software Engineering, Shenzhen University. Her interests include computer graphics and information visualization.
Weikai Wu is a software engineer at TCL. He received his M.S. degree in computer science from Shenzhen University in 2020.
Hanting Pan is a software engineer at Orbbec. He received his M.S. degree in computer science from Shenzhen University in 2020.
Niloy Mitra leads the Smart Geometry Processing Group in the Department of Computer Science at University College London. He received his Ph.D. degree from Stanford University under the guidance of Leonidas Guibas. His research interests include shape analysis, creativeAI, and computational design and fabrication. Niloy received the Eurographics Outstanding Technical Contributions Award in 2019, the BCS Roger Needham Award in 2015, and the ACM Siggraph Significant New Researcher Award in 2013.
Daniel Cohen-Or is a professor in the School of Computer Science, Tel Aviv University. He received his Ph.D. degree from the State University of New York at Stony Brook in 1991. He was the recipient of a Eurographics Outstanding Technical Contributions Award in 2005, and an ACM SIGGRAPH Computer Graphics Achievement Award in 2018. In 2019 he won a Kadar Family Award for Outstanding Research. In 2020 he received a Eurographics Distinguished Career Award. His research interests are in computer graphics, in particular, synthesis, processing, and modeling techniques.
Hui Huang is a Distinguished TFA Professor at Shenzhen University, where she directs the Visual Computing Research Center. She received her Ph.D. degree in applied math from The University of British Columbia in 2008. Her research interests span computer graphics, 3D vision, and visualization. She is currently a Senior Member of IEEE/ACM/CSIG, a Distinguished Member of CCF, and is on the editorial board of ACM Trans. on Graphics and Computers & Graphics.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zheng, Q., Wu, W., Pan, H. et al. Inferring object properties from human interaction and transferring them to new motions. Comp. Visual Media 7, 375–392 (2021). https://doi.org/10.1007/s41095-021-0218-8