Vision During Action: Extracting Contact and Motion from Manipulation Videos—Toward Parsing Human Activity

Zampogiannis, Konstantinos; Ganguly, Kanishka; Fermüller, Cornelia; Aloimonos, Yiannis

doi:10.1007/978-3-030-46732-6_9

Abstract

When we physically interact with our environment using our hands, we touch objects and force them to move: contact and motion are defining properties of manipulation. In this paper, we present an active, bottom-up method for the detection of actor–object contacts and the extraction of moved objects and their motions in RGBD videos of manipulation actions. At the core of our approach lies non-rigid registration: we continuously warp a point cloud model of the observed scene to the current video frame, generating a set of dense 3D point trajectories. Under loose assumptions, we employ simple point cloud segmentation techniques to extract the actor and subsequently detect actor–environment contacts based on the estimated trajectories. For each such interaction, using the detected contact as an attention mechanism, we obtain an initial motion segment for the manipulated object by clustering trajectories in the contact area vicinity and then we jointly refine the object segment and estimate its 6DOF pose in all observed frames. Because of its generality and the fundamental, yet highly informative, nature of its outputs, our approach is applicable to a wide range of perception and planning tasks. We qualitatively evaluate our method on a number of input sequences and present a comprehensive robot imitation learning example, in which we demonstrate the crucial role of our outputs in developing action representations/plans from observation.
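As an illustration of the contact-detection step, the following is a minimal sketch, not the chapter's implementation. It assumes that the actor has already been segmented from the scene and that a contact is declared whenever an actor point comes within a fixed distance of an environment point. The function name `detect_contacts`, the `threshold` parameter, and the list-of-tuples data layout are hypothetical choices made for this example; the abstract does not state the actual criterion.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def detect_contacts(actor_frames, env_frames, threshold=0.01):
    """Return (frame_index, env_point_index) pairs for frames with contact.

    actor_frames / env_frames: one list of (x, y, z) tuples per frame.
    threshold: hypothetical contact distance, in the units of the points.
    For each frame, the environment point closest to the actor is reported
    if its distance to the actor is at most `threshold`.
    """
    contacts = []
    for t, (actor_pts, env_pts) in enumerate(zip(actor_frames, env_frames)):
        if not actor_pts:
            continue  # actor not visible in this frame
        best = None  # (distance, env_point_index) of the closest candidate
        for j, q in enumerate(env_pts):
            d = min(dist(p, q) for p in actor_pts)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, j)
        if best is not None:
            contacts.append((t, best[1]))
    return contacts


# Toy example: a single "fingertip" point approaching a static point at the origin.
actor = [[(0.0, 0.0, 0.5)], [(0.0, 0.0, 0.2)], [(0.0, 0.0, 0.005)]]
env = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]] * 3
print(detect_contacts(actor, env))
```

In a pipeline like the one described, the reported environment point would serve as the attention seed: trajectories near it could then be clustered to obtain the initial segment of the manipulated object.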

Acknowledgements

The support of ONR under grant award N00014-17-1-2622 and the support of the National Science Foundation under grants SMA 1540916 and CNS 1544787 are gratefully acknowledged.

Author information

Authors and Affiliations

University of Maryland, College Park, MD, USA
Konstantinos Zampogiannis, Kanishka Ganguly, Cornelia Fermüller & Yiannis Aloimonos

Corresponding author

Correspondence to Yiannis Aloimonos.

Editor information

Editors and Affiliations

MaLGa Center - DIBRIS, Università di Genova, Genoa, Italy
Nicoletta Noceti
Contact Unit, Istituto Italiano di Tecnologia, Genoa, Italy
Alessandra Sciutti
RBCS Unit, Istituto Italiano di Tecnologia, Genoa, Italy
Francesco Rea

About this chapter

Cite this chapter

Zampogiannis, K., Ganguly, K., Fermüller, C., Aloimonos, Y. (2020). Vision During Action: Extracting Contact and Motion from Manipulation Videos—Toward Parsing Human Activity. In: Noceti, N., Sciutti, A., Rea, F. (eds) Modelling Human Motion. Springer, Cham. https://doi.org/10.1007/978-3-030-46732-6_9

DOI: https://doi.org/10.1007/978-3-030-46732-6_9
Published: 10 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46731-9
Online ISBN: 978-3-030-46732-6
eBook Packages: Computer Science, Computer Science (R0)
