Abstract
When we physically interact with our environment using our hands, we touch objects and force them to move: contact and motion are defining properties of manipulation. In this paper, we present an active, bottom-up method for the detection of actor–object contacts and the extraction of moved objects and their motions in RGBD videos of manipulation actions. At the core of our approach lies non-rigid registration: we continuously warp a point cloud model of the observed scene to the current video frame, generating a set of dense 3D point trajectories. Under loose assumptions, we employ simple point cloud segmentation techniques to extract the actor and subsequently detect actor–environment contacts based on the estimated trajectories. For each such interaction, using the detected contact as an attention mechanism, we obtain an initial motion segment for the manipulated object by clustering trajectories in the vicinity of the contact area, and then we jointly refine the object segment and estimate its 6DOF pose in all observed frames. Because of its generality and the fundamental, yet highly informative, nature of its outputs, our approach is applicable to a wide range of perception and planning tasks. We qualitatively evaluate our method on a number of input sequences and present a comprehensive robot imitation learning example, in which we demonstrate the crucial role of our outputs in developing action representations and plans from observation.
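As a rough illustration of the contact detection step described above, the following is a minimal sketch and not the chapter's implementation. It assumes that the actor and the rest of the scene are already available as per-frame lists of 3D points, and that a contact can be declared when an actor point comes within a fixed distance of a scene point. The function name `detect_contact`, the `threshold` parameter, and the proximity criterion itself are hypothetical choices made for this example.

```python
from math import dist, inf

def detect_contact(actor_frames, scene_frames, threshold=0.01):
    """Return (frame_index, scene_point_index) for the first frame in which
    some actor point lies within `threshold` of a scene point, else None.

    Assumption (not from the chapter): contact is approximated by simple
    point-to-point proximity between the two segmented point sets.
    """
    for t, (actor_pts, scene_pts) in enumerate(zip(actor_frames, scene_frames)):
        best = None  # (distance, scene point index) of the closest hit so far
        for j, q in enumerate(scene_pts):
            # Distance from scene point q to its nearest actor point.
            d = min((dist(p, q) for p in actor_pts), default=inf)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, j)
        if best is not None:
            return t, best[1]
    return None

# Toy sequence: one fingertip point descending toward a static scene.
actor_frames = [
    [(0.0, 0.0, 0.30)],
    [(0.0, 0.0, 0.15)],
    [(0.0, 0.0, 0.005)],
]
scene_frames = [[(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]] * 3

contact = detect_contact(actor_frames, scene_frames)
```

In the toy sequence, the returned scene point index marks where the contact occurred; in a pipeline like the one described, such a location could seed the trajectory clustering around the contact area.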
Acknowledgements
The support of ONR under grant award N00014-17-1-2622 and the support of the National Science Foundation under grants SMA 1540916 and CNS 1544787 are gratefully acknowledged.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zampogiannis, K., Ganguly, K., Fermüller, C., Aloimonos, Y. (2020). Vision During Action: Extracting Contact and Motion from Manipulation Videos—Toward Parsing Human Activity. In: Noceti, N., Sciutti, A., Rea, F. (eds) Modelling Human Motion. Springer, Cham. https://doi.org/10.1007/978-3-030-46732-6_9
DOI: https://doi.org/10.1007/978-3-030-46732-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46731-9
Online ISBN: 978-3-030-46732-6
eBook Packages: Computer Science, Computer Science (R0)