Vision During Action: Extracting Contact and Motion from Manipulation Videos—Toward Parsing Human Activity

Chapter in: Modelling Human Motion

Abstract

When we physically interact with our environment using our hands, we touch objects and force them to move: contact and motion are defining properties of manipulation. In this paper, we present an active, bottom-up method for the detection of actor–object contacts and the extraction of moved objects and their motions in RGBD videos of manipulation actions. At the core of our approach lies non-rigid registration: we continuously warp a point cloud model of the observed scene to the current video frame, generating a set of dense 3D point trajectories. Under loose assumptions, we employ simple point cloud segmentation techniques to extract the actor and subsequently detect actor–environment contacts based on the estimated trajectories. For each such interaction, using the detected contact as an attention mechanism, we obtain an initial motion segment for the manipulated object by clustering trajectories in the vicinity of the contact area, and then jointly refine the object segment and estimate its 6DOF pose in all observed frames. Because of its generality and the fundamental, yet highly informative, nature of its outputs, our approach is applicable to a wide range of perception and planning tasks. We qualitatively evaluate our method on a number of input sequences and present a comprehensive robot imitation learning example, in which we demonstrate the crucial role of our outputs in developing action representations/plans from observation.
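To make the pose-estimation step concrete, the sketch below shows one way a 6DOF rigid transform can be recovered for a segmented object from its dense 3D point trajectories, using the standard SVD-based (Kabsch/Umeyama) closed-form alignment. This is a minimal illustration under assumed inputs (a hypothetical `cluster_traj` array of per-frame trajectory positions), not the chapter's exact joint segmentation/pose refinement procedure.

```python
import numpy as np

def rigid_pose(src_pts, dst_pts):
    # Closed-form SVD-based (Kabsch/Umeyama) estimate of the rigid transform
    # (R, t) mapping src_pts onto dst_pts; both are (N, 3) arrays of
    # corresponding 3D points, e.g. one trajectory cluster in two frames.
    src_c, dst_c = src_pts.mean(axis=0), dst_pts.mean(axis=0)
    H = (src_pts - src_c).T @ (dst_pts - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def object_poses(cluster_traj):
    # cluster_traj: hypothetical (num_frames, num_points, 3) array holding the
    # 3D positions of one trajectory cluster (the moved object) in every frame.
    # Returns the object's pose in each frame relative to frame 0.
    return [rigid_pose(cluster_traj[0], cluster_traj[k])
            for k in range(cluster_traj.shape[0])]
```

In the chapter's pipeline, such per-frame rigid fits would operate on the trajectories clustered around a detected contact; points whose motion is inconsistent with the fitted transform can then be pruned to refine the object segment.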

Acknowledgements

The support of ONR under grant award N00014-17-1-2622 and of the National Science Foundation under grants SMA 1540916 and CNS 1544787 is gratefully acknowledged.

Author information

Corresponding author

Correspondence to Yiannis Aloimonos.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Zampogiannis, K., Ganguly, K., Fermüller, C., Aloimonos, Y. (2020). Vision During Action: Extracting Contact and Motion from Manipulation Videos—Toward Parsing Human Activity. In: Noceti, N., Sciutti, A., Rea, F. (eds) Modelling Human Motion. Springer, Cham. https://doi.org/10.1007/978-3-030-46732-6_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-46732-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46731-9

  • Online ISBN: 978-3-030-46732-6

  • eBook Packages: Computer Science, Computer Science (R0)
