Prediction of Manipulation Actions

Published in: International Journal of Computer Vision

Abstract

By looking at a person’s hands, one can often tell what the person is going to do next, how the hands are moving, and where they will be, because an actor’s intentions shape the movement kinematics during action execution. Similarly, active systems with real-time constraints cannot rely solely on passive video-segment classification; they must continuously update their estimates and predict future actions. In this paper, we study the prediction of dexterous actions. We recorded videos of subjects performing different manipulation actions on the same object, such as “squeezing”, “flipping”, “washing”, “wiping”, and “scratching” with a sponge. In psychophysical experiments, we evaluated human observers’ skill in predicting actions from video sequences of different lengths, depicting the hand movement during the preparation and execution of actions before and after contact with the object. We then developed a recurrent neural network based method for action prediction that takes as input image patches around the hand. We used the same formalism to predict the forces on the fingertips, training on synchronized video and force data streams. Evaluations on two new datasets show that our system closely matches human performance in the recognition task and demonstrate that our algorithms can predict in real time what dexterous action is being performed and how.
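To make the approach concrete, below is a minimal sketch (in PyTorch) of the kind of recurrent model described above: an LSTM that reads per-frame features computed from image patches around the hand and, at every time step, outputs both a distribution over manipulation actions and an estimate of the fingertip forces. This is not the authors' implementation; the feature dimension, hidden size, number of action classes, and number of force channels are hypothetical placeholders.

# Minimal sketch, not the paper's code: an LSTM over per-frame hand-patch
# features with two heads, one for online action classification and one for
# fingertip-force regression. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class HandActionPredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=128, num_actions=5, num_force=4):
        super().__init__()
        # Recurrent core over the sequence of per-frame patch features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Per-frame action logits, updated as each new frame arrives.
        self.action_head = nn.Linear(hidden_dim, num_actions)
        # Per-frame fingertip-force estimates.
        self.force_head = nn.Linear(hidden_dim, num_force)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) features of image patches around the hand.
        h, _ = self.lstm(feats)
        return self.action_head(h), self.force_head(h)

# Toy usage: 2 clips, 30 frames each, 512-dimensional features per frame.
model = HandActionPredictor()
feats = torch.randn(2, 30, 512)
action_logits, forces = model(feats)
print(action_logits.shape, forces.shape)  # torch.Size([2, 30, 5]) torch.Size([2, 30, 4])

Reading out the action logits at every time step, rather than only at the end of the clip, is what allows the estimate to be updated continuously as new frames arrive.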





Acknowledgements

This work was supported by the National Science Foundation under Grants SMA 1540917 and CNS 1544797, by Samsung under the GRO Program (Nos. 20477, 355022), and by DARPA through U.S. Army Grant W911NF-14-1-0384.

Author information

Corresponding author

Correspondence to Cornelia Fermüller.

Additional information

Communicated by Deva Ramanan and Cordelia Schmid.


About this article


Cite this article

Fermüller, C., Wang, F., Yang, Y. et al. Prediction of Manipulation Actions. Int J Comput Vis 126, 358–374 (2018). https://doi.org/10.1007/s11263-017-0992-z

