Prediction of Manipulation Actions

Published in: International Journal of Computer Vision

Abstract

By looking at a person’s hands, one can often tell what the person is going to do next, how the hands are moving, and where they will be, because an actor’s intentions shape the movement kinematics during action execution. Similarly, active systems with real-time constraints cannot rely solely on passive video-segment classification; they must continuously update their estimates and predict future actions. In this paper, we study the prediction of dexterous actions. We recorded videos of subjects performing different manipulation actions on the same object, such as “squeezing”, “flipping”, “washing”, “wiping”, and “scratching” with a sponge. In psychophysical experiments, we evaluated human observers’ skill in predicting actions from video sequences of different lengths, depicting the hand movement during the preparation and execution of actions before and after contact with the object. We then developed a recurrent neural network based method for action prediction that takes as input image patches around the hand. We used the same formalism to predict the forces on the fingertips, training on synchronized video and force data streams. Evaluations on two new datasets show that our system closely matches human performance in the recognition task and demonstrate that our algorithms can predict in real time what dexterous action is being performed and how.
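To make the approach concrete, below is a minimal sketch (in PyTorch) of the kind of recurrent model described above: an LSTM that reads per-frame features computed from image patches around the hand and, at every time step, outputs both a distribution over manipulation actions and an estimate of the fingertip forces. This is not the authors' implementation; the feature dimension, hidden size, number of action classes, and number of force channels are hypothetical placeholders.

# Minimal sketch, not the paper's code: an LSTM over per-frame hand-patch
# features with two heads, one for online action classification and one for
# fingertip-force regression. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class HandActionPredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=128, num_actions=5, num_force=4):
        super().__init__()
        # Recurrent core over the sequence of per-frame patch features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Per-frame action logits, updated as each new frame arrives.
        self.action_head = nn.Linear(hidden_dim, num_actions)
        # Per-frame fingertip-force estimates.
        self.force_head = nn.Linear(hidden_dim, num_force)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) features of image patches around the hand.
        h, _ = self.lstm(feats)
        return self.action_head(h), self.force_head(h)

# Toy usage: 2 clips, 30 frames each, 512-dimensional features per frame.
model = HandActionPredictor()
feats = torch.randn(2, 30, 512)
action_logits, forces = model(feats)
print(action_logits.shape, forces.shape)  # torch.Size([2, 30, 5]) torch.Size([2, 30, 4])

Reading out the action logits at every time step, rather than only at the end of the clip, is what allows the estimate to be updated continuously as new frames arrive.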





Acknowledgements

This work was supported by the National Science Foundation under Grants SMA 1540917 and CNS 1544797, by Samsung under the GRO Program (Nos. 20477, 355022), and by DARPA through U.S. Army Grant W911NF-14-1-0384.

Author information

Corresponding author

Correspondence to Cornelia Fermüller.

Additional information

Communicated by Deva Ramanan and Cordelia Schmid.


About this article


Cite this article

Fermüller, C., Wang, F., Yang, Y. et al. Prediction of Manipulation Actions. Int J Comput Vis 126, 358–374 (2018). https://doi.org/10.1007/s11263-017-0992-z

