Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences

Published in: International Journal of Computer Vision

Abstract

Understanding continuous human actions is a non-trivial but important problem in computer vision. Although there exists a large corpus of work on the recognition of action sequences, most approaches suffer from problems relating to vast variations in motions, action combinations, and scene contexts. In this paper, we introduce a novel method for semantic segmentation and recognition of long and complex manipulation action tasks, such as “preparing breakfast” or “making a sandwich”. We represent manipulations with our recently introduced “Semantic Event Chain” (SEC) concept, which captures the underlying spatiotemporal structure of an action invariant to motion, velocity, and scene context. Based solely on the spatiotemporal interactions between manipulated objects and hands in the extracted SEC, the framework automatically parses individual manipulation streams performed either sequentially or concurrently. Using event chains, our method further extracts basic primitive elements of each parsed manipulation. Without requiring any prior object knowledge, the proposed framework can also extract object-like scene entities that exhibit the same role in semantically similar manipulations. We conduct extensive experiments on various recent datasets to validate the robustness of the framework.
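
To make the representation concrete for readers of this preview, the following minimal Python sketch illustrates the kind of structure a Semantic Event Chain encodes: a table of spatial relations between pairs of tracked scene entities, with one column per key frame at which some relation changes. The entity names and the specific rows are illustrative assumptions; only the relation symbols ('T' for touching, 'N' for not touching) follow the notation used in the Appendix.

# Hypothetical event chain for a short "cutting" fragment (illustration only).
# Rows are entity pairs, columns are key frames; symbols are 'N' / 'T'.
sec = {
    ('hand', 'knife'):  ['N', 'T', 'T', 'T', 'N'],
    ('knife', 'bread'): ['N', 'N', 'T', 'N', 'N'],
    ('hand', 'bread'):  ['N', 'N', 'N', 'N', 'N'],
}

Because such a chain abstracts away trajectories, velocities, and object appearance, two executions of the same manipulation that differ in these respects yield the same or very similar chains, which is the invariance the recognition and decomposition framework builds on.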

References

  • Abramov, A., Aksoy, E.E., Dörr, J., Pauwels, K., Wörgötter, F., Dellen, B. (2010). 3D semantic representation of actions from efficient stereo-image-sequence segmentation on GPUs. In 5th International Symposium 3D Data Processing, Visualization and Transmission (pp. 1–8).

  • Abramov, A., Pauwels, K., Papon, J., Wörgötter, F., & Dellen, B. (2012). Depth-supported real-time video segmentation with the kinect. In IEEE Workshop on Applications of Computer Vision (pp. 457–464).

  • Ahad, M. A. R. (2011). Computer vision and action recognition: A guide for image processing and computer vision community for action understanding. New York: Atlantis Publishing Corporation.

  • Aksoy, E. E., Abramov, A., Dörr, J., Ning, K., Dellen, B., & Wörgötter, F. (2011). Learning the semantics of object-action relations by observation. The International Journal of Robotics Research, 30, 1229–1249.

  • Aksoy, E. E., Abramov, A., Wörgötter, F., & Dellen, B. (2010). Categorizing object-action relations from semantic scene graphs. In IEEE International Conference on Robotics and Automation (pp. 398–405).

  • Aksoy, E. E., Tamosiunaite, M., Vuga, R., Ude, A., Geib, C., Steedman, M., & Wörgötter, F. (2013). Structural bootstrapping at the sensorimotor level for the fast acquisition of action knowledge for cognitive robots. In IEEE International Conference on Development and Learning and Epigenetic Robotics (pp. 1–8).

  • Aksoy, E. E., Tamosiunaite, M., & Wörgötter, F. (2015). Model-free incremental learning of the semantics of manipulation actions. Robotics and Autonomous Systems, 71, 118–133.

  • Anscombe, G. E. M. (1963). Intention. Ithaca: Cornell University Press.

  • Badler, N. (1975). Temporal scene analysis: Conceptual descriptions of object movements. Ph.D. thesis. University of Toronto, Toronto.

  • Bobick, A. F., & Ivanov, Y. A. (1998). Action recognition using probabilistic parsing. In Computer Vision and Pattern Recognition (pp. 196–202).

  • Bobick, A. F., & Davis, J. W. (2001). The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine intelligence, 23, 257–267.

  • Brand, M. (1996). Understanding manipulation in video. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (pp. 94–99).

  • Brand, M. (1997). The inverse hollywood problem: From video to scripts and storyboards via causal analysis. In Proceedings, AAAI97 (pp. 12–96).

  • Bullock, I. M., Ma, R. R., & Dollar, A. M. (2013). A hand-centric classification of human and robot dexterous manipulation. IEEE Transactions on Haptics, 6, 129–144.

  • Chen, H. S., Chen, H. T., Chen, Y. W., & Lee, S. Y. (2006). Human action recognition using star skeleton. In Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks (pp. 171–178).

  • Cutkosky, M. R. (1989). On grasp choice, grasp models, and the design of hands for manufacturing tasks. In IEEE International Conference on Robotics and Automation (pp. 269–279).

  • Danelljan, M., Khan, F. S., Felsberg, M., & van de Weijer, J. (2014). Adaptive color attributes for real-time visual tracking. In Computer Vision and Pattern Recognition (pp. 1090–1097).

  • Efros, A. A., Berg, A. C., Mori, G., Malik, J. (2003). Recognizing action at a distance. In IEEE International Conference on Computer Vision (pp. 726–733).

  • Ekvall, S. & Kragic, D. (2005). Grasp recognition for programming by demonstration. In IEEE International Conference on Robotics and Automation (pp. 748–753).

  • Elliott, J. M., & Connolly, K. J. (1984). A classification of manipulative hand movements. Developmental Medicine & Child Neurology, 26, 283–296.

  • Fathi, A., Farhadi, A., & Rehg, J. M. (2011). Understanding egocentric activities. In International Conference on Computer Vision (pp. 407–414).

  • Feix, T., Pawlik, R., Schmiedmayer, H., Romero, J., & Kragic, D. (2009). A comprehensive grasp taxonomy. In Robotics, Science and Systems: Workshop on Understanding the Human Hand for Advancing Robotic Manipulation (pp. 407–414).

  • Fern, A., Siskind, J. M., & Givan, R. (2002). Learning temporal, relational, force-dynamic event definitions from video. In National Conference on Artificial Intelligence (pp. 159–166).

  • Fernando, B., Gavves, E., Oramas, M. J., Ghodrati, A., & Tuytelaars, T. (2015). Modeling video evolution for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 5378–5387).

  • Graf, J., Puls, S., & Wörn, H. (2010). Recognition and understanding situations and activities with description logics for safe human-robot cooperation. In The Second International Conference on Advanced Cognitive Technologies and Applications: Cognitive 2010 (p. 7).

  • Gupta, A., & Davis, L. (2007). Objects in action: An approach for combining action understanding and object perception. In Computer Vision and Pattern Recognition (pp. 1–8).

  • Gupta, A., Kembhavi, A., & Davis, L. (2009). Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 1775–1789.

  • Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 3265–3272).

  • Ke, Y., Sukthankar, R., & Hebert, M. (2007). Event detection in crowded videos. In IEEE International Conference on Computer Vision.

  • Kjellström, H., Romero, J., & Kragić, D. (2011). Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, 115, 81–90.

  • Koo, S. Y., Lee, D., & Kwon, D. S. (2014). Incremental object learning and robust tracking of multiple objects from RGB-D point set data. Journal of Visual Communication and Image Representation, 25, 108–121.

  • Koppula, H. S., Gupta, R., & Saxena, A. (2013). Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research, 32, 951–970.

  • Krüger, N., Geib, C., Piater, J., Petrick, R., Steedman, M., Wörgötter, F., et al. (2011). Object-action complexes: Grounded abstractions of sensory-motor processes. Robotics and Autonomous Systems, 59, 740–757.

  • Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64, 107–123.

  • Laptev, I., & Lindeberg, T. (2003). Space-time interest points. In International Conference on Computer Vision (pp. 432–439).

  • Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In Computer Vision and Pattern Recognition (pp. 1–8).

  • Lee, K., Su, Y., Kim, T. K., & Demiris, Y. (2013). A syntactic approach to robot imitation learning using probabilistic activity grammars. Robotics and Autonomous Systems, 61, 1323–1334.

  • Li, Y., Ye, Z., & Rehg, J. M. (2015). Delving into egocentric actions. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Liu, J., Feng, F., Nakamura, Y. C., & Pollard, N. S. (2016). Annotating everyday grasps in action. In J.-P. Laumond & N. Abe (Eds.), Dance notations and robot motion (pp. 263–282). Springer Tracts in Advanced Robotics. Berlin: Springer.

  • Luo, G., Bergstrom, N., Ek, C., & Kragic, D. (2011). Representing actions with kernels. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2028–2035).

  • Lv, F., & Nevatia, R. (2006). Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In European Conference on Computer Vision (Vol. IV, pp. 359–372).

  • Martinez, D., Alenya, G., Jimenez, P., Torras, C., Rossmann, J., Wantia, N., Aksoy, E.E., Haller, S., & Piater, J. (2014). Active learning of manipulation sequences. In IEEE International Conference on Robotics and Automation (pp. 5671–5678).

  • Mele, A. (1992). Springs of action: Understanding intentional behavior. Oxford: Oxford University Press.

  • Minnen, D., Essa, I., & Starner, T. (2003). Expectation grammars: Leveraging high-level expectations for activity recognition. In Computer Vision and Pattern Recognition (pp. 626–632).

  • Nagahama, K., Yamazaki, K., Okada, K., & Inaba, M. (2013). Manipulation of multiple objects in close proximity based on visual hierarchical relationships. In IEEE International Conference on Robotics and Automation (pp. 1303–1310).

  • Papon, J., Abramov, A., Aksoy, E. E., & Wörgötter, F. (2012). A modular system architecture for online parallel vision pipelines. In IEEE Workshop on Applications of Computer Vision (WACV) (pp. 361–368).

  • Pardowitz, M., Haschke, R., Steil, J., & Ritter, H. (2008). Gestalt-based action segmentation for robot task learning. In IEEE-RAS International Conference on Humanoid Robots (pp. 347–352).

  • Pauwels, K., Krüger, N., Lappe, M., Wörgötter, F., & Van Hulle, M. (2010). A cortical architecture on parallel hardware for motion processing in real time. Journal of Vision, 10, 18.

  • Pei, M., Si, Z., Yao, B., & Zhu, S. C. (2013). Video event parsing and learning with goal and intent prediction. Computer Vision and Image Understanding, 117, 1369–1383.

  • Peursum, P., Bui, H. H., Venkatesh, S., & West, G. A. W. (2004). Human action segmentation via controlled use of missing data in HMMs. In International Conference on Pattern Recognition (pp. 440–445).

  • Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28, 976–990.

  • Ramirez-Amaro, K., Kim, E.S., Kim, J., Zhang, B.T., Beetz, M., & Cheng, G. (2013). Enhancing human action recognition through spatio-temporal feature learning and semantic rules. In IEEE-RAS International Conference on Humanoid Robots.

  • Rohrbach, M., Amin, S., Andriluka, M., & Schiele, B. (2012). A database for fine grained activity detection of cooking activities. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Rui, Y., & Anandan, P. (2000). Segmenting visual actions based on spatio-temporal motion patterns. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 111–118).

  • Ryoo, M. S., & Aggarwal, J. K. (2000). Recognition of composite human activities through context-free grammar based representation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 1709–1718).

  • Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (Vol. 3, pp. 32–36).

  • Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th International Conference on Multimedia (pp. 357–360).

  • Shi, Q., Cheng, L., Wang, L., & Smola, A. (2011). Human action segmentation and recognition using discriminative semi-Markov models. International Journal of Computer Vision, 93, 22–32.

  • Siskind, J. (1994). Grounding language in perception. Artificial Intelligence Review, 8, 371–391.

  • Siskind, J., & Morris, Q. (1996). A maximum-likelihood approach to visual event classification. In European Conference on Computer Vision (pp. 347–360).

  • Sminchisescu, C., Kanaujia, A., & Metaxas, D. (2006). Conditional models for contextual human motion recognition. Computer Vision and Image Understanding, 104, 210–220.

  • Sridhar, M., Cohn, A. G., & Hogg, D. (2008). Learning functional object-categories from a relational spatio-temporal representation. In Proceedings of 18th European Conference on Artificial Intelligence (pp. 606–610).

  • Thuc, H. L. U., Ke, S. R., Hwang, J. N., Tuan, P. V., & Chau, T. N. (2012). Quasi-periodic action recognition from monocular videos via 3D human models and cyclic HMMs. In International Conference on Advanced Technologies for Communications (pp. 110–113).

  • Vitaladevuni, S. N. P., Kellokumpu, V., & Davis, L. S. (2008). Action recognition using ballistic dynamics. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

  • Vuga, R., Aksoy, E. E., Wörgötter, F., & Ude, A. (2013). Augmenting semantic event chains with trajectory information for learning and recognition of manipulation tasks. In International Workshop on Robotics in Alpe-Adria-Danube Region (RAAD).

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2011). Action recognition by dense trajectories. In IEEE Conference on Computer Vision & Pattern Recognition (pp. 3169–3176).

  • Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In IEEE International Conference on Computer Vision.

  • Wang, Z., Wang, J., Xiao, J., Lin, K. H., & Huang, T. (2012). Substructure and boundary modeling for continuous action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1330–1337).

  • Wei, P., Zhao, Y., Zheng, N., & Zhu, S. C. (2006). Modeling 4D human-object interactions for joint event segmentation, recognition, and object localization. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Weinland, D., Ronfard, R., & Boyer, E. (2006). Automatic discovery of action taxonomies from multiple views. IEEE Conference on Computer Vision and Pattern Recognition, 2, 1639–1645.

  • Weinland, D., Ronfard, R., & Boyer, E. (2011). A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115, 224–241.

  • Wimmer, R. (2011). Grasp sensing for human-computer interaction. In Proceedings of the Fifth International Conference on Tangible, Embedded, and Embodied Interaction (pp. 221–228).

  • Wörgötter, F., Agostini, A., Krüger, N., Shylo, N., & Porr, B. (2009). Cognitive agents: A procedural perspective relying on the predictability of object-action complexes (OACs). Robotics and Autonomous Systems, 57, 420–432.

  • Wörgötter, F., Aksoy, E. E., Krüger, N., Piater, J., Ude, A., & Tamosiunaite, M. (2013). A simple ontology of manipulation actions based on hand-object relations. IEEE Transactions on Autonomous Mental Development, 5, 117–134.

  • Wörgötter, F., Geib, C., Tamosiunaite, M., Aksoy, E. E., Piater, J., Hanchen, X., Ude, A., Nemec, B., Kraft, D., Krüger, N., Wächter, M., & Asfour, T. (2015). Structural bootstrapping: A novel concept for the fast acquisition of action-knowledge. IEEE Transactions on Autonomous Mental Development, 140–154.

  • Yamato, J., Ohya, J., & Ishii, K. (1992). Recognizing human action in time-sequential images using hidden Markov model. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 379–385).

  • Yang, Y., Fermüller, C., & Aloimonos, Y. (2013). Detection of manipulation action consequences (MAC). In International Conference on Computer Vision and Pattern Recognition (pp. 2563–2570).

  • Yang, S., Gao, Q., Liu, C., Xiong, C., & Chai, J. (2016). Grounded Semantic Role Labeling. In The 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2016).

  • Zhong, H., Shi, J., & Visontai, M. (2004). Detecting unusual activity in video. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 819–826).

  • Zhou, F., De la Torre Frade, F., & Hodgins, J. K. (2013). Hierarchical aligned cluster analysis for temporal clustering of human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35, 582–596.


Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 (Programme and Theme: ICT-2011.2.1, Cognitive Systems and Robotics) under Grant Agreement No. 600578, ACAT. We thank Seongyong Koo for sharing the MOT dataset (Koo et al. 2014) with us.

Author information

Corresponding author

Correspondence to Eren Erdal Aksoy.

Additional information

Communicated by M. Hebert.

Appendix: Manipulator Estimation

In Algorithm 1, we provide pseudocode describing the manipulator estimation process for single-hand manipulation actions. The algorithm computes, for each object \(s_{k}\) present in the event chain \(\xi\), a probability value \(p_{k}\) that quantifies how likely \(s_{k}\) is to be the manipulator. In Algorithm 1, n and m denote the numbers of rows and columns in \(\xi\), respectively. The algorithm first searches all rows involving the respective object \(s_{k}\) for the start and end time points of \([N, T, \cdots , T, N]\) sequences and then computes the normalized length of the touching relation T, which is assigned as the probability value. The manipulator is finally estimated as the object with the highest probability value.
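
To make the procedure concrete, a minimal Python sketch of this estimation step is given below. The event-chain layout (one relation row per pair of scene entities) and the rule used to combine scores from several rows (summation) are assumptions made for illustration; Algorithm 1 in the paper remains the authoritative description.

def estimate_manipulator(event_chain, row_objects):
    """Return the scene entity most likely to be the manipulator.

    event_chain : list of rows; each row is a list of relation symbols
                  ('N' = not touching, 'T' = touching) over the m key
                  frames (columns) of the event chain.
    row_objects : list of (entity_a, entity_b) pairs, one per row, naming
                  the two entities whose relation that row encodes.
    """
    scores = {}
    for row, pair in zip(event_chain, row_objects):
        m = len(row)
        t = 0
        while t < m:
            # Look for a [N, T, ..., T, N] episode: a touching phase that
            # both starts and ends within the observed action.
            if row[t] == 'T' and t > 0 and row[t - 1] == 'N':
                start = t
                while t < m and row[t] == 'T':
                    t += 1
                if t < m and row[t] == 'N':
                    # Normalized length of the touching phase.
                    p = (t - start) / float(m)
                    for entity in pair:
                        scores[entity] = scores.get(entity, 0.0) + p
            else:
                t += 1
    # The manipulator is the entity with the highest accumulated probability.
    return max(scores, key=scores.get) if scores else None

With the summation used in this sketch, an entity that repeatedly establishes and releases contact with different objects over a long task, as a hand does, collects a contribution from every row it appears in, whereas each manipulated object only scores from its own contact episodes.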

Cite this article

Aksoy, E.E., Orhan, A. & Wörgötter, F. Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences. Int J Comput Vis 122, 84–115 (2017). https://doi.org/10.1007/s11263-016-0956-8
