1 Introduction

Why are humans so good at interpreting actions performed by others, and why do we also have such an excellent understanding of the use of the different objects in our world? These questions are puzzling from the perspective of cognitive science but also from a robotics viewpoint. Action and object understanding seems an insurmountable problem: The same action can be performed in so many different ways! Why does a 2-year-old know that his/her mother is making a sandwich regardless of how, when and where this happens? Furthermore, there are so many objects in existence! Why does the same child, maybe a bit older, know exactly what knives and forks are good for, given the vast number of different types of such items? For example, he/she will correctly infer, with great likelihood, that a never-before-seen three-pronged fork is just “another fork” and nothing else. Robots cannot really do any of this.

Here, we would like to discuss a grammatical approach towards action- and object-understanding. We posit that actions and objects have their own grammar, which acts as a scaffold for our understanding and for the generative modeling and the prediction of the function of unknown actions and objects. We will focus on human manipulation actions and on simple everyday objects, but this analysis is not restricted to those.

Let us first summarize the main ideas before we support them with some theoretical and experimental analyses. In the first place, we claim that action understanding does not require object knowledge. Objects do not even have to be present to understand an action: Just think of a pantomime! Our central assumption is that actions can be understood just from the changing spatio-temporal relations between the entities present in the scene (which are indeed objects, only we do not care to know which ones). The main temporal structure of a manipulation action is given by the touching and un-touching events between these entities. This, however, is not yet enough. In addition, we observe that, between two such (touching or un-touching) events, there are phases of motion of those entities (e.g., approach, distancing, conjoint motion, etc.). These phases finally clarify which action it was. Detailed, quantitative knowledge about movement shapes and trajectories is thereby not needed to classify an action. Furthermore, the action does not even have to be completely finished to arrive at a reliable classification. Thus, this framework is predictive.

Hence, following this, there is indeed a clear sequential structure present in our actions, where motion phases and touching/un-touching events represent the structuring elements. Thus, we posit that this represents the syntax and grammar of actions [1, 2]. But what about “objects”? Objects seem far more diverse and unstructured than actions. Here, however, we can fall back onto a rather old idea: In 1987, Biederman [3] suggested that all objects can be constructed from only a few elementary geometrical entities, which he called Geons. Later, Rivlin et al. [4] took this idea further and described objects by part graphs, where every node in such a graph represents a part (a Geon) of an object, and graph edges indicate that two parts are connected to each other. The idea had great appeal, but at that time, there were no means to extract parts (Geons) of an object from computer vision data. Furthermore, an un-annotated part graph of this kind cannot describe an object very well, because it also matters how the parts are connected. Hence, the graph edges have to be annotated with so-called “relative pose” information, which represents the relative alignment and attachment geometry of the connected parts. Under these constraints, the annotated part-connectivity graph can, however, again be considered as a syntactical structure, where certain connectivity rules may exist, constituting a “grammar of objects” (see also [5, 6]).

In the following, we will present some methods and results supporting the notion that such a grammatical view onto actions and objects is useful from the perspective of trying to make a robot understand the world.

2 Methods and results

2.1 Actions

We use 3D point-cloud data that are first segmented by computer vision methods [7, 8] into objects and object parts. The top of Fig. 1 shows, in color code, such a segmentation performed on several frames from a movie of a cutting action. All these parts are then tracked consistently along the movie, but we do not perform any recognition process, except for the “Hand”. Thus, objects are given only abstract roles: “Main” (M) is the object first touched by the hand, “Primary” (P) is the one from which Main first un-touches, and “Secondary” (S) is the one first touched by Main. The large table in Fig. 1 shows on top the temporal sequence of how any two such objects touch (“T”) or un-touch (“N”), where “U” means that the relation is still unknown. Touching and un-touching events represent the main temporal action structure. In addition to this, we define a set of 11 static and 6 dynamic relations that can exist between two objects, for example, static: “right”, “left”, “in front of”, etc.; dynamic: “getting closer”, “moving together”, etc. The sub-tables below show how these relations change in the course of an action. This tabular action representation is called an Extended Semantic Event Chain (eSEC [9]).

Fig. 1

Top: several frames of a movie of a cutting action and the corresponding 3D scene segmentation (here rendered as colored and arbitrarily numbered image segments). Only five frames are shown. Bottom: Extended Semantic Event Chain (eSEC) table representing 14 frames (columns, yellow). A new column is added to the eSEC whenever at least one touching, static, or dynamic relation between the objects in an action changes. Thus, columns represent temporal chunks (of unequal length) given by change events. Objects are: H = Hand, M = Main, P = Primary, S = Secondary Object; T = touching, N = un-touching, U = unknown. We define here the following types of static spatial relations: ‘‘Above’’ (Ab), ‘‘Below’’ (Be), ‘‘Top’’ (To), ‘‘Inside’’ (In), and the following dynamic spatial relations: ‘‘Moving Together’’ (MT), ‘‘Halting Together’’ (HT), ‘‘Fixed-Moving Together’’ (FMT), ‘‘Getting Close’’ (GC), ‘‘Moving Apart’’ (MA) and ‘‘Stable’’ (S). X means that an object was broken apart; Q and O refer to objects that are too far away for defining clear relations. White part: touching/un-touching events; green: static relation changes; blue: dynamic relation changes
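To make the tabular representation concrete, the following is a minimal sketch of how an eSEC-like table could be assembled from per-frame relations. The relation symbols follow Fig. 1; the function and variable names (e.g., build_esec) are our own illustrative choices and not the implementation of [9].

```python
# Minimal sketch (illustration only, not the code of [9]): an eSEC-like table
# gets a new column whenever at least one touching, static, or dynamic
# relation between two objects changes.

from typing import Dict, List, Tuple

Pair = Tuple[str, str]                         # e.g. ("H", "M") for Hand-Main
Relations = Dict[Pair, Tuple[str, str, str]]   # (touching, static, dynamic)

def build_esec(per_frame_relations: List[Relations]) -> List[Relations]:
    """Collapse per-frame relations into eSEC columns (change events)."""
    columns: List[Relations] = []
    for rel in per_frame_relations:
        if not columns or rel != columns[-1]:  # a change event starts a new column
            columns.append(rel)
    return columns

# Toy example: the hand approaches Main ("GC"), then touches it ("T", "MT").
frames = [
    {("H", "M"): ("U", "Ab", "GC")},
    {("H", "M"): ("U", "Ab", "GC")},   # no change -> no new column
    {("H", "M"): ("T", "Ab", "MT")},   # touching event -> new column
]
print(len(build_esec(frames)))  # 2 columns
```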

We have performed an analysis of 32 different manipulation actions, as shown in Fig. 2. These can be grouped into different super-classes (top), like actions for destroying or for rearranging things. The same color code is used in the bottom panel, which shows a decision tree. Because of the sequential column structure of the eSECs, the specific combination of object-object relations that changes from column to column allows actions to be recognized after only a few columns. At most 7 columns are needed to recognize any action. The tree also shows that cutting (e.g., Fig. 1) can be recognized as soon as the fourth column of the eSEC emerges, hence far before the actual action has ended. In summary, we found that, on average, these actions are recognized after only 45.06% (SD = 6%) of their total completion time. Hence, we can predict all manipulation actions before they are half-finished.

Fig. 2

Action categories (top) and sequential unraveling of the different actions (bottom) along the timeline (vertical arrow) of the emerging eSEC columns. Actions: 1: Lay; 2: Simple push/pull; 3: Stir; 4: Lever; 5: Hit/Flick; 6: Poke; 7: Bore/Rub/Rotate; 8: Knead; 9: Push from x to y; 10: Cut/Chop/Scissor cut/Squash/Scratch; 11: Draw; 12: Scoop; 13: Push apart; 14: Break/Rip-off; 15: Uncover by Pick and place; 16: Uncover by Push; 17: Push together; 18: Put over; 19: Push over; 20: Pick and place; 21: Take and invert; 22: Shake; 23: Rotate align; 24: Take down; 25: Push down; 26: Put inside; 27: Put on top; 28: Push on top; 29: Pour from a container onto the ground, where the liquid first un-touches the container and then touches the ground; 30: Pour from a container onto the ground, where the liquid can touch the container and the ground at the same time; 31: Pour from a container into another container, where the liquid first un-touches the first container and then touches the other container; 32: Pour from a container into another container, where the liquid can touch both containers at the same time. Colors correspond to action categories as shown above
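The predictive recognition can be sketched as a simple prefix-matching procedure: as eSEC columns arrive one by one, candidate actions whose model tables disagree with the observed prefix are discarded, and recognition is declared as soon as only one candidate remains. This is a minimal illustration under our own simplifying assumption that action models are stored as complete column sequences; it is not the decision-tree implementation behind Fig. 2.

```python
# Minimal sketch of predictive, column-by-column action recognition.
# Action models are stored as full eSEC column sequences; an observed action
# is recognized once its column prefix matches only one model.
# (Illustration only; the study itself uses a decision tree, cf. Fig. 2.)

from typing import Dict, List, Optional

Column = str  # a column is abbreviated here to a string code

def recognize_online(observed: List[Column],
                     models: Dict[str, List[Column]]) -> Optional[str]:
    """Return the action name as soon as the observed prefix is unambiguous."""
    candidates = dict(models)
    for i, col in enumerate(observed):
        candidates = {name: cols for name, cols in candidates.items()
                      if i < len(cols) and cols[i] == col}
        if len(candidates) == 1:          # unambiguous before the action ends
            return next(iter(candidates))
    return None

models = {
    "cut":            ["c1", "c2", "c3", "c4a", "c5"],
    "pick_and_place": ["c1", "c2", "c3", "c4b", "c5"],
}
# The fourth column already disambiguates "cut" from "pick_and_place".
print(recognize_online(["c1", "c2", "c3", "c4a"], models))  # -> "cut"
```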

2.2 Objects

To arrive at object connectivity graphs, we use methods described in detail in [10]; here we can give only a short summary, leaving out all equations. First, we describe each object part by a so-called “part signature” (Fig. 3a). For this, we normalize the maximal distance d between points in the point cloud to one and determine the surface normals at each point. Then, for all inter-point distances Δd (distances from every point to every other point), we calculate the angular differences Δα between the corresponding surface normals and accumulate these pairs into a distance-angle histogram. As a result, every part gets its own characteristic part signature (Fig. 3a). Next, we determine how parts are attached and aligned with respect to each other. For this, we represent each part by its variance blob V. The variance blob represents the variance of the point distribution of a part in all possible directions, normalized to one. These variance blobs can be visualized by spherical plots, as shown in Fig. 3b, where the variance is denoted by the radius of the surface from the origin. In the case of constant variance, this results in a constant radius in all directions, thus in a perfect sphere. A variance blob represents the variance in the Euclidean space of the point cloud; this is why the aspect ratio (the elongation) of a part is directly visible in the elongation of its variance blob. Exact shape detail (like round, cylindrical, etc.) is ignored; remember that this is represented by the part signature (see Fig. 3a).

Fig. 3

Arriving at an object graph. a Part-signature histograms (color represents counts) for three example parts (top). Δd is the normalized inner distance between points, Δα is the angular difference between surface normals. b Alignment AL between parts A and B for the four hammer-like objects shown on the left side. Colored “balls” represent the Variance Blobs (see text) of the parts and their histogram intersection, from which AL is calculated. c Alignment and attachment numbers for three example objects
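As an illustration, the following sketch computes a distance-angle part signature and a variance blob for a part given as a point cloud with normals. The binning choices and names (part_signature, variance_blob, the sampled directions) are our own simplifications, not the exact procedure of [10].

```python
# Minimal sketch (simplified from the description of [10]): a part signature
# as a 2D histogram over normalized point distances and normal-angle
# differences, and a variance blob as directional variance on the sphere.

import numpy as np

def part_signature(points, normals, bins=20):
    """2D histogram over (normalized inter-point distance, normal angle)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    dist = dist / dist.max()                      # normalize max distance to 1
    cosang = np.clip(normals @ normals.T, -1.0, 1.0)
    ang = np.arccos(cosang)                       # angle between surface normals
    iu = np.triu_indices(len(points), k=1)        # all unordered point pairs
    hist, _, _ = np.histogram2d(dist[iu], ang[iu], bins=bins,
                                range=[[0, 1], [0, np.pi]])
    return hist

def variance_blob(points, directions):
    """Variance of the centered point cloud along given unit directions,
    normalized so that the bins sum to one."""
    centered = points - points.mean(axis=0)
    var = np.var(centered @ directions.T, axis=0)  # variance per direction
    return var / var.sum()

# Toy usage with random data and a few sampled directions on the sphere.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 1.0])   # elongated part
nrm = rng.normal(size=(200, 3)); nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
dirs = rng.normal(size=(64, 3)); dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
sig, blob = part_signature(pts, nrm), variance_blob(pts, dirs)
```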

To calculate the alignment between parts A and B (ALAB), we use the histogram intersection similarity VAB between VA and VB (see [10] for details). This is done by calculating the intersection (bin-wise minimum) of each bin of variance blob A with the corresponding bin of variance blob B. Finally, we calculate the alignment number ALAB as the sum over the bins of the intersection blob. Note that ALAB = ALBA. In addition to this, the attachment number ATAB reflects at which location of part A part B is connected. For this, we calculate the vector rAB connecting centroid A to centroid B. Then, we retrieve the value of the variance blob VA in the direction of rAB. Since V is normalized, the values of single bins scale reciprocally with the total number of bins; this is why we multiply the result by the total number of bins to get ATAB. Note that ATAB ≠ ATBA, and we always calculate both. Figure 3c shows, for three objects that consist of the same two parts A and B, that AL remains essentially constant whereas AT changes markedly. Figure 4 shows the final object connectivity graph for one example (artificial) object.

Fig. 4

Left: test object with parts A, B, C. Right: object connectivity graph (black disks are nodes, lines are edges, dashed lines mean that those parts are not physically connected). Intersection histograms: VAB, VAC, VBC. Values for AL are ALAB = 0.76, ALAC = 0.99, ALBC = 0.77. AT = 0 stands for “no contact”; other AT values are left out to save space
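The edge annotations can be computed directly from the variance blobs introduced above. The following sketch implements the alignment number AL as a histogram intersection and the attachment number AT as the blob value in the direction of the connecting centroid vector, using our own simplified discretization (the sampled directions and helper names are assumptions, not the code of [10]).

```python
# Minimal sketch of the edge annotations AL (alignment) and AT (attachment)
# between two connected parts, based on their variance blobs.
# Discretization and names are our own simplification of [10].

import numpy as np

def alignment(blob_a, blob_b):
    """AL_AB: sum of the bin-wise minimum (histogram intersection).
    Symmetric: AL_AB == AL_BA."""
    return float(np.minimum(blob_a, blob_b).sum())

def attachment(blob_a, directions, centroid_a, centroid_b):
    """AT_AB: value of blob A in the direction of the centroid vector r_AB,
    rescaled by the number of bins (blobs are normalized to sum to one).
    Not symmetric: AT_AB != AT_BA in general."""
    r_ab = centroid_b - centroid_a
    r_ab = r_ab / np.linalg.norm(r_ab)
    nearest_bin = int(np.argmax(directions @ r_ab))   # closest sampled direction
    return float(blob_a[nearest_bin] * len(blob_a))

# Toy usage with randomly generated, normalized variance blobs.
rng = np.random.default_rng(1)
dirs = rng.normal(size=(64, 3)); dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
blob_a = rng.random(64); blob_a /= blob_a.sum()
blob_b = rng.random(64); blob_b /= blob_b.sum()
al_ab = alignment(blob_a, blob_b)
at_ab = attachment(blob_a, dirs, np.zeros(3), np.array([1.0, 0.0, 0.0]))
```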

After arriving at the object connectivity graphs, we can now learn possible object-to-function associations. During training, we provide a conventional small neural network with example graphs and manually associate with each graph the potential function of the corresponding object. We have focused here on simple objects (tools!) that relate to the shapes of our human hands, like “borer”, “ladle”, “blade”, “hammer”, etc. [10]. In the test phase, the network produces probability values (in percent) for a certain function. Figure 5 shows several objects and their corresponding highest-probability function. Interestingly, the network sometimes produces high values for several functions, as for the hammer in the bottom right panel, which the network associates with the “hammer” but also with a “poker” function.

Fig. 5

Example of object-to-function associations for different objects
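To illustrate how such an association could be learned, here is a minimal sketch in which each object graph is flattened into a fixed-length feature vector (part signatures plus AL/AT edge annotations) and fed to a small multi-layer perceptron that outputs function probabilities. The feature encoding, function labels, network size, and the use of scikit-learn are our own assumptions, not the setup of [10].

```python
# Minimal sketch: learning object-to-function associations from fixed-length
# graph descriptors with a small neural network. Feature encoding, labels,
# and network size are illustrative assumptions, not the setup of [10].

import numpy as np
from sklearn.neural_network import MLPClassifier

FUNCTIONS = ["borer", "ladle", "blade", "hammer", "poker"]

def graph_descriptor(part_signatures, edge_annotations, length=128):
    """Flatten part signatures and (AL, AT) edge annotations into one vector,
    padded/truncated to a fixed length so all objects are comparable."""
    flat = np.concatenate([np.ravel(s) for s in part_signatures]
                          + [np.asarray(e, dtype=float) for e in edge_annotations])
    out = np.zeros(length)
    out[:min(length, flat.size)] = flat[:length]
    return out

# Toy training data: random vectors standing in for graph_descriptor() outputs.
rng = np.random.default_rng(2)
X = rng.random((60, 128))
y = rng.integers(0, len(FUNCTIONS), size=60)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X, y)

# Probabilities for one test object; several functions may score high,
# e.g. a hammer-like object also getting a sizable "poker" probability.
probs = net.predict_proba(X[:1])[0]
print({FUNCTIONS[c]: round(100 * p, 1) for c, p in zip(net.classes_, probs)})
```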

Humans often do the very same and “abuse” an object, as in the case of this hammer: you might consider grasping it at the hammer head to use the stick (the hammer handle) as a poker in your fireplace. The use of skulls for drinking has been reported for ancient Germanic tribes, who allegedly used the skulls of their enemies to quaff their mead.

3 Conclusions

The current study showed that it is possible to capture the semantics of human actions and of the involved objects by leveraging a grammatical perspective. This may also help robots improve their interactions with the world and with us.