Helping a robot to understand human actions and objects: a grammatical view

Humans are able to perform a wide variety of complex actions and to manipulate a very large number of objects. We can predict the outcomes of our actions and how different objects can be used. Hence, we have excellent action and object understanding. Artificial agents, on the other hand, still fail miserably in this respect. It is particularly puzzling how inexperienced, young humans can acquire such knowledge, bootstrapped by exploration and extended by supervision. In this study we have therefore addressed the question of how to structure the realm of actions and objects into dynamic representations that allow for the easy learning of different action and object concepts. Using robots that perform different manipulation actions on a table top (e.g. the actions involved in "making breakfast"), we show that this indeed leads to a kind of implicit (un-reflected) understanding of action and object concepts, allowing the agent to generalize actions and redefine object uses according to need.


Introduction
Why are humans so good at interpreting actions performed by others, and why do we also have such an excellent understanding of how to use the different objects in our world? These questions are puzzling from the perspective of cognitive science but also from a robotics viewpoint. Action and object understanding seems an insurmountable problem: the same action can be performed in so many different ways! Why does a two-year-old know that his/her mother is making a sandwich regardless of how, when and where this happens? Furthermore, there are so many objects in the world! Why, then, does the same child (maybe a bit older) know exactly what knives and forks are good for, given the vast number of different types of such items? For example, he/she will correctly infer, with great likelihood, that a never-before-seen three-pronged fork is just "another fork" and nothing else. Robots cannot really do any of this.
Here, we would like to discuss a grammatical approach towards action and object understanding. We posit that actions and objects have their own grammar, which acts as a scaffold for our understanding, for generative modeling, and for predicting the function of unknown actions and objects. We will focus on human manipulation actions and on simple everyday objects, but this analysis is not restricted to those.
Let us first summarize the main ideas before supporting them with some theoretical and experimental analyses. First, we claim that action understanding does not require object knowledge. Objects do not even have to be present to understand an action: just think of a pantomime! Our central assumption is that actions can be understood solely from the changing spatio-temporal relations between the entities involved (which are indeed objects; we simply do not need to know which ones). The main temporal structure of a manipulation action is given by the touching and un-touching events between those entities. This, however, is not yet enough. In addition, we observe that between two such (touching or un-touching) events there are phases of motion of those entities (e.g., approach, distancing, conjoint motion, etc.). These phases finally clarify which action it was.
Detailed, quantitative knowledge about movement shapes and trajectories is not needed to classify an action. Furthermore, the action does not even have to be completely finished to arrive at a reliable classification. Thus, this framework is predictive.
Hence, following this, there is indeed a clear sequential structure present in our actions, in which motion phases and touching/un-touching events represent the structuring elements. Thus, we posit that this represents the syntax and grammar of actions [1,2]. But what about "objects"? Objects seem far more diverse and unstructured than actions. Here, however, we can fall back on a rather old idea: in 1987, Biederman [3] suggested that all objects can be constructed from only a few elementary geometrical entities, which he called Geons. Later, Rivlin et al. [4] took this idea further and described objects by part graphs, where every node in such a graph represents a part (a Geon) of an object, and graph edges indicate that two parts are connected to each other. The idea had great appeal, but at that time there were no means to extract parts (Geons) from computer vision data of an object. Furthermore, a non-annotated part graph of this kind cannot describe an object very well, because it also matters very much how the parts are connected. Hence, it is required to annotate the graph edges with so-called "relative pose" information, which represents the relative alignment and attachment geometry of the connected parts. Under these constraints the annotated part-connectivity graph can, however, indeed be considered (yet again) as a syntactical structure, in which certain connectivity rules may exist, constituting a "grammar of objects" (see also [5,6]).
In the following, we will present some methods and results supporting the notion that such a grammatical view of actions and objects is useful when trying to make a robot understand the world.

Actions
We use 3D point-cloud data that are first segmented by computer vision methods [7,8] into objects and object parts. Figure 1 (top) shows such a segmentation, color-coded, for several frames from a movie of a cutting action. All parts are then tracked consistently throughout the movie, but we do not perform any recognition process except for the "Hand". Thus, objects have only abstract roles: "Main" (M) is the object first touched by the hand, "Primary" (P) is the object from which Main first un-touches, and "Secondary" (S) is the object first touched by Main. The large table in Fig. 1 shows at its top the temporal sequence of how any two such objects touch ("T") or un-touch ("N"), where "U" means that the relation is still unknown. Touching and un-touching events provide the main temporal structure of the action. In addition, we define a set of eleven static and six dynamic relations that can exist between two objects, for example, static: "right", "left", "in front of", etc.; dynamic: "getting closer", "moving together", etc. The sub-tables below show how these relations change in the course of an action. This tabular action representation is called an Extended Semantic Event Chain (ESEC [9]); a minimal sketch of how such an event chain can be built from frame-wise touching predicates is given below.
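The following is a minimal sketch of the touching/un-touching backbone of an ESEC, assuming that per-frame touching predicates between the tracked entities (Hand, Main, Primary, Secondary) are already available from the vision pipeline; the function names, entity pairs, and toy frames are illustrative and not taken from [9].

```python
# Sketch of an Extended-Semantic-Event-Chain-like structure
# (names and entity pairs are illustrative, not taken from [9]).
# We assume per-frame boolean touching predicates between the tracked
# entities Hand (H), Main (M), Primary (P) and Secondary (S) are given.

ENTITY_PAIRS = [("H", "M"), ("M", "P"), ("M", "S"), ("H", "S")]

def relation(touching):
    """Map a boolean/None touching predicate to the ESEC symbols."""
    if touching is None:
        return "U"          # relation not (yet) observable
    return "T" if touching else "N"

def build_event_chain(frames):
    """Keep only frames where at least one pair changes its T/N/U symbol,
    so the chain encodes the touching/un-touching *events* of the action."""
    chain, previous = [], None
    for frame in frames:                      # frame: dict pair -> bool/None
        column = {p: relation(frame.get(p)) for p in ENTITY_PAIRS}
        if column != previous:                # a touching/un-touching event
            chain.append(column)
            previous = column
    return chain

# Toy example: hand grasps Main, Main un-touches Primary (its support),
# then Main touches Secondary -- the skeleton of many manipulations.
frames = [
    {("H", "M"): False, ("M", "P"): True,  ("M", "S"): False, ("H", "S"): False},
    {("H", "M"): True,  ("M", "P"): True,  ("M", "S"): False, ("H", "S"): False},
    {("H", "M"): True,  ("M", "P"): False, ("M", "S"): False, ("H", "S"): False},
    {("H", "M"): True,  ("M", "P"): False, ("M", "S"): True,  ("H", "S"): False},
]
for i, column in enumerate(build_event_chain(frames)):
    print(i, column)
```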
We have analyzed 32 different manipulation actions, as shown in Fig. 2. These can be grouped into different super-classes (top), such as actions for destroying or for rearranging things. The same color code is used in the bottom panel, which shows a decision tree. Because of the sequential column structure of the ESECs, the specific combination of object-object relations that changes from column to column allows actions to be recognized after only a few columns; at most seven columns are needed to recognize any action. The tree also shows that cutting (e.g. Fig. 1) can be recognized already at the moment when the fourth column of the ESEC emerges, hence far before the actual action has ended. In summary, we found that, on average, these actions are recognized after only 45.06% (SD = 6%) of their total completion time. Hence, we can predict all manipulation actions before they are half-finished. The sketch below illustrates this predictive, column-by-column recognition.
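As an illustration of the predictive character of the representation, the sketch below compares the columns observed so far against stored model chains and commits as soon as only one model remains compatible. The actual classifier in the paper is a decision tree over relation changes; this prefix-filtering variant, with made-up model chains, only conveys the idea.

```python
# Illustrative sketch of predictive recognition over event-chain columns:
# after every new column we discard stored action models whose chain does
# not start with the observed prefix, and commit once one model remains.
# (Model chains below are invented toy data, not the ESECs from the paper.)

def predict_action(observed_columns, model_chains):
    candidates = dict(model_chains)           # action name -> full column chain
    for t, column in enumerate(observed_columns, start=1):
        candidates = {name: chain for name, chain in candidates.items()
                      if len(chain) >= t and chain[t - 1] == column}
        if len(candidates) == 1:
            return next(iter(candidates)), t  # recognized after t columns
    return None, len(observed_columns)        # still ambiguous

models = {
    "cut":  ["NN", "TN", "TT", "TN"],
    "push": ["NN", "TN", "TN", "TT"],
}
action, t = predict_action(["NN", "TN", "TT"], models)
print(action, "recognized after", t, "columns")   # -> cut recognized after 3 columns
```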

Objects
To arrive at object connectivity graphs we use the methods described in detail in [10]; here we can only give a short summary and leave all equations out. First, we describe each object part by a so-called "part signature" (Fig. 3a). For this, we normalize the maximal distance d between points in the point cloud to one and determine the surface normal at each point. Then, for all inter-point distances Δd (distances from every point to every other point), we accumulate the angular differences Δα of the corresponding surface normals. As a result, every part gets its own characteristic part signature (Fig. 3a). Next, we determine how parts are attached and aligned with respect to each other. For this, we represent each part by its variance blob V. The variance blob represents the variance of the point distribution of a part in all possible directions, normalized to one. These variance blobs can be visualized as spherical plots, as shown in Fig. 3b, where the variance is denoted by the radius of the surface from the origin. In the case of constant variance this results in a constant radius in all directions, thus in a perfect sphere. A variance blob represents the variance in the Euclidean space of the point cloud, which is why the aspect ratio (the elongation) of a part is directly visible in the elongation of its variance blob. Exact shape detail (round, cylindrical, etc.) is ignored; remember, this is represented by the part signature (see Fig. 3a). A rough sketch of both descriptors follows below.
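The following is a rough NumPy sketch of the two descriptors as described above: a 2D histogram over normalized inter-point distances Δd and angular differences Δα of the surface normals, and a variance blob as the directional variance of the points over a set of sphere directions, normalized to one. The binning resolution, the direction sampling, and the toy data are assumptions, not values from [10].

```python
import numpy as np

def part_signature(points, normals, bins=20):
    """2-D histogram over (normalized inter-point distance, angle between
    surface normals) for all point pairs of one part.  Binning resolution
    and normalization details are assumptions, not taken from [10]."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    dists = dists / dists.max()                         # normalize max distance to 1
    cosang = np.clip(normals @ normals.T, -1.0, 1.0)
    angles = np.arccos(cosang)                          # Δα in [0, π]
    iu = np.triu_indices(len(points), k=1)              # each unordered pair once
    hist, _, _ = np.histogram2d(dists[iu], angles[iu],
                                bins=bins, range=[[0, 1], [0, np.pi]])
    return hist / hist.sum()

def variance_blob(points, directions):
    """Variance of the part's point distribution along a set of unit
    directions (the 'bins' of the spherical plot), normalized to sum one."""
    centered = points - points.mean(axis=0)
    proj = centered @ directions.T                      # projection per direction
    var = proj.var(axis=0)
    return var / var.sum()

# Toy usage on a random elongated "part" (synthetic data)
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 1.0])     # elongated along x
nrm = rng.normal(size=(200, 3)); nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
dirs = rng.normal(size=(64, 3)); dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(part_signature(pts, nrm).shape, variance_blob(pts, dirs)[:4])
```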
To calculate the alignment between parts A and B (AL_AB), we use the histogram intersection similarity V_AB between V_A and V_B (see [10] for details). This is done by intersecting each bin of variance blob A with the corresponding bin of variance blob B. Finally, we calculate the alignment number AL_AB as the sum of the bins of the intersection blob. Note that AL_AB = AL_BA. In addition, the attachment number AT_AB reflects at which location of part A part B is connected. For this, we calculate the vector r_AB connecting centroid A to centroid B. Then, we retrieve the value of the variance blob V_A in the direction of r_AB. Since V is normalized, the values of single bins scale reciprocally with the total number of bins; this is why we multiply the result by the total number of bins to obtain AT_AB. Note that AT_AB ≠ AT_BA, and we always calculate both. Figure 3c shows that AL remains essentially constant, whereas AT changes strongly, for three objects that consist of the same two parts A and B. Figure 4 shows the final object connectivity graph for one example (artificial) object. A sketch of both edge numbers is given below.
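The sketch below computes the two edge annotations under the assumption that both variance blobs are sampled over the same set of sphere directions (as in the previous sketch); the bin-lookup for the attachment direction is a simplification of whatever interpolation [10] actually uses.

```python
import numpy as np

def alignment(VA, VB):
    """AL_AB: sum of the bin-wise intersection (minimum) of the two
    normalized variance blobs; symmetric by construction (AL_AB = AL_BA)."""
    return np.minimum(VA, VB).sum()

def attachment(VA, directions, centroid_A, centroid_B):
    """AT_AB: value of variance blob A in the direction of the vector from
    centroid A to centroid B, multiplied by the number of bins so that the
    result does not shrink with finer binning (AT_AB != AT_BA in general)."""
    r_ab = centroid_B - centroid_A
    r_ab = r_ab / np.linalg.norm(r_ab)
    bin_idx = np.argmax(directions @ r_ab)   # nearest direction bin (simplification)
    return VA[bin_idx] * len(VA)

# Toy usage with two synthetic blobs over the same 64 directions
rng = np.random.default_rng(1)
dirs = rng.normal(size=(64, 3)); dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
VA = rng.random(64); VA /= VA.sum()
VB = rng.random(64); VB /= VB.sum()
print(alignment(VA, VB),
      attachment(VA, dirs, np.zeros(3), np.array([1.0, 0.0, 0.0])))
```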
After arriving at the object connectivity graphs, we can learn possible object-to-function associations. During training, we provide a conventional small neural network with example graphs and manually associate with them the potential function of the corresponding object. We have focused here on simple objects (tools!) whose shapes relate to those of our human hands, like "borer", "ladle", "blade", "hammer", etc. [10]. In the test phase, the network produces probability values (in percent) for a certain function. Figure 5 shows several objects and their corresponding highest-probability function. Interestingly, the network sometimes produces high values for several functions, as for the hammer in the bottom right panel, which the network associates with the "hammer" but also with a "poker" function.
Humans often do the very same thing and "abuse" an object, as in the case of this hammer, where you might consider grasping it at the hammer head to use the stick (the hammer handle) as a poker in your fireplace. The use of skulls for
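As an illustration of the object-to-function training and test steps, the sketch below uses a small multilayer perceptron (scikit-learn here, purely for illustration) fed with fixed-length feature vectors assumed to be derived from the connectivity graphs (e.g., part signatures plus the AL/AT edge annotations) and with manually assigned function labels. The feature encoding, network architecture, and data are placeholders; the actual network used in [10] may differ.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Illustration only: we assume each object graph has already been flattened
# into a fixed-length feature vector (part signatures + AL/AT edge labels).
# Features and labels below are synthetic placeholders, not data from [10].
rng = np.random.default_rng(0)
X_train = rng.random((40, 32))               # 40 example graphs, 32 features each
y_train = np.arange(40) % 4                  # labels: 0=blade, 1=ladle, 2=hammer, 3=borer

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

x_test = rng.random((1, 32))                 # feature vector of an unseen object graph
probs = clf.predict_proba(x_test)[0]         # probability per function class
for name, p in zip(["blade", "ladle", "hammer", "borer"], probs):
    print(f"{name}: {100 * p:.1f}%")
```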

Conclusions
The current study showed that it is possible to capture the semantics of human actions and of the involved objects by leveraging a grammatical perspective. This may also help robots to improve their interactions with the world and with us.

[Displaced figure captions. Fig. 2 (excerpt), actions 29-32 are pouring variants: from a container onto the ground or into another container, with the liquid either first un-touching the source container and then touching the target, or touching source and target at the same time; colors correspond to the action categories shown above. Fig. 3: Arriving at an object graph. (a) Part-signature histograms (color represents counts) for three example parts; Δd is the normalized inter-point distance, Δα the angular difference between surface normals. (b) Alignment AL between parts A and B for the four hammer-like objects shown on the left; colored "balls" represent the variance blobs (see text) of the parts and their histogram intersection, from which AL is calculated. (c) Alignment and attachment numbers for three example objects.]