Linking human motions and objects to language for synthesizing action sentences

This paper proposes a novel framework for generating action descriptions from human whole body motions and objects to be manipulated. This generation is based on three modules: the first module categorizes human motions and objects; the second module associates the motion and object categories with words; and the third module extracts a sentence structure as word sequences. Human motions and objects to be manipulated are classified into categories in the first module, then words highly relevant to the motion and object categories are generated from the second module, and finally the words are converted into sentences in the form of word sequences by the third module. The motions and objects along with the relations among the motions, objects, and words are parametrized stochastically by the first and second modules. The sentence structures are parametrized from a dataset of word sequences in a dynamical system by the third module. The link of the stochastic representation of the motions, objects, and words with the dynamical representation of the sentences allows for synthesizing sentences descriptive to human actions. We tested our proposed method on synthesizing action descriptions for a human action dataset captured by an RGB-D sensor, and demonstrated its validity.


Introduction
The demographic trend in advanced countries is that the percentage of elderly people is increasing, even as the total population is shrinking. The availability of comprehensive nursing services to provide living support for elderly people to improve per-capita productivity by covering for labor force shortages are important problems. The use of humanoid robots is expected to address these problems. Because humanoid robots are similar in shape to humans, they can perform human-like actions, and there is no need to adapt the environment from what suits humans. This allows humanoid robots to be used as a replacement labor force for humans for everyday tasks.
Research and development into humanoid robots has been actively pursued in a variety of fields in recent years (Sugano and Kato 1987;Kuroki et al. 2003;Kaneko et al. 2002;Cheng et al. 2007). Although this research has focused on increasing the integration density and accuracy of hardware technology, other elements are essential to constructing intelligent humanoid robots: software for obtaining external information corresponding to the five human senses (sight, sound, touch, taste, and smell), perceiving by using the obtained information, and controlling the motion of the robot. Developments in these areas are expected to bring value for replacement labor force, communication with humans, and information processing systems that exceed the capabilities of humans. Development of not only the hardware but also the software (here, considered as the intelligence of the robots) is an important element that is expected to create new value in robots.
Research into artificial intelligence that is able to perform advanced information processing like a human involves a variety of academic topics beyond robot engineering, including also linguistics, semiotics, anthropology, psychology, neuroscience, brain science, and sociology. The key difference between human intelligence and that of other animals is said to be the ability to use language that incorporates advanced symbols. Humans have acquired language through evolutionary processes. For example, in the phrase gread a book,h the concept of a book is expressed by the word gbook.h, and the concept of a particular bodily action is expressed by the word gread.h Phenomena in the real world can be expressed in this way through language. Human intelligence is built on a symbolic system. Because of this, we are able to think specifically about actions, understand abstract concepts, and share the ideas of other people by using symbols. The information processing of symbols and language forms the foundation of the advanced intelligence expressed in the computations of humans.
Ferdinand de Saussure described the composition of language as using "signe" (sign), "signifiant" (signifier), and "signifie" (signified) (Saussure 1966). The signifier is the symbolic representation of the specified item, the signified is the content of the sign that represents the specified concept, and the relation between the signifier and signified gives rise to signs. In the example above, the word "book" is the signifier, and this word signifies the book that exists in the real world. The book itself is the signified. The ability to manipulate signs and real world phenomena is improved and language is developed by arbitrarily forming associations between signs and the real world.
In brain science, Rizzolatti et al. (2001) discovered the existence of a set of neurons (mirror neurons) in the brains of Macaque monkeys that fire when the behavior of another is observed and when movement is performed by oneself. Mirror neurons are related to the generation and recognition of motion, but have also been found in the Brocafs area, which is responsible for the language processing in humans. This implies a relation among the mirror neuron system, the generation and recognition of motions, and the language.
Based on this knowledge, imitation learning models have been proposed in which the robot learns new actions by imitating the actions of humans (Kuniyoshi et al. 1994;Morimoto and Doya 2001;Mataric 2000). Research into constructing intelligence based on encoding bodily motions into symbols has been conducted. Haruno et al. (2001) proposed the module selection and identification for control (MOSAIC) system, which performs environment recognition and action generation in the framework of reinforcementbased learning of multiple learning modules that store different action primitives. Tani and Ito (2003) proposed the recurrent neural network with parametric bias (RNNPB) method, in which multiple action primitives are encoded into bias parameters to be added to a recurrent neural network in which the parameters switch the action primitives. Inamura et al. (2004) proposed a model of encoding motion patterns into hidden Markov models (HMMs). Furthermore, Takano and Nakamura (2015a, b) proposed a model that combines motion symbols characterized by HMMs with natural language, and developed a computation method for creating sentences that represent motions.
However, these motion recognition systems use only bodily motion information such as the three-dimensional position of each part of the body or the time-series data of joint angles, and it is anticipated that these systems will be extended to handle environment (a) for understanding actions in which meaning is imparted to human motion by interactions with the environment, and (b) for generating actions such as manipulation of objects in the environment. For example, in the case of the action "moving hand towards mouth," there is the problem of not being able to understand whether something is held in the hand and, if so, what that object is. This is because object type and position information are not used. Information about object manipulations associated with human actions is important for associating meaning with actions, and the intelligence to understand this is mandatory for humanoid robots that operate in living environments.
In the field of computer vision, the importance of using information from human motion and objects has been noted for understanding the actions that accompany objects (manipulation targets) in everyday life, such as human-object interactions (HOI) (Gupta et al. 2009;Yao and Fei-Fei 2012). For this, it has been noted that the pose of the human is important when detecting and recognizing objects in images. At the same time, manipulated object information is important when recognizing human motions, and the performance of both is expected to be improved by using both object and human-pose information. Information about body motion and information about target object motion are important elements in understanding human actions, so it is expected that performance can be improved by using both types of information. Recently, there have been extensive works on linking natural language description to static images or videos Kojima et al. 2002). Krishnamoorthy et al (2013). presented an approach to recognizing visual objects and activities, and generating triples of subject, verb and object in two probabilistic vision and natural language scores. Kulkarni et al. (2011) also proposed a probabilistic approach to generating image descriptions . Objects in an image are detected and their regions are processed for the positional relationship. A conditional random field predicts labels for the image description by incorporating the image potentials and natural language potential. Deep neural networks has demonstrated the rich representation for the image classification, and the neural networks has been actively applied to the image encoder and the description decoder (Vinyals et al. 2015;Karpathy and Fei-Fei 2015). These methods focus on generating sentences describing the images. They don't handle the positions of the performer and manipulated object in the three dimensional environment, and therefore cannot be directly reused to generate the activities from the description. Action recognition that includes motion recognition, object recognition, and generation of sentences that represent the action can be approached by extending the statistical information processing system proposed by Takano and Nakamura (2015b), which uses three dimensional body motion and language, and developing an information processing mechanism that includes a function for flexible handling of manipulation target (i.e., object information).
This paper proposes a link of human whole body motions, manipulation target objects and language for synthesizing sentence describing human actions. Human motions and the spatial relations between the object and body parts are classified into motion categories, and objects to be manipulated are classified into object categories by motion recognition and manipulation target object recognition, respectively. Including the spatial relation between an object and body parts as part of the action information allows identification of motions that could not be identified from body motion information alone. This ability extends beyond that of conventional methods. The spatial relation is defined by the distances and relative positions of nodes, in a manner similar to an interaction mesh (Ho et al. 2010;Ho and Shum 2013). The mesh is used in motion synthesis with adaption for objects and obstacles in the environment and to detect objects in the environment. Object segmentation is derived from color and depth information in point cloud data, and object identification is subsequently performed using the extracted image information. We additionally construct a statistical model (the motion object language model) that learns the relations that connect motions, objects, and a sentence. The sentence represents the action by a recurrent neural network (natural language model) that learns the order of words in the sentence as a dynamical system. Words to reflect the body motion and object information are identified by applying motion recognition, object recognition, and the motion object language model, after which sentences are generated by reordering the words according to the natural language model. This is done with the aim of more correctly understanding human actions by using multimodal information comprising body motion information, such as three-dimensional position information of each body part and time-series data of joint angles, and the positions and types of objects in the environment with descriptive sentences representing the action.

Motion and object primitives
A human action consists of a human whole body motion and an object to be manipulated. The classifications of the human whole body motion and the object are required to generate sentences describing the human action. This section describes the representations of the human motion and the object, and their classifiers.

Human whole body primitives
Action recognition is a highly competitive field, and many approaches that handle human body motions and objects to be manipulated have been reported. Wang et al. (2012) used a feature called "local occupancy pattern" in which elements represent the area occupied by an object around each joint, and they defined the feature for action recognition by combining the local occupancy pattern with the positional relations between two joints over the set of all joint pairs. The temporal sequences of these features were converted to a Fourier temporal pyramid and classified into the relevant action category by using a support vector machine (SVM). The method proposed by Yu et al. (2014) is similar to that of Wang et al., but they used the distances between two joints in the whole body as features. SVM-based methods adopt a discriminative approach to classification that generally outperforms the generative approach typified by hidden Markov models (HMMs), but they cannot recover the human motions, such as a sequence of joint positions. In this paper, we use HMMs to encode the human motions to create motion primitives because HMMs can be used for action recognition and generation of human-like motion for robots. Feature x of human motion ( Fig. 1) consists of two elements: position p i of the ith joint in the trunk coordinate system, and distance d i between the ith joint and the manipulated object. p i and d i are concatenated into x over all joints. The human motion is expressed by a sequence of these fea- Human motions are encoded into a set of parameters to characterize HMMs. HMMs are generative models optimized such that the likelihood of the human motion x being  Fig. 1 The motion feature is expressed by the vector x T whose elements are joint positions in the trunk coordinate system or distances between the joints and the manipulated object

Fig. 2
Point cloud data is segmented into object regions, local features are computed in each object region, the global feature is extracted from the computed local features, and the global feature is then classified into an object primitive generated from HMM λ is maximized. The parameters to be optimized are a vector whose entries π i are (for each i) the probability of starting at the ith node, the matrix A whose entries a i j are the probabilities of transition from the ith node to the jth node, and the output distribution B(x) whose entries b i (x) are the probabilities of x being generated from the ith node. The Baum-Welch algorithm can optimize these parameters (Rabiner 1989). Moreover, HMMs can be used to classify human motions x into the specific HMM λ R that is the most likely to generate x.

Object primitives
An object to be manipulated is captured by an RGB-D camera. The image and depth data are segmented into an object region, scale-invariant feature transform (SIFT) descriptors are computed for the local feature in the object region, and the Fisher vector descriptor is extracted from the computed local features as the global feature, which is classified into an object primitive. Figure 2 shows the pipeline to convert captured RGB-D data into the corresponding object primitive. The method of region growing and region merging partitions RGB-D data into object regions (Zhan et al. 2009). The region growing process randomly selects an ungrouped point, and then groups it with all ungrouped points closer than a manually given threshold. This process is iterated until all points are grouped into one of the regions. The region merging process finds two regions that are close to each other and aggregates them into one region. The merging process results in the object region.
The segmented object region is processed to extra the local features of the objects contained in the region. SIFT descriptors are adopted for the local features because they are colored and scale-invariant (Lowe 1999;Abdel-Kalim et al. 2006). The SIFT descriptors represent only local patches in the segmented region. The Fisher vector is introduced as a global feature to represent the entire region. The derivation of the Fisher vector is described in the "Appendix". The Fisher vector is classified into its relevant object primitive by using the SVM technique (Cortes and Vapnik 1995). These processes together convert the captured point cloud data into object primitives.

Fig. 3
Human whole body motion and RGB-D images are classified into the motion primitive and object primitive. These primitives are connected to their relevant words stochastically. The probabilities of the latent node being generated from the motion and object primitives and the probabilities of the word from the latent node are optimized such that the words related to the action are the most likely to be generated from the motion and object to be manipulated

Connection between human actions and description
Our framework to generate the sentences from human actions consists of two modules. The first module combines the human whole body motions and the manipulated objects with their relevant words. The second module models sequences of the words in the sentences. This section describes these modules in details.

Stochastic model of words from motions and objects
The motion primitives and object primitives, which are derived by classifying human whole body motions and images containing an object to be manipulated, are connected to their relevant words stochastically. Figure 3 shows the stochastic model for the connections. This stochastic model is made of three layers: the top layer contains the primitives, the bottom layer contains the words, and the middle layer contains the latent nodes. The latent nodes connect the motions and the objects to the words. The connectivities are characterized by the conditional probability P(s|λ, κ) of latent node s being generated from motion primitive λ and object primitive κ and the probability P(ω|s) of word ω being gen- is the jth word in the word sequence (sentence) that is manually attached to the ith action whose whole body motion is classified into the motion primitive λ (i) , and the manipulated object is classified into the object primitive κ (i) . The E-step estimates the distribution for latent node s conditioned on motion primitive λ, object primitive κ, and word ω as The M-step updates probabilities P(s|λ, κ) and P(ω|s) to where n(λ, κ, ω) is a function that counts the number of words ω attached to the actions for which the whole body motions are classified into motion primitive λ and the objects to be manipulated are classified into object primitive κ.
Alternating the E-step and M-step results in the optimal probabilities for P(s|λ, κ) and P(ω|s). The deviation of the EM algorithm is described in the "Appendix".

Recurrent neural network for action descriptions
Neural networks have been widely used for modeling sentences (Bengio et al. 2006), and have been extended to recurrent neural networks to handle the dynamics of word sequences in sentences (Mikolov et al. 2010). Recurrent neural networks predict words that follow the input words via latent layers that can handle context in sentences. Figure 4 shows a recurrent neural network that consists of input, latent, and output layers. The input and output layers comprise word nodes. The number of nodes in the input and output layers is the same for each layer as the number of words that can appear in the sentences. The input layer is connected to the output layer through latent nodes, which represent the current state and retain the previous state. Specifically, the input vector is x t ∈ R N ω and the output vector is y t ∈ R N ω , where N ω is the number of distinct words. The activities of the latent node are expressed by z t ∈ R N z for current activities and z t−1 ∈ R N z for past activities, where N z is the number of nodes in the latent layer. If the kth word is given for the input, x t is set to the binary vector where z i andz i are the ith elements in z t andz t , respectively, U ∈ R N z ×N ω and W ∈ R N z ×N z are weight matrices, and f (z) is a sigmoid function. y t is computed from z t in a similar manner, where y i andỹ i are the ith elements in y t andỹ t , respectively, V ∈ R N ω ×N z is a weight matrix, and g i (ỹ t ) is the following function. In this function, y i represents the probability of the ith word being generated from the input word sequence.
The weight matrices, U, V , and W , are trained by back propagation through time; this method incrementally updates the weight parameters to reduce the errors between the output vectors y t and the correct vectors d t . Weight matrix V is tuned as The errorsẽ are propagated from the output layer to the latent layer.
e it is the ith element ofẽ t , and s i t is the ith element of s t . The weight matrices U and W are updated by using the error e.
α, β, and γ are learning rates; these have been set to decrease monotonically, following Bergstra and Bengio (2012).

Generation of action descriptions from motions and objects
Integrating the two modules described above allows the generation of sentences describing human actions. Figure 5 shows an overview of this integration. An observation containing a human motion and an object is classified into a motion primitive and an object primitive. The words relevant to the motion and object are associated by the stochastic model, and these words are arranged into a sentence according to the recurrent neural network. Specifically, given motion primitive λ R and object primitive κ R , the pair of primitives is converted to a sentence that is the most likely to be generated from these primitives. The sentence can be formed by searching for a sequence of words according to the probability of the sentence being generated from two modules given the motion primitive and the object primitive as Here, sentence ω is expressed by word sequence ω 1 , ω 2 , . . . , ω l . We assume that a set of words contained in the sentence depends on only the motion primitive and the object primitive, and that a sequence of words depends on only the set of words. The probability of each word being generated from the motion primitive and object primitive can be computed by using Eqs. 3 and 4.
The probability of word sequence ω 1 , ω 2 , . . . , ω i being followed by word ω i+1 can be computed by the recurrent neural network. Taking the logarithm of P(ω|λ R , κ R ) and using Dijkstra's algorithm, we can search for the sentence that is the most likely to be generated from the motion and object.

Experiments
Our proposed approach is tested to see how well it generates descriptions from observations of human actions. The  (Kadaba et al. 1990). The positions of the 34 markers in the character's trunk coordinate system and the distances between these markers and an object to be manipulated together express the human whole body motion to be used for motion primitives. An image from the RGB-D sensor is segmented into an object region. The local features are extracted from the object region, and then global features are computed for the object primitives. We measure actions by three performers, and 320 observations are collected for each of these three performers, giving 960 observations in total. Motion and object data contained in these datasets can be grouped into 24 motion primitives and 30 object primitives. Additionally, five students attached one sentence descriptive of each action. Figure 7 shows several samples of the action and the manually attached descriptions. The dataset contains 960 sentences, with 335 different words. The dataset is grouped into a training dataset containing 576 actions and a test dataset containing 384 actions. Figure 8 qualitatively shows the experimental results. Three sentences that are most likely to be generated from each observation are displayed. The observation of "blowing the nose" is described by three sentences: "a person blows their nose with a tissue", "a person blows their nose with tissue paper", and "a person is blowing their nose". The observation of "sweeping" is expressed by sentences "a person is sweeping the floor with a broom", "they are cleaning the floor" and "they clean the floor with a broom". The observation of "picking" can be described as sentences: "a person picks up the box on the bottle" "they pick up the box on the bottle" and "a person picks up a box". The first sentence is same as the training sentence as shown in Fig. 7. The observation of "drinking" generates the sentences "a person drinks a bottle of tea", "a person is drinking a bottle of tea" and "a person drinks out of a bottle"; these are similar to sentences attached to the action of "drinking", as shown in Fig. 7. Other observations are also described by qualitatively correct sentences.
We also quantitatively test the sentence generation. In the first test, up to five sentences are generated from each test observation. When the generated sentence is the same as the sentence attached to the test observation, this sentence is counted as correct. This is the 5-best condition; more generally, for the m-best condition, the number of generated sentences is set to m, and if any of the generated sentences is correct, the generation is counted as correct. The correct ratios are 0.71, 0.84, 0.89, 0.91, and 0.92 for the 1-best, 2best, 3-best, 4-best, and 5-best conditions, respectively.
It is important to evaluate the improvement by adding the object to be manipulated for the sentence generation. We removes the layer of the object primitives from the module as shown in Fig. 3. More specifically, we tested the sentence generation only from the human motion. The correct ratios of the sentence generation are 0.54, 0.54, 0.55, 0.59, and 0.59 for the 1-best, 2-best, 3-best, 4-best, and 5-best conditions, respectively. The comparison of these correct ratios with those derived in our proposed framework demonstrates that the information of the objects is effectively used to generate sentences from the action observations. Additionally, 9944 sentences with 1174 different words attached to the human action are crowdsourced. After training 576 human actions and 5984 sentences attached to these actions, we tested the sentence generation. Multiple sentences are attached to each human action. When the generated sentence is the same as one of the sentences attached to the test observation, this Motion primitive : drink Object primitive : mug Description: a person drinks from a cup. he drinks a cup of water all in one go. a man drinks a cup of tea. a man is drinking a mug of coffee. a person is drinking a mug of coffee.  Fig. 7 The datasets contain human whole body motions, objects and sentences describing human actions. The actions in the left panel consist of the motion primitive "type", the motion primitive "laptop" and descriptions "a person types on a laptop"  Fig. 8 Observation is classified into its relevant motion primitive and object primitive, and a pair of these primitives is converted to sentences describing the observation. The three mostly likely sentences from each observation are displayed. For example, the observation "blowing the nose" is described by the sentences "a person blows their nose with a tissue", "a person blows their nose with tissue paper", "a person is blowing their nose" generation is counted as correct. The correct ratios of the sentence generation are 0.67, 0.69, 0.78, 0.84, and 0.90 for the 1-best, 2-best, 3-best, 4-best, and 5-best conditions, respectively.

Conclusions
This research is summarized as follows.
1. We proposed a framework for linking human actions (consisting of human whole body motions and objects to be manipulated) with sentences describing the actions. For this, the human whole body motion and positional relation between the body and the object are encoded into a motion primitive; also, an object feature is extracted from an object region in a captured image and is then encoded into an object primitive. A pair of motion primitive and object primitive is stochastically connected to words relevant to the action. Additionally, the dynamics of word sequences in sentences descriptive of the actions is trained by a recurrent neural network, which can predict that word that is likely to follow a sequence of words. 2. We linked two modules: a stochastic model between motions, objects, and words; and a recurrent neural network for the sentence structure. The link makes it possible to search for the sentences that are most likely to be generated from the observation of human action. Specifically, the recurrent neural network efficiently generates the sentence whose words are most likely to be generated from a given motion primitive and object primitive in the stochastic model. This link implies that the observation of the human action can be interpreted by considering corresponding descriptive sentences. 3. We constructed a stochastic model to describe the relations between motions, objects, and words, and a recurrent neural network to describe the sentence structures. Each was trained against a dataset containing 576 pieces of data for triples comprising a human motion, an object, and a (manually chosen) sentence. We conducted an experiment with the stochastic model and the neural network, generating sentences for 384 test observations. We qualitatively and quantitatively confirmed that our proposed method can generate correct sentences from observations of human actions.
where the elements of the gradient vector are Note that k is assumed to be a diagonal matrix, and that γ k (u) is written as Here, F is the Fisher information matrix, and is defined as where E[ * ] is the expectation value.

Appendix B
Human actions contain data about human whole body motions and objects. The whole body motion, including the relative position of the object to be manipulated, is encoded into motion primitive λ, and the image containing the object is encoded into object primitive κ. The sentence describing the human action is manually assigned to the action. The sentence is expressed by a sequence of words, ω. Let the training dataset be λ (i) , κ (i) , ω (i) 1 , . . . , ω (i) n i , consisting of motion primitives, object primitives, and sentences. Then, the logarithm of the probability of the words ω (i) 1 , . . . , ω (i) n i in the sentence being generated from motion primitive λ (i) and object primitive κ (i) over the training dataset is written as where we assume that the word depends on only the motion and the object. According to the marginal distribution of the latent node, Eq. 29 is rewritten as This equation can be written as the expectation The lower limit of this expectation, L , is given by the Jensen inequality.
From and L , the following equations are derived.
Here, P(s|λ, κ, ω) is the estimated distribution of latent node s based on the model, andP(s|λ, κ, ω) is the true distribution of s. Equation 34 implies the Kullback-Leibler information KL P (s|λ, κ, ω)||P (s|λ, κ, ω) between these two distributions. The Kullback-Leibler information is nonnegative, and is zero only when these two distributions are the same. Therefore, estimating the true distribution,P(s|λ, κ, ω), as the model-based distribution, P(s|λ, κ, ω), yields zero for the Kullback-Leibler information. The E-step estimates the distributionP(s|λ, κ, ω) of the latent node such that the Kullback-Leibler information becomes zero.
After the E-step, the model parameters are iteratively optimized such that increases incrementally. Let [t+1] and [t] be the objective functions given at the (t + 1)th and tth   (i) , κ (i) , ω (i) j ) are the estimated distributions of the latent node based on the models derived at the (t + 1)th and tth iteration steps, again respectively. Estimation of the true distributionP(s|λ, κ, ω) of the latent node as the distribution P [t] (s|λ, κ, ω) based on the model derived at the tth iteration step leads to the relation L at the (t + 1)th iteration step results in [t+1] becoming larger than [t] . By assigning the distribution P [t] (s|λ, κ, ω) toP(s|λ, κ, ω) (i) , ω (i) j ln P ω (i)|s k j + ln P s k |λ (i) , κ (i) .
The Lagrange function is obtained as The derivative of the Lagrange function with respect to P(ω i |s) or P(s|λ, κ) is zero at the optimal parameter, which is derived as The M-step searches for the optimal parameters in this manner, and the EM algorithm alternates the E-step and the M-step.