Linking human motions and objects to language for synthesizing action sentences

Takano, Wataru; Yamada, Yoshihiko; Nakamura, Yoshihiko

doi:10.1007/s10514-018-9762-1

Linking human motions and objects to language for synthesizing action sentences

Open access
Published: 02 May 2018

Volume 43, pages 913–925, (2019)
Cite this article

Download PDF

You have full access to this open access article

Autonomous Robots Aims and scope Submit manuscript

Linking human motions and objects to language for synthesizing action sentences

Download PDF

Wataru Takano¹,
Yoshihiko Yamada¹ &
Yoshihiko Nakamura¹

2959 Accesses
10 Citations
Explore all metrics

Abstract

This paper proposes a novel framework for generating action descriptions from human whole body motions and objects to be manipulated. This generation is based on three modules: the first module categorizes human motions and objects; the second module associates the motion and object categories with words; and the third module extracts a sentence structure as word sequences. Human motions and objects to be manipulated are classified into categories in the first module, then words highly relevant to the motion and object categories are generated from the second module, and finally the words are converted into sentences in the form of word sequences by the third module. The motions and objects along with the relations among the motions, objects, and words are parametrized stochastically by the first and second modules. The sentence structures are parametrized from a dataset of word sequences in a dynamical system by the third module. The link of the stochastic representation of the motions, objects, and words with the dynamical representation of the sentences allows for synthesizing sentences descriptive to human actions. We tested our proposed method on synthesizing action descriptions for a human action dataset captured by an RGB-D sensor, and demonstrated its validity.

MotionCLIP: Exposing Human Motion Generation to CLIP Space

Motion2language, unsupervised learning of synchronized semantic motion segmentation

Article 13 December 2023

Classification of Multi-class Daily Human Motion using Discriminative Body Parts and Sentence Descriptions

Article Open access 10 November 2017

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The demographic trend in advanced countries is that the percentage of elderly people is increasing, even as the total population is shrinking. The availability of comprehensive nursing services to provide living support for elderly people to improve per-capita productivity by covering for labor force shortages are important problems. The use of humanoid robots is expected to address these problems. Because humanoid robots are similar in shape to humans, they can perform human-like actions, and there is no need to adapt the environment from what suits humans. This allows humanoid robots to be used as a replacement labor force for humans for everyday tasks.

Research and development into humanoid robots has been actively pursued in a variety of fields in recent years (Sugano and Kato 1987; Kuroki et al. 2003; Kaneko et al. 2002; Cheng et al. 2007). Although this research has focused on increasing the integration density and accuracy of hardware technology, other elements are essential to constructing intelligent humanoid robots: software for obtaining external information corresponding to the five human senses (sight, sound, touch, taste, and smell), perceiving by using the obtained information, and controlling the motion of the robot. Developments in these areas are expected to bring value for replacement labor force, communication with humans, and information processing systems that exceed the capabilities of humans. Development of not only the hardware but also the software (here, considered as the intelligence of the robots) is an important element that is expected to create new value in robots.

Research into artificial intelligence that is able to perform advanced information processing like a human involves a variety of academic topics beyond robot engineering, including also linguistics, semiotics, anthropology, psychology, neuroscience, brain science, and sociology. The key difference between human intelligence and that of other animals is said to be the ability to use language that incorporates advanced symbols. Humans have acquired language through evolutionary processes. For example, in the phrase gread a book,h the concept of a book is expressed by the word gbook.h, and the concept of a particular bodily action is expressed by the word gread.h Phenomena in the real world can be expressed in this way through language. Human intelligence is built on a symbolic system. Because of this, we are able to think specifically about actions, understand abstract concepts, and share the ideas of other people by using symbols. The information processing of symbols and language forms the foundation of the advanced intelligence expressed in the computations of humans.

Ferdinand de Saussure described the composition of language as using “signe" (sign), “signifiant" (signifier), and “signifie" (signified) (Saussure 1966). The signifier is the symbolic representation of the specified item, the signified is the content of the sign that represents the specified concept, and the relation between the signifier and signified gives rise to signs. In the example above, the word “book" is the signifier, and this word signifies the book that exists in the real world. The book itself is the signified. The ability to manipulate signs and real world phenomena is improved and language is developed by arbitrarily forming associations between signs and the real world.

In brain science, Rizzolatti et al. (2001) discovered the existence of a set of neurons (mirror neurons) in the brains of Macaque monkeys that fire when the behavior of another is observed and when movement is performed by oneself. Mirror neurons are related to the generation and recognition of motion, but have also been found in the Brocafs area, which is responsible for the language processing in humans. This implies a relation among the mirror neuron system, the generation and recognition of motions, and the language.

Based on this knowledge, imitation learning models have been proposed in which the robot learns new actions by imitating the actions of humans (Kuniyoshi et al. 1994; Morimoto and Doya 2001; Mataric 2000). Research into constructing intelligence based on encoding bodily motions into symbols has been conducted. Haruno et al. (2001) proposed the module selection and identification for control (MOSAIC) system, which performs environment recognition and action generation in the framework of reinforcement-based learning of multiple learning modules that store different action primitives. Tani and Ito (2003) proposed the recurrent neural network with parametric bias (RNNPB) method, in which multiple action primitives are encoded into bias parameters to be added to a recurrent neural network in which the parameters switch the action primitives. Inamura et al. (2004) proposed a model of encoding motion patterns into hidden Markov models (HMMs). Furthermore, Takano and Nakamura (2015a, b) proposed a model that combines motion symbols characterized by HMMs with natural language, and developed a computation method for creating sentences that represent motions.

However, these motion recognition systems use only bodily motion information such as the three-dimensional position of each part of the body or the time-series data of joint angles, and it is anticipated that these systems will be extended to handle environment (a) for understanding actions in which meaning is imparted to human motion by interactions with the environment, and (b) for generating actions such as manipulation of objects in the environment. For example, in the case of the action “moving hand towards mouth," there is the problem of not being able to understand whether something is held in the hand and, if so, what that object is. This is because object type and position information are not used. Information about object manipulations associated with human actions is important for associating meaning with actions, and the intelligence to understand this is mandatory for humanoid robots that operate in living environments.

In the field of computer vision, the importance of using information from human motion and objects has been noted for understanding the actions that accompany objects (manipulation targets) in everyday life, such as human–object interactions (HOI) (Gupta et al. 2009; Yao and Fei-Fei 2012). For this, it has been noted that the pose of the human is important when detecting and recognizing objects in images. At the same time, manipulated object information is important when recognizing human motions, and the performance of both is expected to be improved by using both object and human-pose information. Information about body motion and information about target object motion are important elements in understanding human actions, so it is expected that performance can be improved by using both types of information. Recently, there have been extensive works on linking natural language description to static images or videos (Li 2011; Kojima et al. 2002). Krishnamoorthy et al (2013). presented an approach to recognizing visual objects and activities, and generating triples of subject, verb and object in two probabilistic vision and natural language scores. Kulkarni et al. (2011) also proposed a probabilistic approach to generating image descriptions (Kulkarni et al. 2011). Objects in an image are detected and their regions are processed for the positional relationship. A conditional random field predicts labels for the image description by incorporating the image potentials and natural language potential. Deep neural networks has demonstrated the rich representation for the image classification, and the neural networks has been actively applied to the image encoder and the description decoder (Vinyals et al. 2015; Karpathy and Fei-Fei 2015). These methods focus on generating sentences describing the images. They don’t handle the positions of the performer and manipulated object in the three dimensional environment, and therefore cannot be directly reused to generate the activities from the description. Action recognition that includes motion recognition, object recognition, and generation of sentences that represent the action can be approached by extending the statistical information processing system proposed by Takano and Nakamura (2015b), which uses three dimensional body motion and language, and developing an information processing mechanism that includes a function for flexible handling of manipulation target (i.e., object information).

This paper proposes a link of human whole body motions, manipulation target objects and language for synthesizing sentence describing human actions. Human motions and the spatial relations between the object and body parts are classified into motion categories, and objects to be manipulated are classified into object categories by motion recognition and manipulation target object recognition, respectively. Including the spatial relation between an object and body parts as part of the action information allows identification of motions that could not be identified from body motion information alone. This ability extends beyond that of conventional methods. The spatial relation is defined by the distances and relative positions of nodes, in a manner similar to an interaction mesh (Ho et al. 2010; Ho and Shum 2013). The mesh is used in motion synthesis with adaption for objects and obstacles in the environment and to detect objects in the environment. Object segmentation is derived from color and depth information in point cloud data, and object identification is subsequently performed using the extracted image information. We additionally construct a statistical model (the motion object language model) that learns the relations that connect motions, objects, and a sentence. The sentence represents the action by a recurrent neural network (natural language model) that learns the order of words in the sentence as a dynamical system. Words to reflect the body motion and object information are identified by applying motion recognition, object recognition, and the motion object language model, after which sentences are generated by reordering the words according to the natural language model. This is done with the aim of more correctly understanding human actions by using multimodal information comprising body motion information, such as three-dimensional position information of each body part and time-series data of joint angles, and the positions and types of objects in the environment with descriptive sentences representing the action.

2 Motion and object primitives

A human action consists of a human whole body motion and an object to be manipulated. The classifications of the human whole body motion and the object are required to generate sentences describing the human action. This section describes the representations of the human motion and the object, and their classifiers.

2.1 Human whole body primitives

Action recognition is a highly competitive field, and many approaches that handle human body motions and objects to be manipulated have been reported. Wang et al. (2012) used a feature called “local occupancy pattern" in which elements represent the area occupied by an object around each joint, and they defined the feature for action recognition by combining the local occupancy pattern with the positional relations between two joints over the set of all joint pairs. The temporal sequences of these features were converted to a Fourier temporal pyramid and classified into the relevant action category by using a support vector machine (SVM). The method proposed by Yu et al. (2014) is similar to that of Wang et al., but they used the distances between two joints in the whole body as features. SVM-based methods adopt a discriminative approach to classification that generally outperforms the generative approach typified by hidden Markov models (HMMs), but they cannot recover the human motions, such as a sequence of joint positions. In this paper, we use HMMs to encode the human motions to create motion primitives because HMMs can be used for action recognition and generation of human-like motion for robots.

Feature ${\varvec{x}}$ of human motion (Fig. 1) consists of two elements: position ${\varvec{p}}_i$ of the ith joint in the trunk coordinate system, and distance $d_i$ between the ith joint and the manipulated object. ${\varvec{p}}_i$ and $d_i$ are concatenated into ${\varvec{x}}$ over all joints. The human motion is expressed by a sequence of these features, ${\varvec{x}}= \left( {\varvec{x}}_1,{\varvec{x}}_2,\ldots , {\varvec{x}}_T \right) $.

Human motions are encoded into a set of parameters to characterize HMMs. HMMs are generative models optimized such that the likelihood of the human motion ${\varvec{x}}$ being generated from HMM $\lambda $ is maximized. The parameters to be optimized are a vector ${{\varvec{\Pi } }}$ whose entries $\pi _i$ are (for each i) the probability of starting at the ith node, the matrix ${{\varvec{A}}}$ whose entries $a_{ij}$ are the probabilities of transition from the ith node to the jth node, and the output distribution ${\varvec{B}}({\varvec{x}})$ whose entries $b_i({\varvec{x}})$ are the probabilities of ${\varvec{x}}$ being generated from the ith node. The Baum–Welch algorithm can optimize these parameters (Rabiner 1989). Moreover, HMMs can be used to classify human motions ${\varvec{x}}$ into the specific HMM $\lambda _\mathcal{R}$ that is the most likely to generate ${\varvec{x}}$.

$$\begin{aligned} \lambda _\mathcal{R} = \arg \max _{\lambda } P({\varvec{x}} | \lambda ) \end{aligned}$$

(1)

2.2 Object primitives

An object to be manipulated is captured by an RGB-D camera. The image and depth data are segmented into an object region, scale-invariant feature transform (SIFT) descriptors are computed for the local feature in the object region, and the Fisher vector descriptor is extracted from the computed local features as the global feature, which is classified into an object primitive. Figure 2 shows the pipeline to convert captured RGB-D data into the corresponding object primitive.

The method of region growing and region merging partitions RGB-D data into object regions (Zhan et al. 2009). The region growing process randomly selects an ungrouped point, and then groups it with all ungrouped points closer than a manually given threshold. This process is iterated until all points are grouped into one of the regions. The region merging process finds two regions that are close to each other and aggregates them into one region. The merging process results in the object region.

The segmented object region is processed to extra the local features of the objects contained in the region. SIFT descriptors are adopted for the local features because they are colored and scale-invariant (Lowe 1999; Abdel-Kalim et al. 2006). The SIFT descriptors represent only local patches in the segmented region. The Fisher vector is introduced as a global feature to represent the entire region. The derivation of the Fisher vector is described in the “Appendix”. The Fisher vector is classified into its relevant object primitive by using the SVM technique (Cortes and Vapnik 1995). These processes together convert the captured point cloud data into object primitives.

3 Connection between human actions and description

Our framework to generate the sentences from human actions consists of two modules. The first module combines the human whole body motions and the manipulated objects with their relevant words. The second module models sequences of the words in the sentences. This section describes these modules in details.

3.1 Stochastic model of words from motions and objects

The motion primitives and object primitives, which are derived by classifying human whole body motions and images containing an object to be manipulated, are connected to their relevant words stochastically. Figure 3 shows the stochastic model for the connections. This stochastic model is made of three layers: the top layer contains the primitives, the bottom layer contains the words, and the middle layer contains the latent nodes. The latent nodes connect the motions and the objects to the words. The connectivities are characterized by the conditional probability $P(s \vert \lambda , \kappa )$ of latent node s being generated from motion primitive $\lambda $ and object primitive $\kappa $ and the probability $P(\omega \vert s)$ of word $\omega $ being generated from latent node s. These probabilities are optimized by expectation maximization, iterating the E-step and M-step so that the training dataset of the motions, objects, and words is the most likely to be generated from this model. The training dataset $\left\{ \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_1, \ldots , \omega ^{(i)}_{n_i} \right\} $ is given; in this, $\omega ^{(i)}_j$ is the jth word in the word sequence (sentence) that is manually attached to the ith action whose whole body motion is classified into the motion primitive $\lambda ^{(i)}$, and the manipulated object is classified into the object primitive $\kappa ^{(i)}$. The E-step estimates the distribution for latent node s conditioned on motion primitive $\lambda $, object primitive $\kappa $, and word $\omega $ as

$$\begin{aligned} P(s \vert \lambda , \kappa , \omega ) = \frac{P(\omega \vert s) P(s \vert \lambda , \kappa )}{ {\displaystyle \sum \nolimits _k} P(\omega \vert s_k) P(s_k \vert \lambda , \kappa ). } \end{aligned}$$

(2)

The M-step updates probabilities $P(s \vert \lambda , \kappa )$ and $P(\omega \vert s)$ to

$$\begin{aligned} P(\omega _i \vert s)= & {} \frac{ {\displaystyle \sum \nolimits _{i,j}} n(\lambda _i, \kappa _j, \omega ) P(s \vert \lambda _i, \kappa _j, \omega ) }{ {\displaystyle \sum \nolimits _{i,j,k}} n(\lambda _i, \kappa _j, \omega _k) P(s \vert \lambda _i, \kappa _j, \omega _k) } \end{aligned}$$

(3)

$$\begin{aligned} P(s \vert \lambda , \kappa )= & {} \frac{ {\displaystyle \sum \nolimits _i } n(\lambda , \kappa , \omega _i) P(s \vert \lambda , \kappa , \omega _i) }{ {\displaystyle \sum \nolimits _{i}} n(\lambda , \kappa , \omega _i), } \end{aligned}$$

(4)

where $n(\lambda , \kappa , \omega )$ is a function that counts the number of words $\omega $ attached to the actions for which the whole body motions are classified into motion primitive $\lambda $ and the objects to be manipulated are classified into object primitive $\kappa $. Alternating the E-step and M-step results in the optimal probabilities for $P(s \vert \lambda , \kappa )$ and $P(\omega \vert s)$. The deviation of the EM algorithm is described in the “Appendix”.

3.2 Recurrent neural network for action descriptions

Neural networks have been widely used for modeling sentences (Bengio et al. 2006), and have been extended to recurrent neural networks to handle the dynamics of word sequences in sentences (Mikolov et al. 2010). Recurrent neural networks predict words that follow the input words via latent layers that can handle context in sentences. Figure 4 shows a recurrent neural network that consists of input, latent, and output layers. The input and output layers comprise word nodes. The number of nodes in the input and output layers is the same for each layer as the number of words that can appear in the sentences. The input layer is connected to the output layer through latent nodes, which represent the current state and retain the previous state. Specifically, the input vector is ${\varvec{x}}_t \in R^{N_{\omega }}$ and the output vector is ${\varvec{y}}_t \in R^{N_{\omega }}$, where $N_{\omega }$ is the number of distinct words. The activities of the latent node are expressed by ${{\varvec{z}}}_t \in R^{N_{z}}$ for current activities and ${{\varvec{z}}}_{t-1} \in R^{N_{z}}$ for past activities, where $N_{z}$ is the number of nodes in the latent layer. If the kth word is given for the input, ${\varvec{x}}_t$ is set to the binary vector

$$\begin{aligned} x_i = \left\{ \begin{array}{c@{\quad }c} 1 &{} if \ \ i=k \\ 0 &{} \hbox {otherwise}, \\ \end{array} \right. \end{aligned}$$

(5)

where $x_i$ is the ith element in ${\varvec{x}}_t$. ${{\varvec{z}}}_t$ is computed from ${\varvec{x}}_t$ as

$$\begin{aligned} \tilde{{\varvec{z}}}_t= & {} {{\varvec{U}}} {\varvec{x}}_t + {{\varvec{W}}} {{\varvec{z}}}_{t-1} \end{aligned}$$

(6)

$$\begin{aligned} z_i= & {} f(\tilde{z}_i), \end{aligned}$$

(7)

where $z_i$ and $\tilde{z}_i$ are the ith elements in ${{\varvec{z}}}_t$ and $\tilde{{{\varvec{z}}}}_t$, respectively, ${\varvec{U}} \in R^{N_z \times N_{\omega }}$ and ${{\varvec{W}}} \in R^{N_z \times N_z} $ are weight matrices, and f(z) is a sigmoid function. ${\varvec{y}}_t$ is computed from ${{\varvec{z}}}_t$ in a similar manner,

$$\begin{aligned} \tilde{{\varvec{y}}}_t= & {} {{\varvec{V}}} {{\varvec{z}}}_t \end{aligned}$$

(8)

$$\begin{aligned} y_i= & {} g_i(\tilde{{\varvec{y}}}_t), \end{aligned}$$

(9)

where $y_i$ and $\tilde{y}_i$ are the ith elements in ${\varvec{y}}_t$ and $\tilde{{\varvec{y}}}_t$, respectively, ${{\varvec{V}}} \in R^{N_{\omega } \times N_{z}}$ is a weight matrix, and $g_i(\tilde{{\varvec{y}}}_t)$ is the following function. In this function, $y_i$ represents the probability of the ith word being generated from the input word sequence.

$$\begin{aligned} g_i(\tilde{{\varvec{y}}}_t) = \frac{\exp ({\tilde{y}_i})}{ {\displaystyle \sum \nolimits _k} \exp ({\tilde{y}_k})} \end{aligned}$$

(10)

The weight matrices, ${{\varvec{U}}}$, ${{\varvec{V}}}$, and ${{\varvec{W}}}$, are trained by back propagation through time; this method incrementally updates the weight parameters to reduce the errors between the output vectors ${{\varvec{y}}}_t$ and the correct vectors ${{\varvec{d}}}_t$. Weight matrix ${{\varvec{V}}}$ is tuned as

$$\begin{aligned}&{{\varvec{e}}}_{t}= {{\varvec{d}}}_{t} -{{\varvec{y}}}_{t} \end{aligned}$$

(11)

$$\begin{aligned}&{{\varvec{V}}}_{t+1} = {{\varvec{V}}}_{t} + \alpha {{\varvec{s}}}_{t} {{{\varvec{e}}}_{t}}^T. \end{aligned}$$

(12)

The errors $\tilde{{\varvec{e}}}$ are propagated from the output layer to the latent layer.

$$\begin{aligned}&\tilde{e}_{it} = h_i({{{\varvec{e}}}_t}^T {{\varvec{V}}}, t) \end{aligned}$$

(13)

$$\begin{aligned}&h_i(x,t)= x s_{it}(1-s_{it}) \end{aligned}$$

(14)

$\tilde{e}_{it}$ is the ith element of $\tilde{{{\varvec{e}}}}_t$, and $s_it$ is the ith element of ${{\varvec{s}}}_t$. The weight matrices ${{\varvec{U}}}$ and ${{\varvec{W}}}$ are updated by using the error $\tilde{{{\varvec{e}}}}$.

$$\begin{aligned}&{{\varvec{U}}}_{t+1} = {{\varvec{U}}}_{t} + \beta {{\varvec{x}}}_{t} {\tilde{{{\varvec{e}}}}_t}^T \end{aligned}$$

(15)

$$\begin{aligned}&{{\varvec{W}}}_{t+1} = {{\varvec{W}}}_{t} + \gamma {{\varvec{s}}}_{t-1} {\tilde{{{\varvec{e}}}}_t}^T \end{aligned}$$

(16)

$\alpha $, $\beta $, and $\gamma $ are learning rates; these have been set to decrease monotonically, following Bergstra and Bengio (2012).

3.3 Generation of action descriptions from motions and objects

Integrating the two modules described above allows the generation of sentences describing human actions. Figure 5 shows an overview of this integration. An observation containing a human motion and an object is classified into a motion primitive and an object primitive. The words relevant to the motion and object are associated by the stochastic model, and these words are arranged into a sentence according to the recurrent neural network. Specifically, given motion primitive $\lambda _\mathcal{R}$ and object primitive $\kappa _\mathcal{R}$, the pair of primitives is converted to a sentence that is the most likely to be generated from these primitives. The sentence can be formed by searching for a sequence of words according to the probability of the sentence being generated from two modules given the motion primitive and the object primitive as

$$\begin{aligned} P({{\varvec{\omega }}} \vert \lambda _\mathcal{R}, \kappa _\mathcal{R}) = \prod _{i=1}^{l} P(\omega _i \vert \lambda _\mathcal{R}, \kappa _\mathcal{R}) \prod _{i=1}^{l-1} P(\omega _{i+1} \vert \omega _1,\ldots ,\omega _i).\nonumber \\ \end{aligned}$$

(17)

Here, sentence ${{\varvec{\omega }}}$ is expressed by word sequence $\omega _1, \omega _2, \ldots , \omega _l$. We assume that a set of words contained in the sentence depends on only the motion primitive and the object primitive, and that a sequence of words depends on only the set of words. The probability of each word being generated from the motion primitive and object primitive can be computed by using Eqs. 3 and 4.

$$\begin{aligned} P(\omega \vert \lambda _\mathcal{R}, \kappa _\mathcal{R}) = \sum _s P(\omega \vert s) P(s \vert \lambda _\mathcal{R}, \kappa _\mathcal{R}) \end{aligned}$$

(18)

The probability of word sequence $\omega _1, \omega _2, \ldots , \omega _i$ being followed by word $\omega _{i+1}$ can be computed by the recurrent neural network. Taking the logarithm of $P({{\varvec{\omega }}} \vert \lambda _\mathcal{R}, \kappa _\mathcal{R})$ and using Dijkstra’s algorithm, we can search for the sentence that is the most likely to be generated from the motion and object.

4 Experiments

Our proposed approach is tested to see how well it generates descriptions from observations of human actions. The observation data are collected by using an RGB-D sensor (Kinect, Microsoft Corporation) and contain human whole body motions and objects to be manipulated. The RGB-D sensor measures the positions of 25 joints in the whole body, as shown in Fig. 6. These positions are converted to the positions of 34 markers that are attached to a human character with 34 degrees of freedom. The attachment follows the Helen Hayes marker set placement (Kadaba et al. 1990). The positions of the 34 markers in the character’s trunk coordinate system and the distances between these markers and an object to be manipulated together express the human whole body motion to be used for motion primitives. An image from the RGB-D sensor is segmented into an object region. The local features are extracted from the object region, and then global features are computed for the object primitives. We measure actions by three performers, and 320 observations are collected for each of these three performers, giving 960 observations in total. Motion and object data contained in these datasets can be grouped into 24 motion primitives and 30 object primitives. Additionally, five students attached one sentence descriptive of each action. Figure 7 shows several samples of the action and the manually attached descriptions. The dataset contains 960 sentences, with 335 different words. The dataset is grouped into a training dataset containing 576 actions and a test dataset containing 384 actions.

Figure 8 qualitatively shows the experimental results. Three sentences that are most likely to be generated from each observation are displayed. The observation of “blowing the nose" is described by three sentences: “a person blows their nose with a tissue”, “a person blows their nose with tissue paper”, and “a person is blowing their nose”. The observation of “sweeping” is expressed by sentences “a person is sweeping the floor with a broom”, “they are cleaning the floor” and “they clean the floor with a broom”. The observation of “picking” can be described as sentences: “a person picks up the box on the bottle” “they pick up the box on the bottle" and “a person picks up a box”. The first sentence is same as the training sentence as shown in Fig. 7. The observation of “drinking" generates the sentences “a person drinks a bottle of tea”, “a person is drinking a bottle of tea" and “a person drinks out of a bottle”; these are similar to sentences attached to the action of “drinking”, as shown in Fig. 7. Other observations are also described by qualitatively correct sentences.

We also quantitatively test the sentence generation. In the first test, up to five sentences are generated from each test observation. When the generated sentence is the same as the sentence attached to the test observation, this sentence is counted as correct. This is the 5-best condition; more generally, for the m-best condition, the number of generated sentences is set to m, and if any of the generated sentences is correct, the generation is counted as correct. The correct ratios are 0.71, 0.84, 0.89, 0.91, and 0.92 for the 1-best, 2-best, 3-best, 4-best, and 5-best conditions, respectively.

It is important to evaluate the improvement by adding the object to be manipulated for the sentence generation. We removes the layer of the object primitives from the module as shown in Fig. 3. More specifically, we tested the sentence generation only from the human motion. The correct ratios of the sentence generation are 0.54, 0.54, 0.55, 0.59, and 0.59 for the 1-best, 2-best, 3-best, 4-best, and 5-best conditions, respectively. The comparison of these correct ratios with those derived in our proposed framework demonstrates that the information of the objects is effectively used to generate sentences from the action observations. Additionally, 9944 sentences with 1174 different words attached to the human action are crowdsourced. After training 576 human actions and 5984 sentences attached to these actions, we tested the sentence generation. Multiple sentences are attached to each human action. When the generated sentence is the same as one of the sentences attached to the test observation, this generation is counted as correct. The correct ratios of the sentence generation are 0.67, 0.69, 0.78, 0.84, and 0.90 for the 1-best, 2-best, 3-best, 4-best, and 5-best conditions, respectively.

5 Conclusions

This research is summarized as follows.

1.
We proposed a framework for linking human actions (consisting of human whole body motions and objects to be manipulated) with sentences describing the actions. For this, the human whole body motion and positional relation between the body and the object are encoded into a motion primitive; also, an object feature is extracted from an object region in a captured image and is then encoded into an object primitive. A pair of motion primitive and object primitive is stochastically connected to words relevant to the action. Additionally, the dynamics of word sequences in sentences descriptive of the actions is trained by a recurrent neural network, which can predict that word that is likely to follow a sequence of words.
2.
We linked two modules: a stochastic model between motions, objects, and words; and a recurrent neural network for the sentence structure. The link makes it possible to search for the sentences that are most likely to be generated from the observation of human action. Specifically, the recurrent neural network efficiently generates the sentence whose words are most likely to be generated from a given motion primitive and object primitive in the stochastic model. This link implies that the observation of the human action can be interpreted by considering corresponding descriptive sentences.
3.
We constructed a stochastic model to describe the relations between motions, objects, and words, and a recurrent neural network to describe the sentence structures. Each was trained against a dataset containing 576 pieces of data for triples comprising a human motion, an object, and a (manually chosen) sentence. We conducted an experiment with the stochastic model and the neural network, generating sentences for 384 test observations. We qualitatively and quantitatively confirmed that our proposed method can generate correct sentences from observations of human actions.

References

Abdel-Kalim, A. E., & Farag, A. A. (2006). Asift: A sift descriptor with color invariant characteristics. In Proceedings of 2006 IEEE Computer society conference on computer vision and pattern recognition, Vol. 2, pp. 1978–1983.
Bengio, Y., Schwenk, H., Senécal, J., Morin, F., & Gauvain, J. (2006). Neural probabilistic language models. In Innovations in machine learning, pp. 137–186.
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1), 281–305.
MathSciNet MATH Google Scholar
Cheng, G., Hyon, S. H., Morimoto, J., Ude, A., Hale, J. G., Colvin, G., et al. (2007). CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21(10), 1097–1114.
Article Google Scholar
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
MATH Google Scholar
Gupta, A., Kembhavi, A., & Davis, L. S. (2009). Observing human–object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 1775–1789.
Article Google Scholar
Haruno, M., Wolpert, D., & Kawato, M. (2001). Mosaic model for sensorimotor learning and control. Neural Computation, 13, 2201–2220.
Article MATH Google Scholar
Ho, E. S., & Shum, H. (2013). Motion adaptation for humanoid robots in constrained environments. In Proceedings of IEEE international conference on advanced robotics, pp. 3813–3818.
Ho, E. S., Komura, T., & Tai, C. (2010). Spatial relationship preserving character motion adaptation. In ACM transactions on graphics, Vol. 29, p. 33.
Inamura, T., Toshima, I., Tanie, H., & Nakamura, Y. (2004). Embodied symbol emergence based on mimesis theory. International Journal of Robotics Research, 23(4), 363–377.
Article Google Scholar
Kadaba, M. P., Ramakrishnan, H. K., & Wootten, M. E. (1990). Measurement of lower extremity kinematics during level walking. Journal of Orthopaedic Research, 8(3), 383–392.
Article Google Scholar
Kaneko, K., Kanehiro, F., Kajita, S., Yokoyama, K., Akachi, K., Kawasaki, T., et al. (2002). Design of prototype humanoid robotics platform for HRP. In Proceedings of the 2002 IEEE/RSJ international conference on intelligent robots and systems, Vol. 3, pp. 2431–2436.
Karpathy, A., & Fei-Fei, L. (2015). Deep visual semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 664–676.
Article Google Scholar
Kojima, A., Tamura, T., & Fukunaga, K. (2002). Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2), 171–184.
Article MATH Google Scholar
Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Saenko, K., & Guadarrama, S. (2013). Generating natural-language video descriptions using text-mined knowledge. In Proceedings of the 27th AAAI conference on artificial intelligence, pp. 541–547.
Kulkarni, G., Premraj, V., Dhar, S., & Li, S. (2011). Baby talk: Understanding and generating simple image descriptions. In Proceedings of IEEE conference on computer vision and pattern recognition, pp. 1601–1608.
Kuniyoshi, Y., Inaba, M., & Inoue, H. (1994). Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10(6), 799–822.
Article Google Scholar
Kuroki, Y., Fujita, M., Ishida, T., Nagasaka, K., & Yamaguchi, J. (2003). A small biped entertainment robot exploring attractive applications. In Proceedings of the IEEE international conference on robotics and automation, Vol. 1, pp. 471–476.
Li, S., Kulkarni, G., Berg, T. L. , Berg, A. C., & Choi, Y. (2011). Composing simple image descriptions using web-scale n-grams. In Proceedings of the 15th conference on computational natural language learning, pp. 220–228.
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on Computer vision, Vol. 2, pp. 1150–1157.
Mataric, M. J. (2000). Getting humanoids to move and imitate. IEEE Intelligent Systems, 15(4), 18–24.
Article Google Scholar
Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048.
Morimoto, J., & Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36(1), 37–51.
Article MATH Google Scholar
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, Vol. 77, pp. 257–286.
Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews, 2, 661–670.
Article Google Scholar
Saussure, F. D. (1966). Course in general linguistics. New York: McGraw-Hill Book Company.
Google Scholar
Sugano, S., & Kato, I. (1987). Wabot-2: Autonomous robot with dexterous finger-arm–finger-arm coordination control in keyboard performance. In Proceedings of 1987 IEEE international conference on robotics and automation, Vol. 4, pp. 90–97.
Takano, W., & Nakamura, Y. (2015). Symbolically structured database for human whole body motions based on association between motion symbols and motion words. Robotics and Autonomous Systems, 66, 75–85.
Article Google Scholar
Takano, W., & Nakamura, Y. (2015). Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions. International Journal of Robotics Research, 34(10), 1314–1328.
Article Google Scholar
Tani, J., & Ito, M. (2003). Self-organization of behavioral primitives as multiple attractor dynamics: A robot experiment. IEEE Transactions on Systems, Man and Cybernetics Part A: Systems and Humans, 33(4), 481–488.
Article Google Scholar
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of IEEE conference on computer vision and pattern recognition, pp. 3156–3164 .
Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In IEEE conference on computer vision and pattern recognition, pp. 1290–1297.
Yao, B., & Fei-Fei, L. (2012). Recognizing human–object interactions in still images by modeling the mutual context of objects and human poses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1691–1703.
Article Google Scholar
Yu, G., Liu, Z., & Yuan, J. (2014). Discriminative orderlet mining for real-time recognition of human–object interaction. In In Proceedings of Asian conference on computer vision, pp. 50–65.
Zhan, Q., Liang, Y., & Xiao, Y. (2009). Color-based segmentation of point clouds. In Proceedings of ISPRS laser scanning workshop, Vol. 38, pp. 248–252.

Download references

Acknowledgements

This research was partially supported by a Grant-in-Aid for Young Scientists (A) (No. 26700021) and Challenging Research (Exploratory) (No. 17K20000) from the Japan Society for the Promotion of Science.

Author information

Authors and Affiliations

Bunkyoku, Hongo, Tokyo, Japan
Wataru Takano, Yoshihiko Yamada & Yoshihiko Nakamura

Authors

Wataru Takano
View author publications
You can also search for this author in PubMed Google Scholar
Yoshihiko Yamada
View author publications
You can also search for this author in PubMed Google Scholar
Yoshihiko Nakamura
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wataru Takano.

Additional information

This research was partially supported by the Strategic Information and Communications R&D Promotion Program (No. 142103011) of the Ministry of Internal Affairs and Communications.

Appendices

Appendix A

The distribution of the local features in the training dataset is assumed to be expressed by a Gaussian mixture model (GMM) $\kappa $. The GMM consists of three kinds of parameters: mean vector ${{\varvec{\mu }}}_i$, covariance matrix ${{\varvec{\Sigma } }}_i$, and weight parameter $w_i$ for the ith Gaussian distribution. The likelihood of local feature ${{\varvec{u}}}$ being generated from GMM $\kappa $ is written as

$$\begin{aligned}&P({{\varvec{u}}} | \kappa ) = \sum _{i=1}^K P({{\varvec{u}}} | {{\varvec{\mu }}}_i,{{\varvec{\Sigma }}}_i,w_i)\end{aligned}$$

(19)

$$\begin{aligned}&P({{\varvec{u}}} | {{\varvec{\mu } }}_i,{{\varvec{\Sigma }}}_i,w_i)= {\displaystyle \frac{w_i}{\sqrt{(2 \pi )^d ||{{\varvec{\Sigma }}}_i ||}} } \nonumber \\&~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\times \exp \left\{ -\frac{1}{2} \left( {{\varvec{u}}} - {{\varvec{\mu } }}_i \right) ^T {{\varvec{\Sigma }}}_i^{-1} \left( {{\varvec{u}}} - {{\varvec{\mu } }}_i \right) \right\} ,\nonumber \\ \end{aligned}$$

(20)

where d is the number of dimensions of ${{\varvec{u}}}$ and the K is the number of Gaussian distributions. Note that $w_1$ can be removed from the parameter set because of the constraint

$$\begin{aligned} \sum _{i=1}^K w_i = 1 \end{aligned}$$

(21)

When N local descriptors, ${{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N$, are found in an object region, the likelihood of these local descriptors being generated from the GMM is calculated as

$$\begin{aligned} P({{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N \vert \kappa ) = \sum _{i=1}^N P( {{\varvec{u}}} | \kappa ). \end{aligned}$$

(22)

The Fisher vector, ${{\varvec{v}}}$, is defined as the gradient vector of the log-likelihood of $P({{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N \vert \kappa )$:

$$\begin{aligned} {{\varvec{v}}} = {{\varvec{F}}}^{-\frac{1}{2}} \nabla _{\kappa } P({{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N \vert \kappa ), \end{aligned}$$

(23)

where the elements of the gradient vector are

$$\begin{aligned}&\frac{\partial P({{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N \vert \kappa ) }{\partial w_k} = \sum _{i=1}^N \left( \frac{\gamma _k ({{\varvec{u}}}_i)}{w_k} - \frac{\gamma _k ({{\varvec{u}}}_1)}{w_1} \right) \end{aligned}$$

(24)

$$\begin{aligned}&\frac{\partial P({{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N \vert \kappa ) }{ \partial {{\varvec{\mu } }} _k } = \sum _{i=1}^{N} \gamma _k({{\varvec{u}}}_i) \left( {{\varvec{u}}}_i - {{\varvec{\mu } }}_k \right) ^T {{\varvec{\Sigma }}}_k^{-1}\end{aligned}$$

(25)

$$\begin{aligned}&\frac{\partial P({{\varvec{u}}}_1,{{\varvec{u}}}_2,\ldots ,{{\varvec{u}}}_N \vert \kappa ) }{ \partial {{\varvec{\Sigma }}} _k }\nonumber \\&\quad = \sum _{i=1}^N \frac{1}{2}\gamma _k({{\varvec{u}}}_i) \left[ \mathrm {Tr} \left\{ \left( ({{\varvec{u}}}_i - ({{\varvec{\mu } }}_k \right) \left( ({{\varvec{u}}}_i - ({{\varvec{\mu } }}_k \right) ^T \right\} \right. \nonumber \\&\qquad \times \left. {{\varvec{\Sigma }}}_k^{-2} - {{\varvec{\Sigma }}}_k^{-1} \right] . \end{aligned}$$

(26)

Note that ${{\varvec{\Sigma }}}_k$ is assumed to be a diagonal matrix, and that $\gamma _k({{\varvec{u}}})$ is written as

$$\begin{aligned} \gamma _k({{\varvec{u}}}) = \frac{w_k P({{\varvec{u}}} | {{\varvec{\mu }}}_k,{{\varvec{\Sigma }}}_k,w_k)}{ P( {{\varvec{u}}} | \kappa ).} \end{aligned}$$

(27)

Here, ${{\varvec{F}}}$ is the Fisher information matrix, and is defined as

$$\begin{aligned} {{\varvec{F}}}= & {} E_{P({{\varvec{u}}}_1,\ldots ,{{\varvec{u}}}_N \vert \kappa )}\nonumber \\&\times \left[ \nabla _{\kappa } P({{\varvec{u}}}_1,\ldots ,{{\varvec{u}}}_N \vert \kappa ) \nabla _{\kappa } P({{\varvec{u}}}_1,\ldots ,{{\varvec{u}}}_N \vert \kappa )^T \right] , \end{aligned}$$

(28)

where $E[*]$ is the expectation value.

Appendix B

Human actions contain data about human whole body motions and objects. The whole body motion, including the relative position of the object to be manipulated, is encoded into motion primitive $\lambda $, and the image containing the object is encoded into object primitive $\kappa $. The sentence describing the human action is manually assigned to the action. The sentence is expressed by a sequence of words, $\omega $. Let the training dataset be $\left\{ \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_1,\ldots ,\omega ^{(i)}_{n_i} \right\} $, consisting of motion primitives, object primitives, and sentences. Then, the logarithm of the probability of the words $\omega ^{(i)}_1,\ldots ,\omega ^{(i)}_{n_i}$ in the sentence being generated from motion primitive $\lambda ^{(i)}$ and object primitive $\kappa ^{(i)}$ over the training dataset is written as

$$\begin{aligned} \Phi= & {} \sum _{i} \ln { P\big ( \omega ^{(i)}_1,\ldots ,\omega ^{(i)}_{n_i} \vert \lambda ^{(i)}, \kappa ^{(i)}\big ) }\end{aligned}$$

(29)

$$\begin{aligned}= & {} \sum _{i,j} \ln P\big (\omega ^{(i)}_j \vert \lambda ^{(i)}, \kappa ^{(i)}\big ), \end{aligned}$$

(30)

where we assume that the word depends on only the motion and the object. According to the marginal distribution of the latent node, Eq. 29 is rewritten as

$$\begin{aligned} \Phi = \sum _{i,j} \ln \sum _k P\big (\omega ^{(i)}_j, s_k \vert \lambda ^{(i)}, \kappa ^{(i)} \big ). \end{aligned}$$

(31)

This equation can be written as the expectation

$$\begin{aligned} \Phi= & {} \sum _{i,j} \ln \sum _k \tilde{P}\left( s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\right) \frac{P\big (\omega ^{(i)}_j, s_k \vert \lambda ^{(i)}, \kappa ^{(i)} \big )}{\tilde{P}\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \nonumber \\\end{aligned}$$

(32)

$$\begin{aligned}= & {} \sum _{i,j} \ln E_{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \left[ \frac{P\big (\omega ^{(i)}_j, s \vert \lambda ^{(i)}, \kappa ^{(i)} \big )}{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \right] . \end{aligned}$$

(33)

The lower limit of this expectation, $\Phi _L$, is given by the Jensen inequality.

$$\begin{aligned} \Phi _L = \sum _{i,j} E_{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \left[ \ln \frac{P\big (\omega ^{(i)}_j, s \vert \lambda ^{(i)}, \kappa ^{(i)} \big )}{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \right] \end{aligned}$$

(34)

From $\Phi $ and $\Phi _L$, the following equations are derived.

$$\begin{aligned}&\Phi - \Phi _L = \sum _{i,j} \left\{ \ln P\big (\omega ^{(i)}_j \vert \lambda ^{(i)}, \kappa ^{(i)}\big )\right. \nonumber \\&\qquad \qquad \qquad \left. - E_{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )}\left[ \ln \frac{P\big (\omega ^{(i)}_j, s \vert \lambda ^{(i)}, \kappa ^{(i)} \big )}{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \right] \right\} \nonumber \\&\quad = \sum _{i,j} \left\{ \ln P\big (\omega ^{(i)}_j \vert \lambda ^{(i)}, \kappa ^{(i)}\big )- E_{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )}\right. \nonumber \\&\qquad \left. \times \left[ \ln \frac{P\big (\omega ^{(i)}_j \vert \lambda ^{(i)}, \kappa ^{(i)} \big ) P\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j \big )}{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )} \right] \right\} \nonumber \\&\quad = \sum _{i,j} E_{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )}\left[ \ln \frac{\tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j\big )}{ P\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega ^{(i)}_j \big )} \right] \end{aligned}$$

(35)

Here, $P(s \vert \lambda , \kappa , \omega )$ is the estimated distribution of latent node s based on the model, and $\tilde{P}(s \vert \lambda , \kappa , \omega )$ is the true distribution of s. Equation 34 implies the Kullback–Leibler information $\textit{KL}\left( \tilde{P}(s \vert \lambda , \kappa , \omega ) || P(s \vert \lambda , \kappa , \omega ) \right) $ between these two distributions. The Kullback–Leibler information is nonnegative, and is zero only when these two distributions are the same. Therefore, estimating the true distribution, $\tilde{P}(s \vert \lambda , \kappa , \omega )$, as the model-based distribution, $P(s \vert \lambda , \kappa , \omega )$, yields zero for the Kullback–Leibler information. The E-step estimates the distribution $\tilde{P}(s \vert \lambda , \kappa , \omega )$ of the latent node such that the Kullback–Leibler information becomes zero.

After the E-step, the model parameters are iteratively optimized such that $\Phi $ increases incrementally. Let $\Phi ^{[t+1]}$ and $\Phi ^{[t]}$ be the objective functions given at the $(t+1)$th and tth iteration steps, respectively. The relation between $\Phi ^{[t+1]}$ and $\Phi ^{[t]}$ is

$$\begin{aligned}&\Phi ^{[t+1]} - \Phi ^{[t]} \nonumber \\&\quad = \Phi _L^{[t+1]} + \sum _{i,j} \textit{KL}\big ( \tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big ) || P^{[t+1]}\nonumber \\&\quad \big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big ) \big ) - \Phi _L^{[t]}\nonumber \\&\quad - \sum {i,j} \textit{KL}\big ( \tilde{P}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big ) || P^{[t]}\big (s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big ) \big ), \nonumber \\ \end{aligned}$$

(36)

where $\Phi _L^{[t+1]}$ and $\Phi _L^{[t]}$ are the lower limits of $\Phi ^{[t+1]}$ and $\Phi ^{[t]}$, respectively, and $ P^{[t+1]}(s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}) $, and $P^{[t]}(s \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)})$ are the estimated distributions of the latent node based on the models derived at the $(t+1)$th and tth iteration steps, again respectively. Estimation of the true distribution $\tilde{P}(s \vert \lambda , \kappa , \omega )$ of the latent node as the distribution $P^{[t]}(s \vert \lambda , \kappa , \omega )$ based on the model derived at the tth iteration step leads to the relation

$$\begin{aligned} \Phi ^{[t+1]} - \Phi ^{[t]} \ge \Phi _L^{[t+1]} - \Phi _L^{[t]} \end{aligned}$$

(37)

because the second and fourth terms in Eq.36 take positive and zero values, respectively. The search for the parameters that maximize $\Phi _L^{[t+1]}$ at the $(t+1)$th iteration step results in $\Phi ^{[t+1]}$ becoming larger than $\Phi ^{[t]}$. By assigning the distribution $P^{[t]}(s \vert \lambda , \kappa , \omega )$ to $\tilde{P}(s \vert \lambda , \kappa , \omega )$ in Eq. 34, $\Phi _L^{[t+1]}$ is rewritten as

$$\begin{aligned} \Phi _L^{[t+1]}= & {} \sum _{i,j,k} P^{[t]}\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big )\nonumber \\&\ln \frac{P\big (\omega _j^{(i)}, s_k \vert \lambda ^{(i)}, \kappa ^{(i)}\big )}{ P^{[t]}\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big )}. \end{aligned}$$

(38)

Ignoring terms that do not depend on the model parameters, the function $\hat{Phi}^{[t+1]}$ to be maximized can be written as

$$\begin{aligned} \hat{Phi}^{[t+1]}= & {} \sum _{i,j,k} P^{[t]}\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big ) \nonumber \\&\ln P\big (\omega _j^{(i)}, s_k \vert \lambda ^{(i)}, \kappa ^{(i)}\big ) \nonumber \\= & {} \sum _{i,j,k} P^{[t]}\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big )\nonumber \\&\quad \left[ \ln P\big (\omega _j^{(i) \vert s_k}\big ) + \ln P\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}\big ) \right] . \end{aligned}$$

(39)

The model parameters $P(\omega \vert s)$ and $P(s \vert \lambda , \kappa )$ to be optimized must satisfy the following constraints

$$\begin{aligned}&\sum _i P(\omega _i \vert s) = 1\end{aligned}$$

(40)

$$\begin{aligned}&\sum _k P(s_k \vert \lambda , \kappa ) = 1. \end{aligned}$$

(41)

The Lagrange function is obtained as

$$\begin{aligned} \mathcal{L}= & {} \sum _{i,j,k} P^{[t]}\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}, \omega _j^{(i)}\big )\nonumber \\&\left[ \ln P\big (\omega _j^{(i) \vert s_k}\big ) + \ln P\big (s_k \vert \lambda ^{(i)}, \kappa ^{(i)}\big ) \right] \nonumber \\&- \sum _k \alpha _k \left[ \sum _i P(\omega _i \vert s_k) - 1 \right] \nonumber \\&-\sum _{i,j} \beta _{ij} \left[ \sum _k P(s_k \vert \lambda _i, \kappa _j) - 1 \right] . \end{aligned}$$

(42)

The derivative of the Lagrange function with respect to $P(\omega _i \vert s)$ or $P(s \vert \lambda , \kappa )$ is zero at the optimal parameter, which is derived as

$$\begin{aligned}&P^{[t+1]}(\omega _i \vert s) = \frac{ {\displaystyle \sum \nolimits _{i,j}} n(\lambda _i, \kappa _j, \omega ) P^{[t]}(s \vert \lambda _i, \kappa _j, \omega ) }{ {\displaystyle \sum \nolimits _{i,j,k}} n(\lambda _i, \kappa _j, \omega _k) P^{[t]}(s \vert \lambda _i, \kappa _j, \omega _k) }\nonumber \\\end{aligned}$$

(43)

$$\begin{aligned}&P^{[t+1]}(s \vert \lambda , \kappa ) = \frac{ {\displaystyle \sum \nolimits _i } n(\lambda , \kappa , \omega _i) P^{[t]}(s \vert \lambda , \kappa , \omega _i) }{ {\displaystyle \sum \nolimits _{i}} n(\lambda , \kappa , \omega _i) } \end{aligned}$$

(44)

The M-step searches for the optimal parameters in this manner, and the EM algorithm alternates the E-step and the M-step.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Takano, W., Yamada, Y. & Nakamura, Y. Linking human motions and objects to language for synthesizing action sentences. Auton Robot 43, 913–925 (2019). https://doi.org/10.1007/s10514-018-9762-1

Download citation

Received: 06 March 2017
Accepted: 18 April 2018
Published: 02 May 2018
Issue Date: 01 April 2019
DOI: https://doi.org/10.1007/s10514-018-9762-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Linking human motions and objects to language for synthesizing action sentences

Abstract

Similar content being viewed by others

MotionCLIP: Exposing Human Motion Generation to CLIP Space

Motion2language, unsupervised learning of synchronized semantic motion segmentation

Classification of Multi-class Daily Human Motion using Discriminative Body Parts and Sentence Descriptions

1 Introduction