Sensorimotor Input as a Language Generalisation Tool: A Neurorobotics Model for Generation and Generalisation of Noun-Verb Combinations with Sensorimotor Inputs

The paper presents a neurorobotics cognitive model to explain the understanding and generalisation of noun-verb combinations when a vocal command consisting of a verb-noun sentence is provided to a humanoid robot. This generalisation is achieved through grounding: different objects are interacted with, and associated with, different motor behaviours, following a learning approach inspired by developmental language acquisition in infants. This cognitive model is based on Multiple Time-scale Recurrent Neural Networks (MTRNN). With the data obtained from object manipulation tasks with a humanoid robot platform, the robotic agent implemented with this model can ground the primitive embodied structure of verbs through training with verb-noun combination samples. Moreover, we show that a functional hierarchical architecture, based on MTRNN, is able to generalise and produce novel combinations of noun-verb sentences. Further analyses of the learned network dynamics and representations also demonstrate how the generalisation is possible via the exploitation of this functional hierarchical recurrent network.


Introduction
For the design of social robots [4,13], besides building robots with human-like external morphology, the ability to process, understand and generate language is one of the key factors supporting human-robot interaction. However, designing a robot's abilities to understand, generate and generalise natural language is still an open challenge for social robotics. In particular, natural language understanding plays an essential role in a social robotic system, as it maps the vocal commands of human users onto an internal representation in the robot's own cognitive system. In this study we follow a developmental robotics approach to the design of language and communication abilities in robots, following an incremental and interactive process of language learning, inspired by language development in infants.

Language Understanding for Robot Systems
Important recent developments in social robotics, such as robots performing human-like emotion expression [76] and social attention for autonomous movement [45], have been accompanied by language understanding approaches focusing on the grounding of natural language into the agent's sensorimotor experience and its situated interaction [5,60].
For instance, in [66,39], syntactic parsing techniques are used to ground the language into primitive motor actions (e.g., pickup, move, place), which can be inferred within graph models. Similarly, Misra et al. [43] developed a system for mobile robots which is able to learn to ground language instructions from a corpus of pairs of natural language including both verbs and spatial information. [74] proposed that, in order to understand the object affordances that can be described by adjectives, the most crucial properties are the shape-related ones.
Besides the direct modelling methods for robot language learning, an alternative approach to building a learning model for language is based on developmental robotics [71,1,10]. Taking inspiration from developmental psychology and developmental neuroscience studies, this approach emphasises the role of the environment and of the interactions that occur during learning, over a progression of learning stages. In the context of language understanding, the core of developmental robotics approaches to language learning is to follow a developmental pathway similar to that of infants, acquiring grounded representations of natural language and forming a symbol system through embodied interaction with the physical environment [6]. Furthermore, through language learning an agent should also be able to generalise by inferring un-trained combinations of words within the lexical constructions acquired.
Various developmental robotics models have been developed that incrementally model the stages of language acquisition in infants, from phoneme acquisition, to object and action names, to word combinations. For example, the cognitive model presented in [21] outlines the cortical interactions in the syllable generation process which result in different developmental phenomena. This mimics the first stage of language development. The Elija model [28] is a vocal apparatus which strictly follows detailed developmental stages. Working as an articulatory synthesizer, it first learns the production of sounds on its own. Then, with a caregiver, it learns through reinforcement learning to use speech sounds for object names, where the reward is again given by the response of the caregiver. Likewise, a self-organizing map together with reinforcement learning was proposed in [70], which demonstrated that reinforcement learning based on the similarity of vocalization can improve the post-learning production of the sounds of one's language.
From the models mentioned above, we can see that most of the methods for modelling the first stages of phonetic production tend not to use robotic platforms. On the other hand, for the modelling of the later stages of lexical development, after assuming that phonetic skills are mastered, robotic systems are usually employed to establish the metaknowledge about the association between vocal speech and the referents or the actions. Therefore, except for studies focusing on the mental imagination of actions as in [20], the mechanical morphology of a robot is particularly important when modelling the acquisition of words, especially those used to name motor actions. For instance, the model from [38] takes as input dance-like combinations of human movement primitives plus ambiguous labels associated with these movements. Concentrating on the second and third stages of associating lexicon, words and motor actions, the robot in [16] is able to acquire new motor behaviours in an on-line fashion by grounding the vocal commands on predefined control motor primitives. Similarly, Siskind [57] proposed a model which uses visual primitives to encode notions of different actions to ground the semantics of events for verb learning. Using structured connectionist models (SCMs), [11] built a layered connectionist model to connect embodied representations and simulative inference for verbs. In [8], the emergence of verb-noun separation is learned while the agents are interacting with and manipulating the objects.
[61] further developed the grounding of verbs with more complex meanings (such as "keep", "reject", "accept" and "give"), which relate to the internal states of the caregivers and which were used to build a robotic model for the grounding of increasingly abstract motor concepts and words. As follow-up studies of [16], [15,25] focused on the understanding of grammatical complexity. They used recurrent neural networks (RNN) to learn grammatical structure based on temporal series learning in artificial neural networks. Also using RNN, [62] reported experiments with a mobile robot implementing a two-level RNN architecture called Recurrent Neural Network with Parametric Bias Units (RNNPB). This allows the robot to map a linguistic command containing verbs and nouns into context-dependent behaviours corresponding to the verb and noun descriptions respectively. Compared with RNNPB, another kind of RNN architecture called the Multiple Timescale Recurrent Neural Network (MTRNN) is able to ground different scales of sensorimotor information into the hierarchical structure of sentences, such as the spelling of words [46] and words and sentences [26]. This kind of recurrent model provides a memory to store the spatial and temporal structure of the environment and the lexical structures. Given that RNNs can learn dependencies of arbitrary length in statistical structures and their context, the storage ability of RNNs outperforms that of most language learning models.
On a higher level, concerning the meta-learning principles in these learning models, some developmental studies of language have focused on the intrinsic motivation to learn to speak, not only through reinforcement learning, but also following child psychology evidence that language learning can be driven without explicit rewards from the caregivers. For instance, language commands can be acquired through learning from demonstration (LfD) [55], intrinsic plasticity [48] and evolution [59,14]. These intrinsic motivation capabilities are implemented through learning models which allow the agents to acquire communication skills through vocal interactions, besides the use of reinforcement learning techniques, such as the learning by demonstration model for grounding vocal commands into situated spatial information [18]; a comparative study for evolving robot language with situated information can be found in [49]. These models demonstrate different forces underlying language learning. From a mathematical perspective, those learning methods overcome the natural language learning bottlenecks of building compositionality of lexical structures and maximising the observed content [58] by means of statistical methods (LfD), optimal control (Reinforcement Learning) or other methodologies.

Cognitive Background and Motivation
In the developmental psychology studies focusing specifically on the learning of nouns and verbs, there is still an open debate about the learning stages and their relative temporal acquisition order. For the early stages of verb and noun learning, it is widely accepted that most of the common nouns are generally learned before verbs [19], by first connecting speech sounds (labels, nouns) to physical objects in view. However, some nouns which relate to context, such as "passenger", are learnt relatively late, only after "an extensive range of situations" (contexts or life phases) have been encountered [22], during which verbs may play a crucial role. The embodied learning of verbs and nouns is not tied to one single modality of sensory percepts: experiments done in [32] suggest that nouns are grounded in the intrinsic properties of an object, even under different movements and orientations, while verbs are accounted for by the movement path of an object. This distinction may be associated with the neuroanatomical distinction between the ventral and dorsal (what/where) visual streams, involved in the generation of nouns and verbs respectively. As [37] suggested, some nouns and verbs are more straightforward to learn because they can be accessed perceptually. On the other hand, some abstract words, either verbs or nouns, can only be learnt from a social and linguistic context.
For instance, while infants learn the word-gesture combination at the age of two, they associate the meaning of verbs with the meanings of the higher-order nouns [3]. Such verbs with complex meaning are obtained from both motor action and visual percept [36]. As summarised in [7,9], compared with the static object perception that is associated with simple nouns, early verb learning involves temporal dynamics from motion perception. Indeed, we assert that the learning processes of nouns and verbs (especially those with complex meanings) are not separated; there is a close relation between verb and noun development, during which the embodied sensorimotor information plays a crucial role. During this embodied development, both the perceptual system and the motor system contribute to language comprehension (e.g. [53,31,50,56]). These experiments also extend Piaget's proposal that language learning is a symbolised understanding process for dynamic actions, which is "a situated process, function of the content, the context, the activity and the goal of the learner".
The sensorimotor information is not the only mechanism acting as a learning tool for language acquisition. Conversely, recent research also proposes that language is a flexible and efficient system for symbolic manipulation that is more than a communication tool for our thoughts (e.g. [35,41,42]). Regarding the predictive effect of language on sensorimotor behaviours, vocal communication can be one of the sources that drives visual attention to become predictive, by making inferences as to the source-inferences [67]. In this process, language can trigger a predictive inference about the appearance of a visual percept, driving a predictive saccade [17]. Therefore, the sensorimotor system is affected by the inferences from the auditory modality or even from higher-level cognitive processes.
Following the hierarchical cognitive architecture proposed in [75], language understanding can be represented hierarchically, from the neural processes on the (lower) receptor level to the higher-level understanding which happens in the (higher) prefrontal cortex. Moreover, we will use a hierarchical recurrent neural architecture, as in [78,79,80], because the learning modalities of visual perception and motor actions can be represented as both spatial and temporal sequences, so that the recurrent connections make it possible to intertwine these two modalities. In this paper, the MTRNN model will be employed to ground the features from different modalities with language structures in different time-scales. Similar RNNPB [62] or MTRNN [23] networks have been used to learn verb and noun features with motor actions and visual features. The proposed model will use a single MTRNN model to learn both the sensory and motor information in a single set of sequences. We regard perception and action as having inseparable links (e.g. [72,44]), so they should be encoded as similar data structures. Moreover, the training of such a large MTRNN has become more and more feasible in recent years due to the accessibility and affordability of GPU computing. Therefore, the two modalities of our MTRNN can be conceptualised at the same time over the embodied sensorimotor experiences towards abstract and compact representations on the higher level of this hierarchy, similarly to the developmental processes of language conceptualisation and categorisation.

The Multiple Timescale Recurrent Neural Network Model
This model is based on the combination of an MTRNN network with Self-Organizing Maps (SOMs) to control the humanoid robot iCub, which is trained to understand a set of noun-verb combinations in order to perform a variety of actions with different objects. Fig. 1 shows the learning architecture [73] and the self-organizing maps. The core module of the system is the MTRNN, which will learn sequences of verb-noun instructions and will control the movement of the robot in response to such instructions. The inputs to the MTRNN correspond to the language command inputs, the visual inputs and the proprioceptive inputs. We regard these three modalities as a whole sensorimotor input because the MTRNN model is able to learn the relation between verbs, nouns and seen objects within the context of the non-linearity of the sensorimotor sequences in a hierarchical manner. This network will learn this non-linearity in the functional hierarchy in which the neural activities are self-organised, exploiting the spatiotemporal variations.

Using a Self-organizing Map as a Sparse Structure
The initial input data sets, consisting of speech, camera images and proprioceptive states from kinaesthetic imitation, are pre-processed (see Eqs. 1-4) using three SOMs, one each for the linguistic, visual and motor input modalities.
Although the MTRNN could be trained with the original data representation, we usually employ pre-processing modules for the MTRNN inputs, which result in a sparse structure of the weight matrices in the network. The MTRNN outputs are likewise decoded into the original data structures. The sparseness in the weight matrices is conceptually similar to sparse coding in computational neuroscience [47]. The weight matrices are sparsely distributed, which is analogous to the sparse distributed representations that are used in our brain, such as in the visual [68] and auditory cortex [54]. Previous research on language learning in RNN [2] also showed that a sparse encoding results in more robust training, better generalisation and improved robustness to noisy inputs.
Here the sparse structure in the weight matrices is given by the SOMs [34]. During this process, the SOM acts as a dimensional mapping function, with an output space of higher dimension than the input space. Having a discretised and distributed neural encoding in the output space, the pre-processing SOM modules are able to reduce the possible overlap of the original data within the original input space. Therefore, the topological homomorphism produced by the SOM guarantees that the raw training vectors and the resulting input vectors are topologically similar to each other.
In the SOM training here, the input vectors are mapped to an output space whose coordinates define the output topology of the SOM. Connecting the input and output spaces, each neuron $j$ has a weight vector $w_j$, where $n$ is the total number of those neurons. When a self-organising map receives an input vector $x$, the algorithm finds the neuron whose weights are most similar to the input vector. Similarity is usually measured with the Euclidean distance, which, for weight vectors of equal norm, is mathematically equivalent to finding the neuron with the largest inner product $w_j^{\top} x$.
The neuron that is the most similar match for the input vector is referred to as the best matching unit (BMU), here neuron $c$. The dimensionality mapping is achieved when the BMU coordinates are used to update the weights of the neighbourhood neurons around neuron $c$, driving them closer to the input vector at iteration $t$, where $\delta$ is a neighbourhood function of the distance from the BMU. Therefore, the output of the SOM, which is encoded in a high-dimensional space, is still able to preserve the topological properties of the input space due to the use of the neighbourhood function.
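The BMU search and neighbourhood update described above can be illustrated with a minimal Python sketch. It assumes a Gaussian neighbourhood function and a fixed learning rate `lr`, neither of which is specified in the text; the function names and the toy one-dimensional map are hypothetical.

```python
import math

def best_matching_unit(weights, x):
    """Index of the weight vector closest to x in Euclidean distance."""
    dists = [math.dist(w, x) for w in weights]
    return dists.index(min(dists))

def som_update(weights, grid, x, lr=0.5, sigma=1.0):
    """One SOM step: pull every neuron towards x, scaled by a Gaussian
    neighbourhood (an assumption) of its grid distance from the BMU.
    Updates `weights` in place and returns the BMU index."""
    c = best_matching_unit(weights, x)
    for j, w in enumerate(weights):
        d2 = sum((a - b) ** 2 for a, b in zip(grid[j], grid[c]))
        delta = math.exp(-d2 / (2.0 * sigma ** 2))
        weights[j] = [wi + lr * delta * (xi - wi) for wi, xi in zip(w, x)]
    return c

# Toy map: a chain of three neurons with two-dimensional weights.
grid = [(0,), (1,), (2,)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
x = [0.9, 1.0]
before = math.dist(weights[2], x)
c = som_update(weights, grid, x)
after = math.dist(weights[2], x)
```

In this toy case the third neuron is the BMU and moves closer to the input, while its neighbours move by a smaller amount determined by their grid distance.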

Multiple Timescale Recurrent Network (MTRNN)
As shown in Fig. 2, the neurons in the MTRNN form three layers: an input-output layer (IO) and two context layers called Context fast ($C_f$) and Context slow ($C_s$). In the following text, $I_{IO}$ denotes the indices of the neurons at the input-output layer, $I_{C_f}$ those of the neurons at the context fast layer and $I_{C_s}$ those of the neurons at the context slow layer. The neurons on a layer are fully connected to all neurons within the same and adjacent layers, as shown in Fig. 1. The fast and slow context layers and the input-output layer differ in their time constants $\tau$, which determine the speed of adaptation, given a time sequence with a specific length, when the neural activity is updated. The larger the value of $\tau$, the slower the neuron adapts. This difference in adaptation rate lets the neurons assemble features of the input sequences at various timescales. Therefore, given the previous states $S(0), S(1), \dots, S(t)$, their spatiotemporal features will be self-organised on different levels of the network. So the MTRNN is not only a continuous time recurrent neural network that can predict the next state $S(t+1)$ of the time sequence, but its internal state also acts as a hierarchical memory that preserves the temporal features of the non-linear dynamics at different timescales.

Learning
In general, the training of the MTRNN follows the updating rule of classical firing rate models, in which the activity of a neuron is determined by the average firing rate of all the connected neurons. Additionally, the neuronal activity decays over time following the updating rule of the leaky integrator model. Therefore, when the time-step $t > 0$, the current membrane potential of a neuron is determined both by the previous activation and by the current synaptic inputs, as shown in Eq. 6, where $u_{i,t}$ is the membrane potential of the $i$-th neuron, $x_{j,t}$ is the activity of the $j$-th neuron at the $t$-th time-step, $w_{i,j}$ represents the synaptic weight from the $j$-th neuron to the $i$-th neuron and $\tau$ is the time scale parameter which determines the decay rate of this neuron. As in the generic continuous time recurrent neural network (CTRNN) model, the parameter $\tau$ determines the decay rate of the neural activity: the activity of neurons with a larger $\tau$ changes more slowly over time than that of neurons with a smaller $\tau$. Assuming the $i$-th neuron has $N$ connections (i.e. the total number of neurons in the network is $N$), Eq. 6 can be transformed into an explicit sum over these $N$ connections. When the time-step $t = 0$, the membrane potential of the IO neurons is set to 0 and the context neurons are set to their initial states. The neural activity of a neuron is calculated in one of two ways, depending on the layer to which the neuron belongs: a sigmoid activation function is used for the context neurons, while the activities of the input-output neurons are calculated by the soft-max function. The soft-max activation function recovers a probability distribution similar to that of the SOM pre-processing modules. Therefore, this activation function results in a faster convergence of the MTRNN training.
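As a minimal illustration of the update described above, the following Python sketch uses the discrete-time leaky-integrator form commonly used for MTRNNs, $u_{i,t+1} = (1 - 1/\tau_i)\,u_{i,t} + (1/\tau_i)\sum_j w_{i,j} x_{j,t}$. This form, the omission of bias terms, the time constants and all function names are assumptions made for illustration rather than the exact equations of the model.

```python
import math

def softmax(u):
    """Numerically stable soft-max over a list of potentials."""
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mtrnn_step(u, x, W, tau):
    """Leaky-integrator update of all membrane potentials (assumed form).
    u[i]: potential, x[j]: activity, W[i][j]: weight j -> i, tau[i]: time constant."""
    n = len(u)
    return [
        (1.0 - 1.0 / tau[i]) * u[i]
        + (1.0 / tau[i]) * sum(W[i][j] * x[j] for j in range(n))
        for i in range(n)
    ]

def activations(u, io_idx, ctx_idx):
    """Soft-max over the IO neurons, sigmoid for each context neuron."""
    y = [0.0] * len(u)
    for i, p in zip(io_idx, softmax([u[i] for i in io_idx])):
        y[i] = p
    for i in ctx_idx:
        y[i] = sigmoid(u[i])
    return y

# Toy network: 2 IO neurons, 1 fast and 1 slow context neuron (illustrative tau values).
tau = [2.0, 2.0, 5.0, 70.0]
W = [[0.0] * 4 for _ in range(4)]   # zero weights isolate the decay term
u0 = [0.0, 0.0, 1.0, 1.0]
x0 = activations(u0, io_idx=[0, 1], ctx_idx=[2, 3])
u1 = mtrnn_step(u0, x0, W, tau)
```

With zero weights only the decay term acts, so the slow context neuron retains more of its potential after one step than the fast one, which is the property that separates the timescales.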
During training, the neurons of the MTRNN self-adapt their weight matrices as well as the internal states of the neurons on the context layers for the processing of the incoming time sequence. The purpose of the training is to minimize the error $E$, which is defined in this case by the Kullback-Leibler divergence between the desired and actual outputs, where $y^*_{i,t}$ is the desired neural activation of the $i$-th neuron at the $t$-th time-step, which acts as the target value for the actual output $y_{i,t}$. $E$ is minimized by back-propagation through time (BPTT).
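A small Python sketch of such an error measure is given below, assuming the usual form $E = \sum_t \sum_{i \in I_{IO}} y^*_{i,t} \log(y^*_{i,t} / y_{i,t})$. The function name and the clipping constant `eps` are hypothetical additions for numerical safety.

```python
import math

def kl_error(target, output, eps=1e-12):
    """Sum over time steps and IO neurons of y* log(y* / y).
    `target` and `output` are lists (time) of lists (neurons)."""
    total = 0.0
    for y_star_t, y_t in zip(target, output):
        for y_star, y in zip(y_star_t, y_t):
            if y_star > 0.0:          # terms with y* = 0 contribute nothing
                total += y_star * math.log(y_star / max(y, eps))
    return total

target = [[0.5, 0.5], [1.0, 0.0]]
perfect = kl_error(target, target)
worse = kl_error(target, [[0.9, 0.1], [0.6, 0.4]])
```

The error is zero when the output distribution matches the target at every time-step and grows as the two diverge.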
In the BPTT algorithm, the input of the IO neurons is calculated as a mixture, controlled by the value $r$ (called the feedback rate), of the previous output value $y$ and the desired value $y^*$ (Eq. 11). We use $r = 0.1$ during training and $r = 0$ during generation, which means that the network then generates the sequences autonomously.
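The role of the feedback rate can be sketched in Python as follows. The weighting $(1 - r)\,y + r\,y^*$ is an assumption inferred from the statement that $r = 0$ corresponds to autonomous generation; the function name is hypothetical.

```python
def io_input(prev_output, target, r):
    """Mix the previous prediction with the desired value using feedback rate r.
    r = 0 feeds back the prediction only (closed-loop generation)."""
    return [(1.0 - r) * y + r * y_star for y, y_star in zip(prev_output, target)]

y_prev = [0.2, 0.8]
y_star = [0.0, 1.0]
train_in = io_input(y_prev, y_star, r=0.1)   # training: 10% of the target is mixed in
gen_in = io_input(y_prev, y_star, r=0.0)     # generation: prediction only
```

Under this reading, a small non-zero $r$ during training nudges the input towards the target while still exposing the network to its own predictions.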
At the $n$-th iteration of training, the synaptic weights and the biases of neuron $i$ are updated according to Eqs. 12 and 13.
In Eq. 12 and Eq. 13, the partial derivatives with respect to $w$ and $b$ are summed over the whole sequence and determine the weight and bias changes respectively, and $\eta$ and $\beta$ denote the learning rates for the weight and bias changes. In particular, the term $\partial E / \partial u_{k,t}$ can be calculated recursively as in Eq. 14, where the derivative term is that of the sigmoid function defined by Eqs. 8 and 9. The term $\lambda_{i,k}$ is the Kronecker delta, which is 1 when $i = k$ and 0 otherwise.

Experiments
To examine the network performance, we recorded real-world training data from object manipulation experiments with an iCub robot [40]. This is a child-sized humanoid robot which was built as a testing platform for theories and models of cognitive science and neuroscience. Mimicking a two-year-old infant, this unique robotic platform has 53 degrees of freedom. Using the iCub, we set up a learning scenario in which a human instructor taught the robotic learner a set of language commands whilst providing kinaesthetic demonstrations of the named actions. The aim of these experiments was to evaluate the generalisation error with a large data-set. We were also interested in the mechanisms, especially the neural activities in the hierarchical architecture, which result in such a generalisation.

Experimental Setup
Fig. 3 shows the setup used in our experiments. The data set was obtained using the following steps:
1. Objects with significantly different colours and shapes were placed at 6 different locations in front of the iCub.
2. A vocal command was spoken by an instructor according to the visual scene that was perceived by the iCub. A complete sentence of the vocal command is composed of a verb and a noun. For instance, the command "lift [the] ball" was recognised by the speech recognition software Dragon Dictate (footnote 1), with which the corresponding verb and noun were recognised and then translated into two dedicated discrete values based on the verb and noun dictionaries (Tab. 1).
3. The built-in vision tracker of the iCub searches for a ball-shaped object based on the dictionary-generated values; the iCub uses its vision tracker system, which incorporates a visual segmentation algorithm, to track a particular type of object, rotate the joints of the head and neck and locate the object in the visual field.
4. Once the object is located, the iCub rotates its head and triggers the object tracking, which changes the encoder values of the neck and eyes.
5. Joint positions of the head and neck are recorded. The sequence recorder module of the iCub was used to record the sensorimotor trajectories while the instructor was guiding the robot by holding its arms to perform a certain action for each object.
6. The hand and torso joints rotate to certain angles to accomplish the lifting action toward the ball (with the human instructor during training, without the human instructor during execution).
The whole experimental setup used combinations of 9 actions and 9 objects. From these combinations, both the vocal commands (i.e. a complete sentence including a verb and a noun) and the sensorimotor sequences can be created. To the best of our knowledge, this 9 × 9 noun-verb scenario is one of the setups with the largest number of verb-noun combinations in grounded robot language experiments (e.g. [65,73]). We used such a large amount of data to test the combinatorial complexity and mechanical feasibility of this model, as well as to evaluate the generalisation ability and its internal non-linear dynamics when using such a large data-set. From an engineering point of view, after testing the feasibility of generalisation, it is also possible to apply this model in a real-world robot application. As mentioned before, each speech command was recognised and translated into two semantic command units. Using 9 discretised values for verbs and 9 for nouns, the semantic commands thus have 81 possible combinations. This translation was done according to the verb and noun dictionaries, as shown in Tab. 1. Since we used the visual object tracker in the iCub, the joints of the neck and eyes automatically represent the location of the particular object named in the vocal command. The movements of the joint angles in the torso are also recorded as the sequences of the motor actions. During the data recording, each recording sequence lasted 5 seconds and the encoder values of 41 joints were sampled at 50 ms intervals. Thus, the complete input vector of the data [...]
Three experiments were carried out and are described in the next subsections. In the first experiment, given the data set of 9 actions and 9 objects, we search the parameter space to find the best parameters for the network training. In the second experiment, the training and generalisation performance is shown for different types of manipulated data sets. In the third experiment, we further analyse the generalisation ability of the MTRNN network. All these experiments were run using a modified version of the Aquila software [51] on a GPU computer with one Tesla C2050 and two GeForce GTX 580 graphics cards.

Experiment 1 - Training Setting
In this section, we used the data set consisting of the complete 9 × 9 combinations (i.e. $N_v = 9$, $N_n = 9$), which include information about 6 different object locations. Thus the whole data-set contains 9 × 9 × 6 = 486 sequences, which were all used for training the network. The most distinct feature of the MTRNN, with respect to generic RNN or CTRNN networks, is that different neurons have distinct time constants $\tau$, which are also one of the key factors that determine the training performance of the network. In this experiment, we systematically varied these parameters in order to find the best parameter settings.
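The size of the data-set follows directly from the figures given in the text, as the short Python check below shows. The constant and variable names are hypothetical; the number of time-steps per sequence is derived from the stated 5-second recordings sampled at 50 ms intervals.

```python
N_VERBS, N_NOUNS, N_LOCATIONS = 9, 9, 6
SEQ_SECONDS, SAMPLE_MS = 5, 50

# Every verb-noun pair, and every pair at every object location.
combinations = [(v, n) for v in range(N_VERBS) for n in range(N_NOUNS)]
sequences = [(v, n, loc) for (v, n) in combinations for loc in range(N_LOCATIONS)]

# Number of samples in one recorded sequence.
steps_per_sequence = SEQ_SECONDS * 1000 // SAMPLE_MS
```

This gives 81 semantic commands, 486 training sequences and 100 time-steps per sequence.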
A parameter space is defined as ($\tau_s$, $\tau_f$, $N_{C_s}$, $N_{C_f}$), representing the time constants of the context slow layer and the context fast layer and the numbers of neurons on these two layers respectively. In order to minimise the effects of the randomness of the initialisation, a total of 3 training trials were done with the same training setting, as shown in Tab. 3 (cf. [24] and [27]). From these experiments, we can see that the numbers of neurons on the $C_s$ and $C_f$ layers were determined mainly by the dimension of the IO layer, but they generally kept a ratio from 1 : 4 to 1 : 3. To start, we set the parameters to the minimum values (70, 5, 20, 60) from previous research [73]. Then we scaled up the numbers of neurons on the context layers and adjusted the time constants. The less crucial parameters were kept constant: learning rates $\eta = 0.7$, $\beta = 0.7$, momentum = 0.9, weightRange = 0.025. The stopping criterion for the training process was that the error did not decrease by more than $10^{-6}$ within 100 consecutive iterations. From Tab. 3, we can see that the number of neurons on the context layers strongly affected the training performance of the network. By comparison, the time constants played a less significant role in the training error than the number of neurons did. The results also showed that the ratio between the numbers of $C_s$ and $C_f$ neurons should be kept at around 1 : 4 to 1 : 3.
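One possible reading of this stopping criterion is sketched below in Python: training stops once the error at the current iteration is not more than a threshold below the error 100 iterations earlier. The function name and this exact comparison are assumptions, since the text does not specify how the decrease is measured.

```python
def should_stop(errors, window=100, min_decrease=1e-6):
    """True when the error has not dropped by more than `min_decrease`
    over the last `window` iterations. `errors` holds one value per iteration."""
    if len(errors) <= window:
        return False                      # not enough history yet
    return errors[-window - 1] - errors[-1] <= min_decrease

improving = [1.0 / (k + 1) for k in range(150)]   # still decreasing
flat = [0.5] * 150                                # converged
```

A monotonically decreasing error curve keeps the training running, whereas a flat curve triggers the stop.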

Experiment 2 - Training Performance
From the previous experiment, we found that the best parameters for training the 9×9 verb-noun data-set are those shown in bold in Tab. 3, among which we selected (50, 5, 70, 100).
We then examined the training performance of the network under this parameter setting using different data-sets. To test the generalisation ability, these data-sets were manipulated: a subset of the combinations of actions and objects was removed from the training set and used as a validation set when testing the generalisation ability of the network.
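The construction of such manipulated data-sets can be sketched in Python as follows. The held-out pairs here are a hypothetical choice for illustration and are not the combinations listed in Tab. 4; the function name is likewise an assumption.

```python
def split_by_combination(sequences, held_out):
    """Split (verb, noun, location) records so that every location of a
    held-out verb-noun pair goes to the validation set."""
    train = [s for s in sequences if (s[0], s[1]) not in held_out]
    valid = [s for s in sequences if (s[0], s[1]) in held_out]
    return train, valid

sequences = [(v, n, loc) for v in range(9) for n in range(9) for loc in range(6)]
held_out = {(0, 0), (4, 4), (8, 8)}   # hypothetical pairs, not those of Tab. 4
train, valid = split_by_combination(sequences, held_out)
```

Removing a pair at all six locations ensures that the network never sees that verb-noun combination during training, so the validation set tests generalisation to the combination itself rather than to a new location.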
Detailed information about the manipulated data-sets is shown in Tab. 4, where a coloured number N indicates that the specific verb-noun combination was removed in the N-th data-set. The number of removed combinations increases from the first to the third data-set, so the difficulty of generalisation increases as well. In the second and the third data-sets, some of the removed combinations are adjacent to each other, which further increases the difficulty of generalisation. Using the parameter set (50, 5, 70, 100), the training curves (Fig. 4) show that the training converged. To further demonstrate the robustness of the generalisation ability on untrained sensorimotor sequences, the validation sets, which were not included in the training, were fed into the network. In this way we tested how the network responds to verb-noun combinations not used during training. Using the three MTRNNs trained on the three data-sets, we performed three generalisation experiments with the missing verb-noun combinations. In these experiments, only the first time step of each sequence was provided (i.e. r = 0 in Eq. 11), which includes the initial position of the torso, head and eye motors, as well as the vocal command. The network prediction was then used as the input at the next time step, forming a closed loop that completes the generation of the 100-step time sequence. The errors over the whole of the three training-sets, as well as those at different steps, are shown in Tab. 5. A more direct visualisation of the network performance can be found in Fig. 5, which displays three examples of generated time sequences for motor actions from the three MTRNNs. As calculated in Tab. 5, the training error became larger when the number of training samples was smaller. In particular, a larger error was found at the beginning of each time sequence, but the network then became stable and generated a stable motor trajectory with less error as time elapsed. Because of the errors in the generated trajectories, the resulting robot behaviours sometimes deviated from the original ones. However, in most cases the generated robot behaviours correctly followed the semantic commands (footnote 2).
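The closed-loop generation described above can be illustrated with a short sketch. The network here is a toy stand-in with random, untrained weights and a generic leaky-integrator update; the layer sizes, the time constant and all function names are assumptions made for illustration and do not reproduce the trained MTRNN or Eq. 11.

```python
import numpy as np

def closed_loop_rollout(step_fn, first_input, state, n_steps=100):
    """Generate a sequence by feeding each prediction back as the next input."""
    outputs = []
    x = first_input
    for _ in range(n_steps):
        x, state = step_fn(x, state)  # the prediction becomes the next input
        outputs.append(x)
    return np.stack(outputs)

def rms_error(generated, target):
    """Root-mean-square error between a generated and a reference sequence."""
    return float(np.sqrt(np.mean((generated - target) ** 2)))

# Toy stand-in for a trained network (hypothetical sizes and random weights).
rng = np.random.default_rng(0)
n_io, n_ctx, tau = 8, 16, 5.0
W_in = rng.normal(scale=0.3, size=(n_ctx, n_io))
W_rec = rng.normal(scale=0.3, size=(n_ctx, n_ctx))
W_out = rng.normal(scale=0.3, size=(n_io, n_ctx))

def toy_step(x, u):
    # Leaky integration: a larger tau makes the internal state change more slowly.
    u = (1.0 - 1.0 / tau) * u + (W_in @ x + W_rec @ np.tanh(u)) / tau
    y = 1.0 / (1.0 + np.exp(-(W_out @ np.tanh(u))))  # sigmoid output in (0, 1)
    return y, u

trajectory = closed_loop_rollout(toy_step, rng.uniform(size=n_io), np.zeros(n_ctx))
print(trajectory.shape)  # (100, 8)
```

Because only the first input is given, any early prediction error is fed back into the network, which is consistent with the larger errors reported at the beginning of each sequence.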

Generalisation Analyses
In this section, we focus on the problem of how the verb-noun generalisation ability of the MTRNN network is achieved.
As shown in the previous section, the network was able to "understand" the untrained verb-noun combinations in the sense that the generated time sequences for motor actions were close to the originals, so that the iCub robot could perform the action named in the vocal command. In an experiment with a similar aim, [62] reported that combining two hierarchical recurrent neural networks can also accomplish verb-noun generalisation for understanding combinatorial semantics in a situated environment. The model they used, the recurrent neural network with parametric bias units (RNNPB), has non-linear dynamics similar to those of the MTRNN: the non-linear dynamics are determined by a small number of neural units which act as bifurcation parameters for the whole system. In our case, in particular, the learning sequences contain a much larger number of dimensions (35) for the motor joint angles of the iCub movements than the motor sequences trained in [62]. These complex sequences result in bifurcations that occur hierarchically in the MTRNN structure.

Footnote 2: https://youtu.be/FOgKbJ-iEhM
From this point, we hypothesise that, after training, the MTRNN, or any other hierarchical RNN, separates its network dynamics for the different modalities in a self-organised way along the lexical categories of the vocal commands. This separation, under the constraints of the sensorimotor sequences, occurs on different levels of the hierarchical architecture using different strategies. The way the network presents such separation in the hierarchical dynamics is self-organised, and largely depends on the structure of the training data in the spatial and temporal domains. In our experimental setting in particular, after enough training, the synaptic weights linking a basic motor behaviour to the input dimension of the verb are reinforced, as the verb dominates a large portion of the spatio-temporal space in the sensorimotor sequences. A basic motor behaviour here means a motor action that belongs to a general definition such as "slide" or "touch", without a specific goal directing the action. This is similar to the mechanism triggered in our brain by hearing a verb: a specific area in the pre-motor cortex, corresponding to a certain motor action, fires when a particular verb is heard or said. The noun, in contrast, affects part of the sensorimotor outputs through its role of offsetting the basic motor behaviour into a specific goal-directed action.
In the following experiments, we will examine this hypothesis by means of manipulating data and visualising the training results.

Generalisation with Partial Inputs
In this subsection, we concentrate on comparing the results after the removal of different modalities. For the first part of the analysis, in order to obtain a more conclusive statement, we used two sets of verb-noun combinations, 9 × 9 and 3 × 3. The 3 × 3 data-set (Tab. 7) contains all the combinations of three actions and three objects, which were placed at 6 different locations. Tab. 6 was used for the vocal command discretisation. For the second part of the experiment, the visualisation of weights was done only with the 3 × 3 data-set, since its features are easier to observe and its basic principle extends readily to the 9 × 9 data-set.
For both parts of the experiment, in order to observe how the different lexical categories and the visual input affected the training results, especially the output sequences of the motor behaviours, different parts of the input data were removed:
1. No modification (base-line).
2. Remove the noun input (i.e. the first input unit was reset to zero).
3. Remove the verb input (i.e. the second input unit was reset to zero).
4. Remove the location of the visual object (i.e. the third to eighth units were reset to zero).
During the generalisation tests, the full untrained data was fed into the network. The training and generalisation errors of the motor output are compared in Tab. 8 and Tab. 9. From these two tables we can see that the removal of the verb resulted in a larger generalisation error than the other two removals, while the removal of the object location resulted in the lowest generalisation error.
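The four input conditions above can be expressed as a small masking helper. This is a sketch under stated assumptions: the unit indices follow the description in the text (first unit for the noun, second for the verb, third to eighth for the object location), while the condition names, the function name and the example array shape are hypothetical.

```python
import numpy as np

# Zero-based index ranges derived from the description in the text.
ABLATIONS = {
    "baseline": slice(0, 0),      # nothing removed
    "no_noun": slice(0, 1),       # first input unit
    "no_verb": slice(1, 2),       # second input unit
    "no_location": slice(2, 8),   # third to eighth input units
}

def ablate(inputs, condition):
    """Return a copy of `inputs` (time steps x units) with the selected units reset to zero."""
    out = np.array(inputs, dtype=float, copy=True)
    out[:, ABLATIONS[condition]] = 0.0
    return out

# Hypothetical sequence: 100 time steps, 10 input units, all set to 0.5.
sequence = np.full((100, 10), 0.5)
without_verb = ablate(sequence, "no_verb")
print(without_verb[0])
```

Working on a copy leaves the base-line sequence untouched, so all four conditions can be generated from the same recorded data.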
For the second part of the experiment, the main aim was to understand the effect of a particular input modality (e.g. lexical structure and visual input) on the whole network training, by visualising the weights. We conducted this experiment with a smaller data-set (3 × 3) than in the previous experiments, because a smaller number of weights is easier to present in a visualisation; a similar conclusion is expected to extend to the larger 9 × 9 data-set. Fig. 6 visualises the weight matrices, where neurons 0 to 703 are on the IO layer, neurons 704 to 764 are on the C_f layer and neurons 765 to 794 are on the C_s layer. The weight matrices in Fig. 6a, Fig. 6c and Fig. 6d look quite similar. In Fig. 6b, however, without the verb input, a large number of weights from the IO layer to the C_f layer remain untrained. To evaluate this observation quantitatively, Tab. 10 lists the 2-norm, i.e. the Euclidean distance, from each manipulated weight matrix to the base-line matrix. The 2-norm was calculated by:

$$ d(W_m, W_b) = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left( w^{m}_{i,j} - w^{b}_{i,j} \right)^{2}} $$

where W_m is the weight matrix after data manipulation, W_b is the weight matrix from the base-line experiment, d is the resulting distance, and w_{i,j} is the weight from the i-th neuron to the j-th neuron. Here n = 795, which is the total number of neurons.
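As a minimal sketch, the distance between a manipulated and a base-line weight matrix can be computed as below. It assumes that the "2-norm" in the text is the element-wise Euclidean distance over all n × n weights (the Frobenius norm of the difference); the function name and the small example matrices are hypothetical.

```python
import numpy as np

def weight_distance(w_manipulated, w_baseline):
    """Element-wise Euclidean distance between two weight matrices of equal shape."""
    diff = np.asarray(w_manipulated, dtype=float) - np.asarray(w_baseline, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Tiny hypothetical example; the matrices in the experiment are 795 x 795.
w_b = np.zeros((2, 2))
w_m = np.array([[3.0, 0.0], [0.0, 4.0]])
print(weight_distance(w_m, w_b))  # 5.0
```

The same value is returned by `np.linalg.norm(w_m - w_b)`, whose default for a 2-D array is the Frobenius norm.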
From the comparisons of weight matrices and the Euclidean distances, we further verified our hypothesis that the lexical structure of verbs plays a significant role in the training, since it is further grounded in the differences of motor action trajectories, which dominate a large spatio-temporal space of the sequences.

Internal Dynamics
In the previous analysis, we looked at the generalisation ability of the MTRNN. A preliminary conclusion is that the lexical structure of the verb plays a significant role in maintaining the convergence of the temporal sensorimotor sequences. In this section we are particularly interested in how the generalisation capabilities are brought about by the recurrently connected hierarchical structure. We believe that part of the answer can be found by observing the detailed neural activities on each context layer for a selection of different inputs. The neural activities were therefore examined using the 9 × 9 data-set, with a previously trained MTRNN with the parameter setting (50, 5, 70, 100).
The following figures show the PCA trajectories of the internal neural dynamics on the C_f (Fig. 7, Fig. 8 and Fig. 9) and C_s (Fig. 10, Fig. 11 and Fig. 12) layers. Since the complete 9 × 9 data-set contains 486 sequences, whose patterns can hardly be observed in a single figure, only a few samples are presented so that the PCA trajectories can be seen clearly. Fig. 7 and Fig. 10 show the selected PCA trajectories on the C_f and C_s layers. These trajectories mainly concern combinations of verb inputs and a few noun inputs. We can see that the verbs mainly determine the patterns of the trajectories, which implies that the processing of verbs mainly affects the temporal dynamics in the MTRNN.
The remaining figures show how differences in lexical structure and visual information lead to differences in the PCA trajectories. Fig. 8 and Fig. 11 show the PCA trajectories of the internal dynamics on the C_f and C_s layers with different noun inputs; Fig. 9 and Fig. 12 show the PCA trajectories with different object location inputs. On the C_f layer (Fig. 8), differences in nouns cause divergences at the beginning of the trajectories, but not at the end. The comparisons in Fig. 9 show that differences in visual input produce even smaller divergences, which occur mainly in the middle of the trajectories. On the C_s layer (Fig. 11 and Fig. 12), the divergences caused by nouns and visual inputs are smaller still: the C_s layer mainly encodes the information from the verbs.
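A PCA projection of context-layer activations of the kind plotted in these figures can be sketched as follows. The paper does not state how the PCA was fitted, so fitting one projection jointly over all sequences is an assumption, as are the function name, the array layout and the random stand-in activations.

```python
import numpy as np

def pca_trajectories(activations, n_components=2):
    """Project context-layer activations onto their leading principal components.

    activations: array of shape (n_sequences, n_steps, n_units).
    Returns an array of shape (n_sequences, n_steps, n_components).
    Assumption: one PCA is fitted jointly over all sequences and time steps.
    """
    n_seq, n_steps, n_units = activations.shape
    flat = activations.reshape(-1, n_units)
    centred = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    projected = centred @ vt[:n_components].T
    return projected.reshape(n_seq, n_steps, n_components)

# Random stand-in for recorded activations: 4 sequences, 100 steps, 20 units.
rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 100, 20))
trajectories = pca_trajectories(activations)
print(trajectories.shape)  # (4, 100, 2)
```

Using a shared projection places all sequences in the same coordinate system, so that trajectories for different verbs, nouns or locations can be compared directly in one plot.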
To summarise the MTRNN analysis, the model self-organises similar patterns on various levels for every sensorimotor sequence, reflecting the hierarchical structure of the vocal commands. In particular, differences between verb inputs result in larger divergences of the trajectories than differences in nouns or object locations. Due to the data structure of our input vectors, the IO layer represents the collection of individual words. With a slower adaptation rate than the IO layer, the C_f layer represents the grounded meaning of each verb, noun and piece of visual information. This grounding is learnt from all the temporal sensorimotor sequences. Similarly, using even more slowly changing neurons, the C_s layer represents the general motor behaviour (i.e. the verb) of the whole sensorimotor sequence.
Therefore, the C_f activation mainly represents the lexical structures (verbs and nouns). The visual location has a limited effect on the C_f activation, probably because the noun information already overlaps with the object information carried by the visual location. Since verbs are the main factor on the C_f layer, the same verbs are represented by similar patterns on the fast context layer in Fig. 7, Fig. 8 and Fig. 9. The difference caused by nouns can be observed at the beginning of the trajectories. It may correspond to the difference in robot behaviours at the beginning of the time sequences, caused by the neck and eye tracking before the actual hand movement starts. Compared with the C_f layer, the C_s activation changes even more slowly. It generally represents the motor behaviours; only the verbs are represented by different patterns.

Functional Hierarchy of RNN and its Bifurcation
It has been reported that quite a few RNN models based on functional hierarchy, such as RNNPB, MTRNN and conceptors [30], allow bifurcations to occur in the RNN dynamics. We give a brief discussion of how such a bifurcation happens. Assume we have a simple hierarchical RNN with an additional unit (which can be regarded as a simplified version of RNNPB), as depicted in Fig. 13. The system can be described by Eq. 16.
There are three fixed points in this network. After the network has been trained, i.e. the weights a, b and c are fixed, the coordinates of the fixed points depend only on the value of the PB unit. Furthermore, the coordinates of the fixed points [x_1, x_2, x_3] are first-order functions of the value of the PB unit (see the appendix for the detailed calculation). In other words, the coordinates of the fixed points further determine the domains of the different bifurcation properties. This is the reason why changing the value of the PB unit changes the qualitative structure of the non-linear dynamics of the network. The bifurcation explanation for the simplified RNNPB model can then be extended to other hierarchical RNNs such as the MTRNN, as they share a fundamentally similar theoretical foundation [64].
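The general idea that a bias-like parameter can change the qualitative structure of recurrent dynamics can be illustrated with a one-unit toy map. This is explicitly not the three-unit system of Eq. 16: the map x ← tanh(w·x + pb), the weight w = 2 and the tested bias values are assumptions chosen only for illustration, and the function name is hypothetical.

```python
import numpy as np

def count_fixed_points(w, pb, lo=-1.5, hi=1.5, n_grid=3000):
    """Count solutions of x = tanh(w * x + pb) on [lo, hi].

    Fixed points are located as sign changes of g(x) = tanh(w * x + pb) - x.
    An even n_grid on a symmetric interval keeps x = 0 off the grid, so the
    exact root at x = 0 for pb = 0 is still detected as a sign change.
    """
    x = np.linspace(lo, hi, n_grid)
    g = np.tanh(w * x + pb) - x
    return int(np.sum(np.sign(g[:-1]) != np.sign(g[1:])))

for pb in (0.0, 0.3, 1.0):
    print(pb, count_fixed_points(2.0, pb))
```

For this toy map with w = 2, small bias values give three fixed points and a sufficiently large bias leaves only one, so sweeping the bias moves the system across a bifurcation without changing any weight.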

Generalisation Ability of MTRNN
In our experiments, the MTRNN was trained with a particular input data structure: firstly, the language commands were recorded as auditory data and transformed into a discrete symbolic representation; secondly, the object locations and the motor behaviours were stored as the angles of motor joints. This structure is a simplified representation of the common coding theory, which proposes that perceptual inputs and motor actions share the same representational format within cognitive processes.
The neural dynamics in our MTRNN differ from those reported in [27] and [23]. Whereas the noun (or object perceptual input) is a significant factor in the dynamics of the context layers in those two examples, our network has minimised the effects of nouns or object perception. This is partly due to the input data structure, in which the motor joints of the iCub robot have a much larger number of dimensions than the visual perception input. Also, the spatial information about objects in our experimental setting is much easier to learn than our diverse motor behaviours. The generalisation here is more concerned with the inference of the symbolic meaning of a language command through the composition of neural dynamics. During training in a hierarchical network such as MTRNN or RNNPB, the neural connections between a particular type of sensorimotor sequence (motor angle changes due to differing behaviours) and visual perception are strengthened. In particular, for our 9 × 9 data-set, most of the network weights store the learning of motor actions.
Note that the generalisation of commands over verb-noun combinations is not the same as the generalisation usually studied in generic recurrent neural networks (e.g. [29,52,77]), where the network is expected to interpolate or extrapolate given a novel input value in either the temporal or the spatial domain. While generalising dynamical patterns by interpolation is a non-trivial task when training motor patterns in robots, our main concern is novel combinations in the context of lexicon acquisition. In our case, the learning of verbs and nouns results in the emergence of different dynamics that are mostly stored in different synaptic weights, and their combinatorial composition is thus realised by the non-linearity of the recurrent connections. Considering the different generalisation abilities of generic RNNs, RNNPB [33,77] and MTRNN [23], hierarchical RNNs appear particularly suitable for simultaneously producing flexible motor behaviour and language expression in real-world social robot experiments.

Thought Vectors and Further Development
A few machine learning methods have recently been proposed based on the encoder-decoder (ED) architecture [12], which has achieved strong performance in machine translation [63], image captioning [69], etc. The ED architecture usually consists of two recurrent neural networks. One deep RNN encodes a sequence of input vectors of arbitrary length into a fixed-length vector representation in a hierarchical way, while the other deep RNN decodes this representation into a target sequence of output vectors. The representation passed between the encoder and the decoder RNNs is called a "thought vector", which is claimed to represent the meaning of the sequence in a high-dimensional space. Such an architecture is trained by maximising the conditional probability of the target sequence. If the input sequence is denoted as (x_1, x_2, ..., x_T) and the corresponding output sequence is (y_1, y_2, ..., y_T') (T' does not necessarily equal T), the next symbol is generated by maximising Eq. 17.
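For orientation, the factorisation commonly used for this conditional probability in the sequence-to-sequence literature is given below. It is shown as an assumption about the general form of the objective, and the notation of Eq. 17 may differ; here v denotes the fixed-length vector produced by the encoder from (x_1, ..., x_T), and T' is the length of the output sequence.

```latex
p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)
  = \prod_{t=1}^{T'} p\left(y_t \mid v,\, y_1, \dots, y_{t-1}\right)
```

Each factor is the decoder's distribution over the next output symbol given the encoded input and the symbols generated so far, which is why the decoder can be run step by step at generation time.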
Generic RNNs are not able to approximate the probability of sequences of arbitrary length because of the vanishing gradient problem, but other RNNs, such as LSTM and BRNN (Bi-directional Recurrent Neural Networks), have been successfully employed to construct the ED architecture to "understand" (encode) and to "generate" (decode) temporal sequences. Furthermore, due to the recent popularity of parallel computation on GPUs, it has become possible to train and use such architectures to solve problems such as machine translation and image captioning.
As the MTRNN can also avoid the vanishing gradient problem, and larger MTRNNs can be implemented on GPUs, it is also possible to embed the MTRNN into the ED architecture. In fact, the slow context layer C_s already exhibits a feature similar to "thought vectors", using a stable neural vector to represent the basic profiles of motor actions and object instances (in our robotic experiment). The two also have similar bi-directional information flows, which allow the networks to recognise and to generate time sequences. Despite these similarities, compared with the LSTM, the MTRNN has other distinct features. First, the above experiments and other MTRNN experiments [23,27] have shown that the fast and slow context layers exhibit different dynamics that explicitly represent the relationship between verbs and nouns. The deep LSTM, in contrast, has not been reported to have similar dynamics. Second, unlike the static vector representation of the LSTM, the context layers allow a "slow" change through time, which is more realistic for an interactive environment, where it can be used to dynamically express the meaning of sentences and sensorimotor information.
Admittedly, the training of deep RNNs, e.g. LSTMs and MTRNNs, requires a large amount of computational effort. But the recent development of GPU computing provides an opportunity to construct and test such large-scale neural networks within a reasonable time and budget. The combination of the MTRNN, the concept of "thought vectors" and its embodiment in robotic systems will allow us to further explore issues such as:
1. Comparing the performance of MTRNN, LSTM and BRNN within the ED architecture, and examining their performance on robotic platforms.
2. Incorporating the robot motor action, as a natural temporal sequence, into the training of RNNs in the ED architecture, with connections to other modalities.

Conclusion
This paper presents a neurorobotic study on noun and verb generation and generalisation, using MTRNN networks with a large data-set consisting of vocal language commands, visual object data and motor action data. Although the generalisation abilities of hierarchical RNNs (RNNPB, MTRNN) have been reported in previous research, this is the first study to demonstrate their generalisation capability using such a large data-set, which enables the robot to learn to handle real-world objects and actions. These experiments showed that generalisation by the network is possible even with a large number of test-sets (9 motor actions and 9 objects placed in 6 different locations). This is particularly important because the recurrent connections between the verbs and nouns are associated with different modalities of the training data, and these associations are strengthened during embodied training through sensorimotor interaction. Detailed analyses of the robot's neural controller showed that the dynamics on different layers of the MTRNN are self-organised. These self-organised dynamics further constitute a functional hierarchical representation on different layers, which associates different lexical structures with different modalities of the sensorimotor inputs. The MTRNN showed how the embodied information about the verbs dominates a large portion of the network dynamics, since the proprioceptive information plays a significant role in the training sequences. As such, hierarchical RNNs such as the MTRNN are shown to be particularly beneficial for building a neurorobotics cognitive architecture for language learning in robotic systems, where the recurrent connections are able to self-organise and build associations between embodied information in different modalities and lexical structure information.

Fig. 4: Training Curves with Three Test-sets. (a) Training Curve with Test-set 1. (b) Training Curve with Test-set 2.

Fig. 5: Trajectory Generation. The generated trajectories (dotted) with 41 dimensions were plotted and compared with the original trajectories. Three test-sets were selected to

Fig. 8: Principal Component Analysis on the C_f neurons (with different nouns): Comparison of the PCA-processed neural activations shows that the sequences with different nouns differ at the beginning and in the middle of the trajectories.

Fig. 9: Principal Component Analysis on the C_f neurons (with different object locations): Comparison of the PCA-processed neural activations shows that the sequences with different visual inputs result in very little divergence in the trajectories, which mainly occurs in the middle of the trajectories.

Table 1: Dictionaries of verbs and nouns for the data sets. The instructor showed the robot different combinations of the 9 actions and 9 nouns. The actions and the objects are represented by two discretised values for the semantic command inputs, which range from 0 to 0.9. For instance, the command "lift [the] ball" is translated into the values [0.8, 0.2].

Table 2: Structure of the Training Data

Table 3: Training Error with Different Parameter Settings (C_s, C_f, N_Cs, N_Cf)

Table 5: RMS Error of the Generalisation Tests

Table 4: Some of the sequences containing particular semantic combinations of verbs and nouns were removed during training. The number i in a cell indicates that the combination was removed in the i-th training set for the generalisation experiments.

Table 6: Dictionary of verbs and nouns for the 3 × 3 data sets

Table 10: Euclidean Distances between Partial Input Matrices and Normal Training Matrix

Table 7: Removal of data in the 3 × 3 data-set. The number i in a cell indicates that the combination was removed in the i-th training set for the generalisation experiments.

Table 8: Removal of Part of the Input (3 verbs and 3 nouns)

Table 9: Removal of Part of the Input (9 verbs and 9 nouns)