Recognizing recurrent neural networks (rRNN): Bayesian inference for recurrent neural networks

Recurrent neural networks (RNNs) are widely used in computational neuroscience and machine learning applications. In an RNN, each neuron computes its output as a nonlinear function of its integrated input. While the importance of RNNs, especially as models of brain processing, is undisputed, it is also widely acknowledged that the computations in standard RNN models may be an over-simplification of what real neuronal networks compute. Here, we suggest that the RNN approach may be made both neurobiologically more plausible and computationally more powerful by its fusion with Bayesian inference techniques for nonlinear dynamical systems. In this scheme, we use an RNN as a generative model of dynamic input caused by the environment, e.g. of speech or kinematics. Given this generative RNN model, we derive Bayesian update equations that can decode its output. Critically, these updates define a 'recognizing RNN' (rRNN), in which neurons compute and exchange prediction and prediction error messages. The rRNN has several desirable features that a conventional RNN does not have, for example, fast decoding of dynamic stimuli and robustness to initial conditions and noise. Furthermore, it implements a predictive coding scheme for dynamic inputs. We suggest that the Bayesian inversion of recurrent neural networks may be useful both as a model of brain function and as a machine learning tool. We illustrate the use of the rRNN by an application to the online decoding (i.e. recognition) of human kinematics.


Introduction
Recurrent neural networks (RNN) have been used for many years now to augment nonlinear mappings with a dynamic representation (Pearlmutter, 1989;Williams and Zipser, 1989;Narendra and Parthasarathy, 1990;Jaeger, 2001;Maass et al., 2002), for example, for the classification of sensory input in machine learning. In computational neuroscience, RNNs are extensively used to investigate the dynamic properties of cortical networks (e.g. Buonomano and Maass, 2009;Legenstein and Maass, 2007), to model the measured activity of networks of neurons (e.g. Friston et al., 2003;Kiebel et al., 2006Kiebel et al., , 2009aSotero et al., 2007;Rodrigues et al., 2010) and more generally to model brain processes like perception, memory and attention (e.g. Elman, 1990;Miller and Cohen, 2001;Hamker, 2005). The recurrent connections of these networks capture two of the most prominent features of neuronal networks observed in the brain: Firstly, connections between two neurons are rarely unidirectional but more often bi-directional, potentially via more than one synapse. Secondly, neurons perform highly nonlinear operations, i.e. they transform their input to spiking output. Recurrent neural networks capture both these features where often the input (post-synaptic potentials) and out-put (action potentials) are replaced by summary measures, i.e. the post-membrane potential function and firing rate. In such a continuous-time RNN, each neuron (often called unit) performs a simple operation: In each moment in time, it applies a nonlinear function to the sum of its input and passes this on to other neurons. This simple mechanism can provide for extremely rich patterns of activity in each neuron, even with a network of small size. Literally thousands of contributions in computational neuroscience and machine learning are based on networks of these firing ratecoding units (Rabinovich et al., 2006;Cessac and Samuelides, 2007).
As powerful as RNNs are as a model class, recent contributions have questioned the usefulness of RNNs to explain neurobiological phenomena observed in real networks of neurons (Spruston, 2008;Mel, 2008;Debanne et al., 2011). Here, we suggest that a simple re-interpretation of the functional role of RNN dynamics leads to a novel and potentially more plausible account of what RNN units may compute: We suggest that neuronal networks serve as Bayesian decoders of dynamics caused by the environment. For example, in action observation, humans decode the kinematics of other people from visual input dynamics. For Bayesian recognition, one needs a so-called generative model for the hidden dynamics of the environment which cause sensory input to the brain. We suggest that RNNs are an ideal generative model for these hidden dynamics in our environment. The task of the recognition system is to decode the sensory input generated by the hidden RNN dynamics. To do this, we derive Bayesian update equations from the generative RNN model and call these 'recognizing RNN'. The difference to a standard RNN is that each unit in the recognizing RNN computes more sophisticated updates involving predictions and prediction error messages from other units in the network. Here we show that a recognizing RNN can decode real-world dynamics (human kinematics) and display several features which can also be observed with real neuronal systems, e.g., the online decoding of hidden dynamics in the environment, computation of predictions and prediction error, robustness to noisy input and fast adaptation to sudden changes in the environment. These features are not only general hallmarks of brain function but may, in principle, also be useful for machine learning applications for decoding dynamics in an online fashion.
In computational neuroscience, models of recurrently connected networks of neurons, which optimally estimate dynamically changing states from noisy observations, have recently been proposed (Rao, 2004;Denève et al., 2007;Natarajan et al., 2008;Wilson and Finkel, 2009;Boerlin and Denève, 2011). While these models provide important insights, results were reported for relatively restrictive conditions such as linear dynamics (Denève et al., 2007;Wilson and Finkel, 2009;Boerlin and Denève, 2011), discrete states (Rao, 2004;Denève et al., 2007;Boerlin and Denève, 2011), or a one-dimensional state (Natarajan et al., 2008;Wilson and Finkel, 2009;Boerlin and Denève, 2011). Although Natarajan et al. (2008) allow for nonlinear dynamics they assume knowledge of an ideal observer which provides an instantaneous error signal for learning of network connections. Similarly, reservoir computing approaches (Jaeger, 2001;Maass et al., 2002;Verstraeten et al., 2007) rely on a teaching signal which provides a desired output at every point in time during learning. In contrast, we propose an approach combining multi-dimensional, continuous-time hidden nonlinear dynamics where learning proceeds without an externally provided error signal. Our main contribution is to demonstrate that a recognizing RNN is well suited to recognize dynamic stimuli and may be used as a functional model for neuronal ensemble dynamics. In particular, we will illustrate this by showing that the prediction errors computed by a recognizing RNN provide sufficient information to discriminate dynamic stimuli, in an online fashion.
The present approach may also lead to a better understanding of the role of recurrently connected networks of neurons in the brain: predictive coding has been suggested as a theory for hierarchical processing in the brain in which different levels exchange prediction and prediction error messages (Mumford, 1996;Rao andBallard, 1997, 1999;). Rao and Ballard (1997) already described RNN-like dynamic models to implement predictive coding for static stimuli. The present approach can be seen as an extension to Rao and Ballard's original work to provide inference for dynamic stimuli by resorting to approximate inference methods for nonlinear, continuous dynamic models .
The remainder of the paper is organised as fol-lows. In Materials and Methods we (i) present RNNs as generative models, (ii) describe the Bayesian inference framework and (iii) show that dynamic updates of the posterior state critically depend on prediction error. We illustrate the rRNN approach using human motion capture data. In Results we demonstrate that the rRNN can successfully recognize human kinematics and discriminate between different walking styles based on the prediction error of rRNN units.

Materials and Methods
In the following, we will describe the two key elements of the present approach: a recurrent neural network (RNN) as a generative model of the sensory dynamics and the Bayesian inference framework to derive the update equations for a recognizing RNN (rRNN). Subsequently, we will apply the rRNN technique to the recognition of human kinematics, for which we describe the kinematic data and the rRNN settings.
To motivate the present approach, we will start with a brief summary of the conventional RNN technique as used in machine learning for classification of stimuli. Note, however, that it is not our aim to compare discrimination performance of conventional and recognizing RNNs. Rather, description of the conventional RNN is given as a reference for understanding the conceptual differences between the two approaches.

Conventional Recurrent Neural Network
The RNN technique has been used in many machine learning applications such as classification of static or dynamic stimuli, or time-series prediction. This approach has a long history which took off with the development of a supervised learning routine (Pearlmutter, 1989;Williams and Zipser, 1989). Recently, this learning approach has been complemented by the so-called reservoir computing technique (Jaeger, 2001;Maass et al., 2002). In general, in a conventional RNN, sensory units provide input, which drives the dynamics of the hidden units (see Fig. 1A). Output units readout the result of the dynamic computations based on a mixing of the sensory and hidden states.
An example of such a network is discussed in (Jaeger et al., 2007) where the continuous-time dynamics based on leaky-integrator units is given bẏ where x i is the state of hidden unit i ∈ {1, . . . , H}, y ∈ R I×1 are the states of the input (sensory) units, o ∈ R D×1 are the states of the readout units, W ∈ R H×H is a weight matrix defining the interaction between the H hidden units, similarly W in and W fb define the connections from the input to the hidden units and the (optional) feedback connections from the readout units, respectively. k i is a rate constant for unit i and a is the amount of leakage. The output states are determined by where V ∈ R D×(H+I) is a weight matrix. In a conventional RNN, the overall flow of information is from sensory to output units, because the RNN serves as a model for neuronal dynamics (hidden states) which are used to compute, e.g., a classification of the sensory input. We now use the same dynamics where we reverse the flow of information to model the generation of sensory dynamics by hidden states of the environment (e.g. body movements cause visual output dynamics).

Generative Recurrent Neural Network
Our overall aim is to construct a recognition system which can recognize its sensory observations based on its internal dynamics. For a Bayesian recognition system we require a dynamic generative model, for which we choose a RNN. This 'generative recurrent neural network' (gRNN) runs independently of any input and generates sensory data, i.e. observations. Note that, in comparison to a conventional RNN (Eq. 1), here the sensory units become the output of the network while no input units are defined (hence the missing units which acted as output in the conventional RNN). Consequently, the flow of information is reversed in the gRNN and its autonomously running hidden dynamics drive its sensory units (see Fig. 1B). In particular, we  RNN (rRNN). Each RNN has dynamic hidden units, but the overall direction of information flow differs (indicated by the grey triangles). The conventional RNN is designed to compute an output given sensory input. In contrast, the gRNN computes sensory states. Finally, the rRNN computes predictions (black arrows) and prediction error messages (red arrows) to recognize the hidden causes that generated the sensory input. define a gRNN aṡ where now V ∈ R D×H linearly translates hidden states x into sensory states y. This gRNN computes sensory output y as caused by a hidden, dynamic process defined by the RNN dynamics f (x). In the following section, we describe how a recognizing recurrent neural network (rRNN) is constructed from the gRNN using Bayesian inversion. This rRNN receives sensory observations (as in a conventional RNN, Eq. 1) and infers about the hidden states that caused these observations. Effectively, conventional RNN computations are aimed at doing the same (c.f. Fig. 1A,C); however, the update equations of an rRNN are explicitly derived for this recognition task.

Recognizing Recurrent Neural Network
Generative models like a gRNN (eqs. 3,4) can be used to derive a Bayesian inference system that recognizes input caused by the generative model. In the probabilistic setting, when observations and state transitions are noisy, or uncertain, Bayesian inversion of the generative model leads to updates of the hidden states which make them an optimal representation of the observations given the model. For example, the well-known Kalman-Bucy filter (Jazwinski, 1970) implements such a Bayesian inversion scheme for linear dynamic processes. The gRNN uses highly nonlinear dynamics (Eq. 3) and, therefore, we require approximate inversion schemes (Jazwinski, 1970;Wan and van der Merwe, 2001;Doucet et al., 2001;Daunizeau et al., 2009;Friston et al., 2010). Here, we derived the update equations using the D-step of Friston's dynamic expectation maximization (DEM) framework . This choice was based on our previous experience with inversion of continuous-time dynamic models using DEM (Kiebel et al., 2009b) but, in principle, other inversion schemes could be used as well. DEM uses generalised coordinates, local linearisation and point-estimates at strategically important positions. See the appendix for a high-level derivation of the algorithm and an explanation of generalised coordinates which are a dynamically extended representation of state variables, the use of which we indicate by a tilde in the subsequent for-mulas.
In the following, we will briefly describe the key computations performed by DEM. This description is aimed at giving an intuitive description of the update equations governing the rRNN and will allow interpretation of these updates in terms of prediction and prediction error messages.
The most important equation resulting from inversion with DEM describes the evolution of the posterior mode of the hidden states in generalised coordinates and is given bẏ The motion defined in this equation consists of two parts: 1) Dx which, in absence of other contributions, implements that the motion of the posterior mode follows its local trajectory represented in generalised coordinates using a derivative operator D and 2) the derivative of the variational energy V (x) with respect to hidden states which acts as a corrective force to make the motion consistent with the gRNN and the observations. With fixed parameters, the variational energy is the log-joint probability of observations (sensory states)ỹ and hidden statesx which defines the probabilistic gRNN. In particular, the variational energy is given by where c is a constant, θ is a vector consisting of all parameters of the model andf are the dynamic predictions defined by Eq. 3 in generalised coordinates. The last term illustrates that the updates are a dynamic form of maximum a posteriori estimation of hidden states. Gaussian distributions are assumed for the state transition and observation densities: where (I ⊗ V)x is the predicted sensory state given the hidden states as defined by Eq. 4 in generalised coordinates 1 , andΣ x andΣ y are the prior covariances of sensory and hidden states in generalised 1 I is the identity matrix with size equal to the number of used generalised coordinates and ⊗ is the Kronecker product coordinates, respectively. This leads to a simple interpretation of the posterior mode updates in terms of prediction errors. In particular, the gradients of these densities with respect to hidden states become where the prediction errors are defined as This means that the updates of the posterior hidden states follow the gradient of the prediction error with step sizes determined by the prediction error itself weighted by the prior precisions. The contribution from the prediction error on the sensory states,ε y , ensures that the sensory states are well explained by the hidden states while the contribution from the prediction error on the hidden states, ε x , ensures that the posterior dynamics of hidden states as encoded by the generalised coordinates is consistent with the learnt model dynamics. In particular, for the first generalised coordinate, the prediction error ensures that the posterior velocity corresponds to the learnt, noise free hidden unit dynamics as defined in Eq. 3. Conversely, we will argue below that a consistently large prediction error ε x provides evidence for an inconsistency between observed and learnt dynamics and can be used to discriminate among different dynamic stimuli. The question remains how the system got to know a suitable gRNN which generates specific sensory dynamics. In our experiments we let the system learn its generative model by adapting connectivity parameters W, V and rate constants k using an approach which was developed for the identification of dynamical (neural-mass) systems (Friston et al., 2003;Kiebel et al., 2009a) and is based on maximum a-posteriori estimation of the parameters (Friston, 2002;. See the appendix for details. Note that this initial learning step is not our main point in this paper; any learning approach that successfully learns hidden gRNN dynamics to represent a given dynamic stimulus could be used here (e.g. Wan and Nelson, 2001;Roweis and Ghahramani, 2001;Valpola and Karhunen, 2002;Doucet and Tadić, 2003;Archambeau et al., 2008;Daunizeau et al., 2009;Kantas et al., 2009;Lazar et al., 2009;Schön et al., 2011), but also standard RNN learning may be used, if the hidden state dynamics is assumed to be deterministic during learning.

Message passing in the rRNN
The updates defined by equations 5, 9, 10 and 11 can be interpreted as network dynamics based on messages sent by sensory and hidden units. Algebraically, this can be seen by exemplarily inspecting the observation density update equation, Eq. 10, for the first generalised coordinate of a single hidden unit i: where the sum over j runs over sensory units y j in generalised coordinates. Note that the partial derivative of the prediction error of sensory unit j with respect to the state of hidden unit i describes how a change in the state of unit i effects the prediction error of unit j. Therefore, the state update for hidden unit i is a weighted sum of these prediction error gradients where each element of this sum corresponds to a 'prediction error message' from a single sensory unit j. To compute the prediction error message a sensory unit first has to compute a prediction. This is done using the forward equation (4) of the gRNN which is a weighted sum of the hidden statesx where the weights are determined by the connectivity of the gRNN. In the following we call the elements of this sum 'prediction messages' which are sent from a hidden unit x i to a sensory unit y j . In summary, the update equations define a recognizing RNN (rRNN) where a hidden unit sends prediction messages to connected sensory and hidden units such that these can compute prediction error messages which are returned to the hidden unit to update its state (see also Fig. 1C). The updates resulting from the dynamics density, Eq. 9, follow the same logic, where the hidden unit x j takes the place of sensory unit y j . Each hidden unit, therefore, sends and receives two kinds of messages: prediction and prediction error messages.

Induced connectivity of the rRNN
The connectivity matrices W and V of the gRNN (Eq. 3, 4) are not necessarily the same as in the rRNN. Generally, the rRNN will have all connections of the gRNN plus the corresponding reciprocal connections, plus some additional ones. To see this, note that the prediction error messages in the rRNN in Eq. 13 are 0, when hidden unit i is not connected to sensory unit j, i.e., when hidden unit i has no direct influence on the computation of predictions in sensory unit j (then ∂[εy]j ∂xi = 0). Only sensory units which receive a connection from a hidden unit i in the gRNN will contribute messages containing the derivative of the prediction error. However, in the rRNN, sensory units j, which are not connected in the gRNN to a hidden unit i, may also contribute messages, containing only their prediction error, through the weights computation w j = [Σ −1 yεy ] j . In particular, if the j-th row of the sensory precision matrix,Σ −1 y , has nonzero entries in positions other than j, e.g., k, the weight of sensory unit j in the update equation (Eq. 13) depends on the prediction error of unit k. In this case, sensory unit k contributes to the update of hidden unit i, even though hidden unit i is not connected to sensory unit k in the gRNN. This means, that there is an additional connection from sensory unit k to hidden unit i in the rRNN.
In conclusion, only if the covariance matrixΣ y is diagonal, the connectivity matrix of sensory to hidden units in the rRNN will only contain those connections which are reciprocal to the hidden to sensory unit connections in the gRNN. Conversely, if there are off-diagonal entries inΣ y , there will be corresponding additional connections from sensory to hidden units in the rRNN, relative to the gRNN. The same considerations apply to the connectivity between hidden units. In summary, the connectivity of the rRNN directly follows from the gRNN, only if the units' states in the gRNN are a priori independent. For simplicity, this case is shown in Fig. 1C and used in the following simulations. Note that a diagonal covariance matrixΣ y is a natural assumption for the present data because we assume that the measurement noise is white and any correlation among observations is caused by the underlying dynamics which are modelled by the RNN dynamics.

Human Movement Data
We use human movements to demonstrate the properties of the rRNN in the experiments below. The kinematics of humans is highly dynamic and nonlinear through complicated interactions between individual joints. Kinematics, therefore, provides a good example of the kind of complex, dynamically changing, real-world stimuli which can be modelled using rRNNs. Here, we used three walks of the same subject, each of which expresses a different walking style (categorized as 'childish', 'depressed' and 'shy'; freely available from the CMU motion capture database, http://mocap.cs.cmu.edu, subject 142, motions 1, 5 and 19). We chose this particular subject because a large range of different movements were available among which we chose the selected walks because of their similar time-scales. The advantage of using motion capture data as compared to video is that we can focus on modelling the kinematics of the subject in terms of changing joint angles without the need to model detailed processing of visual information.
For each walk we removed the global translation of the body and computed the 3D positions of the joints and extremities for all time points. This removed potential 'jumps' introduced by the circularity of the joint angles. As a result we obtained a set of 30 points moving in 3D space (see Fig. 2 for an example). Subsequently, we selected four representative seconds of data starting when the left foot touched the ground for each walk and subsampled the data using 30 frames per second resulting in N = 120 data points per walk. These data covered roughly two footsteps for each movement. We then found a common, low-dimensional representation for the three walks using principal component analysis (of all three walks combined) which reduced the dimensionality from 90 dimen-sions to D = 5 (maintaining 95.5% of the original variance). Additionally, we scaled the coordinates of each walk such that the maximum absolute value in each dimension was 1 over all walks. In summary, we obtained for each of the three walks a sequential data set containing five trajectories each consisting of 120 time points, see Fig. 3.

Learning of generative RNNs
For each of the three walks, we constructed one gRNN by learning suitable parameters W, V and k (eqs. 3 and 4) so that the dynamics of each generative RNN replicated the movement data. Each RNN had five sensory units, each of which generated one of the scaled principal component coordinates. In initial tests, we found that a network with H = 12 hidden units was the smallest network which gave consistently acceptable learning results and we consequently used this network size in our experiments. These tests also showed that good learning results were obtained, if the hidden units were sparsely connected. In particular, we fixed 2/3 of all connections in W and 1/3 of all connections in V to 0. Other entries in W, the rate constants k and the initial vector of hidden unit states x(0) l were chosen randomly before learning while V was initially chosen to correctly predict the first data point of a walk given x(0) l . For details of this initialisation and the learning procedure see the appendix. Note that any learning procedure could have been used here. The main point made by this initial learning step is that a dynamic representation for each walk can be found using RNNs with few units.
The sensory state trajectories of the learnt gRNNs are shown in Fig. 3. Each of the three different walks was learnt well: The amount of variance explained for each walk was 99%, 97% and 97% for the childish, depressed and shy walks, respectively.

Results
Here we demonstrate the utility of Bayesian inference for RNNs for online recognition of dynamic stimuli. As a proof of principle we apply the approach to the multi-dimensional, nonlinear kinematics of a walking human. We will first show   While the fit between data and its gRNN replications was not perfect, it was sufficiently close such that the gRNN was an appropriate generative model for recognition (see Fig. 5).
that recognizing RNNs quickly and successfully recognize the hidden dynamics, i.e. decode the type of movement. Then, we will demonstrate that the prediction errors of the hidden units can be used to discriminate the three different walks. Finally, we will show that the rRNNs are robust against noisy observations and initial conditions. Note that all of the following experiments with the rRNN use the original motion capture data as observations.

Fast Recognition of Dynamics
In this section, we show that the rRNN can quickly start recognizing a movement online. In particular, we show that this 'quick response' is robust against the initial hidden states at the beginning of the recognition process. This robustness is obtained despite the fact that gRNNs have a large dependence on their internal initial conditions. This is because RNNs are in general rich dynamic models which are capable of simultaneously representing many different dynamic stimuli depending on their initial conditions (hidden states). We demonstrate this for the gRNN for the childish walk. This gRNN was initialised during learning with the state x(0) l . When this specific gRNN is started, after learning, in this state, the learnt shy walk is generated as shown in Fig. 3 (left). However, when we initialise the same gRNN with a state x(0) r = x(0) l + , which was perturbed by noise of the same size as the natural variability of the hidden states, it generated very different trajectories of sensory states y as well as hidden states x as shown in Fig. 4. In other words, for deviating starting conditions, the gRNN generates dynamics that look different from the learnt kinematics and, when plotted in motion capture space, can deviate severely from a natural walk.
In contrast, the rRNN based on this gRNN for the shy walk was robust against such differences in initial conditions. Even though we perturbed the rRNN initial states severely, the rRNN always switched rapidly to the appropriate dynamics which best described the sensory input of a shy walk. In other words, the prediction error updates of the hidden units forced the dynamics on a trajectory which predicted the observed walk. We depict a characteristic example of this quick response behaviour for the rRNN (childish walk) in Fig. 5  (A,B). After only one time step the rRNN accu-rately predicted all subsequent observations while hidden unit trajectories followed those typical for the learnt gRNN to a large extent. (Note that these results partially depend on an appropriate choice of the prior covariances, see Appendix C.) This means that the rRNN can represent the dynamic repertoire of the gRNN but, in addition, can rapidly switch to the specific dynamic regime that best explains the sensory input.

Discrimination of Dynamic Stimuli
After learning, we have three different rRNNs, each of which has learnt to predict one of the three walking styles childish, depressed and shy. Here, we will show that the prediction error, on observations ε y = y − Vx or hidden states ε x =ẋ − f (x) (Eq. 12), of all three rRNNs can be used to discriminate between the three different walks. In particular, we will show that the dynamic prediction errors ε x are smallest for the rRNN that has learnt a specific walking style. This means that a potential readout mechanism can use the relative amplitudes of prediction error of the three rRNNs to decide which of the three walks is currently observed. Fig. 5 (D-I) shows the response of the rRNN which learnt the childish walk, but now given the depressed and shy walks as observations. Although this rRNN did not learn these walks it represented them well by exploiting alternative dynamics embedded in the twelve unit network. However, the rRNN frequently had to use prediction error on its hidden states to explain away the remaining mismatch between internal predictions and actual input. See Fig. 5 (C,F,I) for this relative increase in prediction error in response to the non-learnt depressed and shy walks. This increase in prediction error when recognizing the two non-learnt walks is consistent over the three different rRNNs and may be used to discriminate dynamic stimuli as shown in Fig. 6 (D-F). For each of the three walks the prediction error was smallest for the rRNN which actually had learnt this specific walk, see also Table 1. The prediction errors on observations showed this effect as well, although not as clearly (Fig. 6 A-C).
We also investigated the effect of learning on the accumulated prediction errors by comparing the prediction errors of the learnt rRNNs with those of Dotted trajectories resulted from initial hidden states used during learning (x(0) l ) while solid trajectories resulted from random initial hidden states (x(0) r ). In this example we used the RNN parameters learnt for walk 1 (childish), but results are qualitatively similar for other RNN parameters.   4B). (C,F,I) show the dynamic prediction errors of the hidden states (Eq. 12, note that these prediction errors do not correspond to the difference between solid and dotted lines in the middle panels). The different rows of panels correspond to the different walks which were recognized (from top to bottom: childish, depressed and shy). Prediction errors were markedly lower, when the rRNN recognized the walk it was adapted for (C vs. F,I). absolute prediction errors of sensory (A-C) and hidden (D-F, Eq. 12) states of the three different rRNNs when data from one of the three different walks were observed. Each rRNN corresponds to one colour (childish: black, depressed: red, shy: yellow). Prediction errors of the rRNN, which has been learnt for the observed type of walk, are indicated by thick lines. random rRNNs. We generated 30 random rRNNs by drawing random parameters W, V and k while using the same connectivity constraints as for the gRNNs which were used for learning the walks. The accumulated prediction errors for the random rRNNs, thus, give an estimate for the total amount of prediction error expected in a random rRNN, i.e. without learning. As expected, the prediction errors of random rRNNs were always higher than those of the rRNNs with learnt parameters (see Table 1). Furthermore, for non-learnt stimuli, the learnt rRNNs often produced larger prediction errors than random rRNNs. This indicates that learning a specific walk restricts the dynamic repertoire of an rRNN. We conclude that the learning procedure resulted in rRNNs which were suited to discriminate the walks.
In an additional experiment we concatenated data from all three walks into a single sequence to simulate online recognition of three walks, see Fig. 7. The resulting saccade-like, abrupt transitions between walking styles led to a transient increase in prediction errors correctly signalling a discrepancy between predictions and actually observed kinematics. Furthermore, we implemented a simple readout mechanism for dynamic prediction errors using a filter which sums the absolute prediction errors over the last 30 time points and weights recent time points more strongly. This operation smoothes prediction errors temporally and stresses differences that stretch over a similar period as the filter size, see Fig. 7. After each transition, the rRNNs reduced their smoothed prediction errors quickly until the rRNN with parameters learnt for the currently observed walk was the only one for which the magnitude of prediction errors fell below an ad-hoc threshold. This shows that the present approach can be successfully used to recognize a specific walk by choosing the model with the lowest prediction error, after some initial transient has died away.

Robustness against Noise and Initial Conditions
Here, we demonstrate that the recognition scheme is robust to both noise and variations in initial conditions. We repeated the experiments above for increasing amounts of white observation noise and twelve different, randomly chosen sets of initial con-ditions, see Fig. 8. We found that the overall magnitude of dynamic prediction errors is proportional to the amount of observation noise. This indicates that observation noise is explained away by prediction errors of both sensory and hidden units. Importantly, the discrimination ability of the three rRNNs is maintained up to moderate amounts of noise, i.e., prediction errors still contained sufficient information to discriminate the three walks. As expected, for large amounts of noise, the contribution from observation noise eventually masked the prediction error contributed by the difference in walks. Also note that the dynamic prediction errors of the learnt rRNNs on their learnt walks (bottom trajectories in the panels of Fig. 8) had very low variability across initial conditions. This means that the rRNN, which was learnt for a specific walk and observes this walk as input, was much less dependent on its initial conditions than the rRNNs learnt on different walks. Yet, the variability of prediction error due to initial conditions within each rRNN was not large enough to influence the result of discrimination of the walks up to moderate amounts of noise. In other words, in our experiments accumulated prediction errors of the rRNN learnt for the current walk were always smaller than those of the other rRNNs (up to moderate amounts of noise), even when beneficial initial conditions for them led to better than average accumulated prediction errors.

Discussion
In this paper we have described the 'recognizing recurrent neural network' (rRNN), which is a recurrent neural network where each unit computes both predictions and prediction errors to recognize sensory input in a Bayes-optimal fashion. We derived the update equations of both sensory and hidden units by using an approximate Bayesian inference framework for nonlinear dynamical systems, i.e., dynamic expectation maximization . The rRNN approach is motivated by two considerations: Firstly, in an rRNN, the additional prediction error messages, as compared to a conventional RNN, increase the complexity of the messages passed between units in the network.
We propose that these computations may be better descriptions than conventional RNNs of what We concatenated sensory input from all walks into a single sequence and inferred the hidden states for all three rRNNs. Shown are temporally smoothed, summed absolute prediction errors for the rRNNs learnt on the childish (black), depressed (red) and shy (yellow) walks. We also plotted an ad-hoc decision boundary (dashed, grey line) which could be used to select models with a low prediction error.  Figure 8: Dependence of dynamic prediction errors on noise and initial conditions. Each panel shows average sums of absolute prediction errors for the three different rRNNs on one of the walks. The averages are over 12 randomly chosen initial states x(0) r and shading indicates the region around the mean of twice the standard deviation. The x-axis indicates the standard deviation of independent Gaussian noise added to the principal components of the walks on a log-scale. Note that the exponential increase of prediction errors with noise in the log-plot means that prediction errors depend approximately linear on the observation noise magnitude. real neuronal ensembles implement. We base this hypothesis on recent findings and theoretical considerations which show that single neurons (and consequently neuronal ensembles) compute much more complex functions than previously thought (Sidiropoulou et al., 2006;Spruston, 2008;Mel, 2008;Pissadaki et al., 2010;Debanne et al., 2011). The general idea is that a single neuron may in principle compute complex, nonlinear and dynamic functions using its spatiotemporal voltage depolarisations and other dynamics like calcium fluctuations (Mel, 2008). Although it is yet unclear how the computation of predictions and prediction errors may map to cellular dynamics, intracellular dynamics may have in principle the computational complexity to perform Bayesian decoding of their synaptic input (Denève, 2008). Secondly, the computation of predictions and prediction errors in an rRNN fits well with the recent reappraisal in cognitive neuroscience that the brain may use a predictive coding approach to perception and cognition (van Wassenhove et al., 2005;Summerfield et al., 2006;Bar, 2009;). The rRNN approach is a mathematical description of how this predictive coding scheme could be implemented for complex, multi-dimensional dynamic stimuli.
To illustrate that rRNNs may be an interesting model for understanding the brain function of recognition and prediction for naturalistic stimuli, we showed that rRNNs can robustly recognize kinematics as observed with motion capture data. We found that the prediction error computed by an rRNN can be used to recognize and discriminate between different human walking movements in an online fashion. Furthermore, this recognition mechanism is robust against both noise on the observations and variations in the initial state of the rRNN. In other words, rRNNs may be used as functional models for human action observation studies, e.g. (Blake and Shiffrar, 2007). The rRNN approach unifies many important aspects of brain processing such as statistically optimal inference in highly variable and noisy environments, recurrent connections, online recognition of dynamics, and quick adaptation to sudden changes in the environment. Therefore, the rRNN approach may be a useful device to bridge the gap between behaviour-driven models of cognition and neurobiologically plausible models of neuronal ensembles.
The idea to use autonomous RNNs as generative models is not entirely new. In previous work, we have used this approach in system identification where we explained neuroimaging data as generated by a network of cortical nodes, called 'Dynamic Causal Modelling' (DCM) (Friston et al., 2003;Kiebel et al., 2006Kiebel et al., , 2009a. Critically, the equations governing the dynamics of each node took the form of a rate model as in Eq. 3. The difference to the present approach is that DCM uses specific, highly constrained connectivity schemes based on neural mass models and does not allow for errors in the hidden states. Similarly, we used the present approach  to model recognition of multi-scale dynamics (Kiebel et al., 2008(Kiebel et al., , 2009b where the rRNN generalizes these previous contributions by using a more generic generative model (RNNs) and learning of natural stimuli.
To our knowledge, the explicit use of (generic) recurrent neural networks as generative models for recognizing dynamic sensory input using online Bayesian inference has not been described before. Both techniques, Bayesian inference for dynamic stimuli and artificial recurrent neural networks have existed in parallel for many years now (Jazwinski, 1970;Pearlmutter, 1989;Williams and Zipser, 1989;Narendra and Parthasarathy, 1990). We propose the combination of these two approaches in which RNNs act as dynamic models in a nonlinear, Bayesian filtering framework. Indeed, this idea has already been used implicitly in the field of machine learning and control. For example, Connor et al. (1994) used a related approach in the context of autoregressive models to remove outliers from sequences of discrete states which were represented by the hidden states of a RNN. Also, in dual extended Kalman filter methods for RNNs (Wan and Nelson, 2001) an extended Kalman filter is used to estimate RNN hidden states. However, these contributions focus on the usefulness of Bayesian filtering of RNN states to make conventional RNN learning more robust. Here we describe the idea that the combination of RNN equations and filter updates can themselves be interpreted as network equations which are better suited for recognizing dynamic stimuli. Therefore, the present approach also goes beyond previous suggestions of using RNNs as functions approximating the update equations of a nonlinear filter (Parlos et al., 2001), or its output (Ting-Ho Lo, 1994). We, thus, pro-vide a novel perspective on the role of RNNs also in possible machine learning applications.
We motivated the present approach by considering the potential functional role of recurrently connected neuronal ensembles in cortical processing. This allowed us to address recognition of arbitrary nonlinear dynamics embedded in multidimensional, continuous stimuli -something that has not been reported with spiking neuron models of neuronal coding in recurrent networks (Rao, 2004;Denève et al., 2007;Wilson and Finkel, 2009;Boerlin and Denève, 2011). In contrast, the 'reservoir computing' approach (Jaeger, 2001;Maass et al., 2002;Verstraeten et al., 2007) can recognize the class of stimuli we considered here. The reservoir computing approach has reinvigorated RNN research by establishing that very large networks (hundreds to thousands of units) combined with a simple readout function can be used to learn and recognize both dynamic and static stimuli. However, reservoir computing approaches typically do not adapt the dynamics within the network and rely on the chance probability that, among the many units, there exist some dynamical regimes which are appropriate generative models of the data. Here, we describe an alternative approach and speculate that small networks of 'smart' rRNN units may be sufficient to recognize dynamic stimuli.
Our use of RNNs as generative models of dynamic stimuli requires learning of their parameters (W, V and k in eqs. 3, 4). In particular, our results depend on learning the connections between hidden units in the recurrent network (W). This type of learning has been proven to be difficult in the past (Hammer and Steil, 2002). While the learning procedure used here was capable of learning a sufficiently good dynamic representation of the present walks alternative learning approaches may have to be used to achieve similar performance on other data sets. For example, the different principal component coordinates of our walks had similar time scales (Fig. 3). More complex and longer movements may demand the use of hierarchical models and corresponding learning algorithms (Hinton and Salakhutdinov, 2006).
While learning is an important issue with RNNs, we focused on providing a proof-of-principle that the rRNN approach can solve high-level problems such as discriminating visual dynamics in an online fashion. Importantly, our results appear to be ro-bust against sub-optimally learnt generative RNNs. This can be seen in Fig. 3, which shows a residual difference between learnt trajectories and the input. In other words, we found that prediction error messages were sufficiently informative even though the sensory input observed by the rRNN deviated slightly from the internally predicted dynamics. We also found robustness of the discrimination against white observation noise, see Fig. 8. It is an open question whether the rRNN approach is also robust against structured variations in human movements, e.g. as induced by a variation of a specific movement. We speculate that such variations require a different generative model, either where multiple movements are embedded in a single gRNN or where one uses a hierarchical gRNN similar to the approach used in Taylor and Hinton (2009).

A Bayesian Inversion of Dynamical Models
In this appendix we give a high-level description of the D-step in Friston's dynamic expectation framework  which leads to Eq. 5 in the main paper.

A.1 Generalised Coordinates
A major component of Friston's approach to stochastic processes is the redefinition of the timedependent variables in generalised coordinates of motion. For example, one replaces x(t) with and obtains for the probabilistic form of Eq. 3 (dropping the dependence on t for simplicity of writing)ẋ Note that it is assumed that f is locally linear around x(t) and that differently from the usual stochastic process models dependencies between noise variables across time are allowed, i.e., it is assumed that the noise at two close points in time correlates and that the noise process x (t) is differentiable sufficiently many times. In generalised coordinates of motion time-dependent variables encode not only a state at the current time, but additionally the future path of states. This is seen when we consider how the continuous representation here can be mapped onto a discrete sequence of N future observations y = [y 1 , . . . , y N ] T (for simplicity we here show only a single observed variable, i.e., D = 1): where n is the highest order of motion considered. We assume that i = 1, . . . , N are the times starting from t at which the data have been sampled. This formula represents a Taylor series approximation making use of the derivatives inỹ.  have shown that the variance of the noise process quickly becomes very large for high order motions such that only a small number n of generalised coordinates and data points need to be taken into account at any point in time. One also needs to translate discrete data samples into generalised coordinates of motion. This can be done using the inverse operation of Eq. 16. Rewriting Eq. 16 in matrix form gives . . . , N }, j ∈ {1, . . . , n}. (17) If N = n, E is invertible and one obtainsỹ(t) = E −1 y. The resultingỹ(t) is then used to compute the likelihood of the data at t and make inference over the hidden RNN statesx(t) as described below.

A.2 Dynamic Approximation of the Posterior Mode
In generalised coordinates eqs. 3 and 4 becomẽ where D is a matrix differentiation operator which shifts coordinates upwards by one element,f = . . ] T are the predicted generalised states and observations, respectively, with f = ∂f ∂x x and g = Vx (analogously for higher order terms f , . . . ). Becauseg is linear here, one can writeg = (I ⊗ V)x where ⊗ denotes the Kronecker product and I ∈ R n×n . Based on these equations the log-likelihood of the observationsỹ(t) is defined as where θ is a placeholder for all parameters in the model. Notice that p(f (x)|Y), where Y represents previously seen data, has been approximated as δ(f (μ)) -a Dirac delta function atf evaluated at the previous posterior modeμ (see below). This means that only the mode is propagated through the dynamics, but not its uncertainty.  then introduce a variational density q(x) (ignoring the density over parameters as learning is not our objective) and make use of Jensen's inequality to obtain where F(q, t) is the free energy which is a lower bound on the log-likelihood. The aim is to find q such that L(t) = F(q, t). In other words, one maximises F(q, t) with respect to q. This is equivalent to minimising KL[q(x)||p(x|ỹ, θ)], the KLdivergence between variational density and true posterior, i.e., after optimisation q is an approximation of the posterior density over RNN states.
In particular, it can be shown (Ghahramani and Beal, 2001;) that the q maximising F(q, t) is equal to where V (x) is called the variational energy. While this equation appears to be a trivial statement, the formulation of q in this way lets us recognize ) that q also is the density defined by a set of stochastically moving particles at their stationary solution where the movement of a single particle is given bẏ (22) and Γ(t) is a random fluctuation. Using this relationship one can find q using Monte Carlo simulation as we can compute the partial derivative of log p(ỹ,z|θ). However,  simplified this further. In particular, a single particle in generalised coordinates with motioṅ will converge to the modeμ of V , which is also the mode of the posterior, at a rate proportional to the constant κ . Given the modẽ µ Friston et al. (2008) use a Laplace approximation for the posterior where q ∼ N (μ,Σ) is defined to be Gaussian and the covarianceΣ is found as This is the inverse of the negative curvature of the posterior evaluated at the modeμ. This completes the derivation of the approximate posterior over RNN states. Under the approximations made and given the linearity of g one can identify the posterior p(x|ỹ, θ) as being Gaussian exploiting that p(ỹ|x, θ) and p(x|f , θ) are Gaussian. In this case the Laplace approximation is exact. Nevertheless, we retained Friston's more general form which is also valid for nonlinear g. More importantly, this motivates the dynamic form of estimating the posterior mode in Eq. 23 which allows us to extend the static result above to the dynamic case. In particular, note that all results above were obtained for only a single time point t. However, it can be shown  that the path integral of the free energy is maximised, if Eq. 21 holds for all t. Naively, this means that one has to integrate the motion of the particle in Eq. 23 until it converges toμ(t) for each t. However, if the particle converges faster ontoμ(t) thanμ moves itself, a condition which can be ensured by choosing an appropriate rate constant κ, we will be able to track the motion ofμ with a single particle and the dynamics given by initialiseμ (0) FOR t = 1:N 1) compute predictionsg(μ) andf (μ) from previousμ(t − ∆t) 2) findỹ based on n data points closest to t using Eq. 17 3) compute gradients of V (x) using predictions, Dμ(t − ∆t) andỹ 4) numerically integrate Eq. 23 to get newμ(t) END Eq. 23. Intuitively, the representation in generalised coordinates of motion here helps to converge to a mode which better represents the data as it also takes the local motion (velocity, acceleration, etc.) of the mode into account. For the purpose of this paper we ignored the approximated covariance and only concentrated on the posterior mode and the corresponding prediction errors. A summary of the resulting algorithm is shown in Table 2. We were able to ignore the covariance, because we assumed network parameters to be fixed during inversion. However, in the full DEM-framework these covariances are needed for the computation of parameter updates.

B Learning of RNN Parameters
We want to adapt the RNN parameters W, V, k such that the observations generated by the RNN defined in eqs. 3,4 fit the data. We mainly follow the approach underlying dynamic causal modelling (Friston et al., 2003;Kiebel et al., 2009a) which is detailed in (Friston, 2002;. This entails an iterative approximation of the parameter posterior based on a first-order Taylor expansion of an observation function vec(Y) = h(θ) which represents the underlying dynamical system. Here, Y ∈ R N ×D contains the observations at all N time points and θ = [vec(W) T , vec(V) T , k T ] T . The RNN states are enclosed in h(θ), because the dynamics is assumed to be noise free, i.e., deterministic. Both parameter likelihood and prior are assumed to be Gaussian so that the following gradients of the log-posterior L = log p(θ|Y) ∝ log p(Y|θ)p(θ) are obtained (cf. Friston, 2002, Eq. 17) We use these in a numerical integration scheme for nonlinear dynamical systems to obtain an update of the parameters dθ based on the model dθ/dt = ∂L/∂θ andθ (i+1) =θ (i) + dθ. Hereθ (i) is the maximum a posteriori estimate of the parameters in iteration i, [J] jk = ∂[h(θ (i) )] j /∂θ k is the Jacobian of h evaluated atθ (i) , C y is the covariance of the observations and µ θ and C θ are the prior mean and covariance of the parameters, respectively. Finally, r = vec(Y) − h(θ (i) ) are the residuals of the data not explained by the predictions h(θ (i) ) which are equivalent to the observation prediction errors ε y described in the main text. In each iteration one obtains the predictions h(θ (i) ) by numerical integration of the RNN dynamics and the Jacobian J using numerical differentiation of h(θ (i) ).
In our experiments we divided learning into two phases -an initial phase in which we adapted parameters only on local chunks of the data and a final phase in which we used the complete data. We found that the first phase helped to find a better initialisation ofθ for the optimisation on the whole data set in the second phase of learning. In the first phase we split the data into seven overlapping, equal size chunks and ran two passes through all chunks where we ran only two iterations of the update procedure described above per chunk and pass. In the second phase we ran 25 iterations with a fast, approximate numerical integration scheme for h and subsequently another 4 iterations with a slower, but more accurate scheme. While our choices for the number of chunks, passes and iterations led to good results, we expect that many other values may be chosen equivalently.
Embedded in each iteration there is also an expectation maximization (EM)-like update of hyperparameter λ which determines the amount of noise on the observations C y = e λ I during learning. We refer the reader to (Friston, 2002; for details. λ was initialised as -32. The hyperprior for λ was Gaussian λ ∼ N (− log(s 2 ), 1/8) wheres 2 is the average variance of the observation variables in the data.
We initialised the parameters contained in θ (0) as follows. The elements of k (0) were chosen uniformly in the interval [1/8, 3/8]. Randomly chosen 2/3 of all elements in W (0) were fixed at 0, the remaining were drawn randomly from a standard normal distribution. Furthermore, following (Jaeger et al., 2007), we scaled the resulting matrix by W = 1/(0.95δ)W (0) to bring the initial RNN dynamics to a useful dynamical regime. δ here is the largest absolute eigenvalue of the matrix [kW (0) − (1 −ka)I] where a is the leakage (cf. Eq. 1) andk = 1/4 is the expected value of k i for any i. The initial states x j (0) l were chosen uniformly in [−2, 2] for all j. We then found V (0) as the solution to the underdetermined system of equations y(0) = V (0) x(0) l using Matlab's backslash operator, i.e., we found the least squares solution for V (0) with most elements of V (0) equal to zero. A randomly chosen subset of these zero elements were also fixed during learning. The number of fixed elements was 1/3 of the total number of elements.
In the initial learning phase we set the mean of the parameter prior to the described initialisation of the parameters µ θ = θ (0) . In the subsequent learning phase we set µ θ to the result of the 1st phase. The covariances of the prior parameter distribution were chosen to be diagonal, but also differed in the two phases of learning. In the initial phase we set the variances associated with the elements of W to 1.6 · 10 5 while we set the variances for V and k to 0.018 and 0.135, respectively. This enforced particularly the adaptation of the dynamical parameters. For learning on the full data set we chose these variances to be 7.389, 1 and 1 for W, V and k, respectively.

C Prior Covariances
For the rRNN the prior covariances,Σ y andΣ x , modulate the size of updates of the posterior (cf. eqs. 10 and 9) and influence the result of the Bayesian inversion. Intuitively, for large prior (co-)variances, i.e., a large amount of a priori expected noise, smaller updates are made and larger prediction errors are tolerated. The amount of noise here has to be seen in comparison to the variance of the unperturbed states of the units in the gRNN. For the sensory states this corresponds to the variance of the movement data. The standard deviation of the sensory states across all walks averaged over the five input dimensions was 0.38 while the standard deviation of the corresponding changes in hidden states averaged over the 12 hidden units was 0.04. In our simulations we assumed isotropic prior noise and correspondingly chose covariances of the form Σ y = σ 2 y I 2 where I is the identity matrix and σ y is our choice of standard deviation. We chose σ y = 0.3 and σ x = 0.1 for sensory and hidden states, respectively. This means that we tolerated only relatively small prediction errors on sensory states while allowing for relatively larger prediction errors on changes of hidden states. This choice implements the natural prior belief that the variability of walk observations is mainly determined by the variability of the underlying dynamics.