Adaptive importance sampling for control and inference

Path integral (PI) control problems are a restricted class of non-linear control problems that can be solved formally as a Feynman-Kac path integral and can be estimated using Monte Carlo sampling. In this contribution we review path integral control theory in the finite horizon case. We subsequently focus on the problem of how to compute and represent control solutions. Within the PI theory, the question of how to compute becomes the question of importance sampling. Efficient importance samplers are state feedback controllers, and their use requires an efficient representation. Learning and representing effective state feedback controllers for non-linear stochastic control problems is a very challenging, and largely unsolved, problem. We show how to learn and represent such controllers using ideas from the cross entropy method. We derive a gradient descent method that allows us to learn feedback controllers using an arbitrary parametrisation. We refer to this method as Path Integral Learning. We illustrate this method for some simple examples. The path integral control methods can be used to estimate the posterior distribution in latent state models. In neuroscience these problems arise when estimating connectivity from neural recording data using EM. We demonstrate the path integral control method as an accurate alternative to particle filtering.


Introduction
Stochastic optimal control theory (SOC) considers the problem of computing an optimal sequence of actions to attain a future goal. The optimal control is usually computed from the Bellman equation, which is a partial differential equation. Solving this equation for high dimensional systems is difficult in general, except for special cases, most notably the case of linear dynamics and quadratic control cost or the noiseless deterministic case. Therefore, despite its elegance and generality, SOC has not been used much in practice.
In [Fleming and Mitter, 1982] it was observed that posterior inference in a certain class of diffusion processes can be mapped onto a stochastic optimal control problem. These so-called path integral (PI) control problems [Kappen, 2005] represent a restricted class of non-linear control problems with arbitrary dynamics and state cost, but with a linear dependence of the dynamics on the control and quadratic control cost. For this class of control problems, the Bellman equation can be transformed into a linear partial differential equation. The solution for both the optimal control and the optimal cost-to-go can be expressed in closed form as a Feynman-Kac path integral. The path integral involves an expectation value with respect to a dynamical system. As a result, the optimal control can be estimated using Monte Carlo sampling. See [Todorov, 2009, Kappen, 2011, Kappen et al., 2012] for earlier reviews and references.
In this contribution we review path integral control theory in the finite horizon case. Important questions are how to compute and how to represent the optimal control solution. In order to efficiently compute, or approximate, the optimal control solution we discuss the notion of importance sampling and its relation to the Girsanov change of measure theory. As a result, the path integrals can be estimated using (suboptimal) controls. Different importance samplers all yield the same asymptotic result, but differ in their efficiency. We show an intimate relation between optimal importance sampling and optimal control: we prove a Lemma showing that the optimal control solution is the optimal sampler, and that better samplers (in terms of effective sample size) are better controllers (in terms of control cost) [Thijssen and Kappen, 2015]. This allows us to iteratively improve the importance sampler, thus increasing the efficiency of the sampling.
In addition to the computational problem, another key problem is the fact that the optimal control solution is in general a state- and time-dependent function u(x, t), with u the control, x the state and t the time. The state dependence is referred to as feedback control, which means that the execution of the control at time t requires knowledge of the current state x of the system. It is often impossible to compute the optimal control for all states because this function is an infinite dimensional object; we call this the representation problem. Within the robotics and control community, there are several approaches to deal with this problem.

Deterministic control and local linearisation
The simplest approach follows from the realisation that state-dependent control is only required due to the noise in the problem. In the deterministic case, one can compute the optimal control solution u(t) = u*(x*(t), t) along the optimal path x*(t) only, and this is a function that only depends on time. This is a so-called open loop controller, which applies the control u(t) regardless of the actual state the system is in at time t. This approach works for certain robotics tasks such as grasping or reaching. See for instance [Theodorou et al., 2010, Schaal and Atkeson, 2010], who constructed open loop controllers for a number of robotics tasks within the path integral control framework. However, open loop controllers are clearly sub-optimal in general and simply fail for unstable systems that require state feedback.
It should be mentioned that the open loop approach can be stabilised by computing a linear feedback controller around the deterministic trajectory. This approach uses the fact that for linear dynamical systems with Gaussian noise and quadratic control cost, the solution can be efficiently computed. 1 One defines a linear quadratic control problem around the deterministic optimal trajectory x*(t) by Taylor expansion to second order, which can be solved efficiently. The result is a linear feedback controller that stabilises the trajectory x*(t). This two-step approach is well known and powerful, and it is at the basis of many control solutions such as the control of ballistic missiles or chemical plants [Stengel, 1993].
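To make the linear quadratic step concrete, here is a minimal discrete-time sketch of the backward Riccati recursion that produces such a linear feedback law. The notation and function names are ours (not the cited works' implementation), and we assume simple discrete dynamics x_{t+1} = A x_t + B u_t with stage cost x′Qx + u′Ru and final cost x′Q_T x:

```python
import numpy as np

def lqr_gains(A, B, Q, R, Q_T, horizon):
    """Finite-horizon discrete-time LQR via the backward Riccati recursion.

    Returns feedback gains K_t so that u_t = -K_t @ x_t is optimal. This is
    the kind of linear feedback law used to stabilise a trajectory."""
    P = Q_T
    gains = []
    for _ in range(horizon):
        # K = (R + B'PB)^{-1} B'PA, followed by the Riccati update for P
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()  # gains[t] is the gain to apply at time t
    return gains
```

For a scalar system with A = B = Q = R = 1, the gains far from the horizon approach the stationary solution of the Riccati equation (gain ≈ 0.618), illustrating how the time-dependent feedback law settles to a constant for long horizons.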
The solution of the linear quadratic control problem also provides a correction to the optimal trajectory x*(t). Thus, a new x*(t) is obtained, and a new LQG problem can be defined and solved. This approach can be iterated, incrementally improving the trajectory x*(t) and the linear feedback controller. This approach is known as Differential Dynamic Programming [Mayne, 1966, Murray and Yakowitz, 1984] or the Iterative LQG method [Todorov and Li, 2005]. In the robotics community this is a popular method, providing a practical compromise between stability, non-linearity and efficient computation [Morimoto et al., 2003, Tassa, 2011, Tassa et al., 2014].

Model predictive control
The second approach is to compute the control 'at run-time' for any state that is visited, using the idea of model predictive control (MPC) [Camacho and Alba, 2013]. At each time t in state x_t, one defines a finite horizon control problem on the interval [t, t + T] and computes the optimal control solution u(s, x_s), t ≤ s ≤ t + T, on the entire interval. One executes the dynamics using u(t, x_t) and the system moves to a new state x_{t+dt} as a result of this control and possible external disturbances. This procedure is repeated at each time. The method relies on a model of the plant and the external disturbances, and on the ability to compute the control solution sufficiently fast. MPC yields a state-dependent controller because the control solution in the future time interval depends on the current state. MPC avoids the representation problem altogether, because the control is never explicitly represented for all states, but computed for any state when needed. MPC is particularly well-suited for path integral control problems, because in this case the optimal control u*(x, t) is explicitly given in terms of a path integral. The challenge then is to evaluate this path integral sufficiently accurately in real time.
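A minimal sketch of such a receding-horizon loop is the following. This is our own illustration (not the controller of the cited works), assuming scalar dynamics dX = u ds + dW with hypothetical cost functions V and Phi, and using the uncontrolled dynamics as sampling control:

```python
import numpy as np

def pi_control(x, V, Phi, T, dt, nu, n_samples, rng):
    """Monte Carlo estimate of the path integral control at the current
    state x: sample uncontrolled rollouts dX = dW, weight by e^{-S}, and
    return the importance-weighted first noise increment divided by dt."""
    n_steps = int(round(T / dt))
    X = np.full(n_samples, x)
    S = np.zeros(n_samples)
    dW0 = None
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(nu * dt), n_samples)
        if k == 0:
            dW0 = dW                 # keep the first noise increments
        S += V(X) * dt
        X = X + dW                   # uncontrolled Euler step
    S += Phi(X)
    w = np.exp(-(S - S.min()))       # numerically stabilised weights
    w /= w.sum()
    return float(w @ dW0) / dt

def mpc_run(x0, n_mpc_steps, dt, rng, **plan):
    """Receding horizon: replan from each visited state, execute only the
    first control, then the real system moves under control plus noise."""
    x = x0
    for _ in range(n_mpc_steps):
        u = pi_control(x, dt=dt, rng=rng, **plan)
        x = x + u * dt + rng.normal(0.0, np.sqrt(plan['nu'] * dt))
    return x
```

With a quadratic state cost, the estimated control at a positive state should be negative (pushing the state toward zero), and the loop keeps the system near the low-cost region without ever representing the controller globally.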
[Thijssen and Kappen, 2015] proposed an adaptive Monte Carlo sampling scheme that is accelerated using importance sampling. This approach has been successfully applied to the control of 10 to 20 autonomous helicopters (quadrotors) engaged in coordinated control tasks, such as flying with minimal velocity in a restricted area without collision, or a task where multiple 'cats' need to catch a mouse that tries to get away [Gómez et al., 2015].

Parametrized solution
The third approach is to consider a parametrised family of controllers u(t, x|θ) and to find the optimal parameters θ*. If successful, this yields a near optimal state feedback controller for all t, x. This approach is well known in the control and reinforcement learning community. Reinforcement learning (RL) is a particular setting of control problems with the emphasis on learning a controller on the basis of trial and error. A sequence of states X_t, t = 0, dt, 2dt, . . . , is generated from a single roll-out of the dynamical system using a particular control, which is called the policy in RL. The 'learning' in reinforcement learning refers to the estimation of the optimal policy or cost-to-go function from a single roll-out [Sutton and Barto, 1998]. The use of function approximation in RL is not straightforward [Bellman and Dreyfus, 1959, Sutton, 1988, Bertsekas and Tsitsiklis, 1996]. To illustrate the problem, consider the infinite horizon discounted reward case, which is the most popular RL setting. The problem is to compute the optimal cost-to-go in a particular parametrised form: J(x|θ). In the non-parametrised case, the solution is given by the Bellman 'back-up' equation, which relates J(x_t) to J(x_{t+dt}), where x_t, x_{t+dt} are the states of the system at times t, t + dt, respectively, and x_{t+dt} is related to x_t through the dynamics of the system. In the parametrised case, one must compute the new parameters θ′ of J(x_t|θ′) from J(x_{t+dt}|θ). The problem is that the update is in general not of the parametrised form, and an additional approximation is required to find the θ′ that gives the best approximation. In the RL literature, one makes the distinction between 'on-policy' learning, where J is only updated for the sequence of states that are visited, and 'off-policy' learning, which updates J(x) for all states x, or a (weighted) set of states. Convergence of RL with function approximation has been shown for on-policy learning with linear function approximation (i.e. J is a linear function of θ) [Tsitsiklis and Van Roy, 1997]. These authors also provide examples of both off-policy learning and non-linear function approximation where learning does not converge.

Outline
This chapter is organized as follows. In section 2 we present a review of the main ingredients of the path integral control method. We define the path integral control problem and state the basic Theorem of its solution in terms of a path integral. We then prove the Theorem by showing in section 2.1 that the Bellman equation can be linearized by a log transform, and in section 2.2 that the solution of this equation is given in terms of a Feynman-Kac path integral. In section 2.3 we discuss how to efficiently estimate the path integral using the idea of importance sampling. We show that the optimal importance sampler coincides with the optimal control. In section 3 we review the cross entropy method, as an adaptive procedure to compute an optimized importance sampler within a parametrized family of distributions. In order to apply the cross entropy method, we reformulate the path integral control problem in terms of a KL divergence minimization in section 3.1, and in section 3.2 we apply this procedure to obtain optimal samplers/controllers to estimate the path integrals. In section 4 we illustrate the method by learning a parametrized, time-independent, state-dependent controller for some simple control tasks.
In section 5 we consider the reverse connection between control and sampling: we consider the problem of computing the posterior distribution of a latent state model that we wish to approximate using Monte Carlo sampling, and use optimal controls to accelerate this sampling problem. In neuroscience, such problems arise, e.g., in estimating network connectivity from data or in decoding neural recordings. The common approach is to formulate a maximum likelihood problem that is optimized using the EM method. The E-step is a Bayesian inference problem over hidden states and is shown to be equivalent to a path integral control problem. We illustrate this for a small toy neural network where we estimate the neural activity from noisy observations.

Path integral control
Consider the dynamical system

dX(s) = f(s, X(s)) ds + g(s, X(s)) ( u(s, X(s)) ds + dW(s) )    (1)

The stochastic process W(s), t ≤ s ≤ T is called a Brownian motion, with E[dW(s)²] = ds. We will use upper case for stochastic variables and lower case for deterministic variables. t denotes the current time and T the future horizon time. Given a function u(s, x) that defines the control for each state x and each time t ≤ s ≤ T, define the cost

C(t, x, u) = E [ Φ(X(T)) + ∫_t^T ds ( V(s, X(s)) + ½ u(s, X(s))² ) ]    (2)

with t, x the current time and state and u the control function. The stochastic optimal control problem is to find the optimal control function u:

J(t, x) = min_u C(t, x, u)    (3)

where E is an expectation value with respect to the stochastic process Eq. 1 with initial condition X_t = x and control u. J(t, x) is called the optimal cost-to-go, as it specifies the optimal cost from any intermediate state and any intermediate time until the end time t = T. For any control problem, J satisfies a partial differential equation known as the Hamilton-Jacobi-Bellman equation (HJB). In the special case of the path integral control problems the solution is given explicitly as follows.
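The controlled dynamics Eq. 1 can be simulated with a simple Euler-Maruyama discretisation; a one-dimensional sketch with our own function names:

```python
import numpy as np

def simulate(f, g, u, x0, t, T, dt, nu, rng):
    """Euler-Maruyama discretisation of dX = f ds + g (u ds + dW),
    the controlled dynamics of Eq. 1 (one-dimensional sketch).
    nu is the noise variance per unit time; returns the sampled path."""
    n = int(round((T - t) / dt))
    xs = np.empty(n + 1)
    xs[0] = x0
    s = t
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(nu * dt))
        xs[k + 1] = xs[k] + f(s, xs[k]) * dt \
            + g(s, xs[k]) * (u(s, xs[k]) * dt + dW)
        s += dt
    return xs
```

As a sanity check, with zero noise and the linear feedback u(s, x) = −x the path decays exponentially, matching the deterministic solution x(T) = x(t) e^{−(T−t)}.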

Theorem 1. Let u(s, x) be an arbitrary control function. The solution of the control problem Eq. 3 is given by

J(t, x) = − log ψ(t, x),    ψ(t, x) = E [ e^{−S(t,x,u)} ]    (4)

u*(t, x) = u(t, x) + lim_{s→t} (1/(s−t)) E [ e^{−S(t,x,u)} ∫_t^s dW(r) ] / E [ e^{−S(t,x,u)} ]    (5)

where we define

S(t, x, u) = Φ(X(T)) + ∫_t^T ds ( V(s, X(s)) + ½ u(s, X(s))² ) + ∫_t^T u(s, X(s)) dW(s)

and W(s), s ≥ t the Brownian motion. The expectations are with respect to the dynamics Eq. 1 with control u and initial condition X(t) = x.
The path integral control problem and Theorem 1 can be generalised to the multidimensional case where X(t), f(s, X(s)) are n-dimensional vectors, u(s, X(s)) is an m-dimensional vector and g(s, X(s)) is an n × m matrix. dW(s) is m-dimensional Gaussian noise with E[dW(s)] = 0 and E[dW(s)dW(r)′] = ν ds δ(s − r), with ν the m × m positive definite covariance matrix. Eqs. 1 and 2 become:

dX(s) = f(s, X(s)) ds + g(s, X(s)) ( u(s, X(s)) ds + dW(s) )    (6)

C(t, x, u) = E [ Φ(X(T)) + ∫_t^T ds ( V(s, X(s)) + ½ u(s, X(s))′ R u(s, X(s)) ) ]    (7)

where ′ denotes transpose. In this case, ν and R must be related as Rν = λI, with λ > 0 a scalar [Kappen, 2005].
In order to understand this result, we first derive in section 2.1 the HJB equation and show that for the path integral control problem it can be transformed into a linear partial differential equation. Subsequently, in section 2.2 we present a Lemma that will allow us to prove the Theorem.

The linear HJB equation
The derivation of the HJB equation relies on the argument of dynamic programming. This is quite general, but here we restrict ourselves to the path integral case. Dynamic programming expresses the control problem on the time interval [t, T] as an instantaneous contribution on the small interval [t, t + ds] plus a control problem on the remaining interval [t + ds, T]. We derive the HJB equation by discretising time with infinitesimal time increments ds. The minimisation in Eq. 3 is with respect to functions u of state and time and becomes a minimisation over a sequence of state-dependent functions u_{t:T−ds}:

J(t, x) = min_{u_{t:T−ds}} E[S_t] = min_{u_t} ( V(t, x) ds + ½ u_t(x)² ds + E [ J(t + ds, X_{t+ds}) ] )

The first step is the definition of J_t. The second step separates the cost term at time t from the rest of the contributions in S_t and uses E[dW_t] = 0. The third step identifies the remaining term as the optimal cost-to-go from time t + ds in state X_{t+ds}; the expectation is with respect to the next future state X_{t+ds} only. The fourth step uses the dynamics to express X_{t+ds} in terms of x_t: a first order Taylor expansion in ds and a second order Taylor expansion in (X_{t+ds} − x_t), together with E[X_{t+ds} − x_t] = (f(t, x) + g(t, x)u_t(x)) ds and E[(X_{t+ds} − x_t)²] = g(t, x)² ds, expresses E[J(t + ds, X_{t+ds})] in terms of J(t, x) and its partial derivatives with respect to t and x.
Note that the minimization over control paths u_{t:T−ds} is absent in the final result, and only a minimization over u_t remains. We obtain in the limit ds → 0:

−∂_t J(t, x) = min_u ( V(t, x) + ½ u² + ( f(t, x) + g(t, x) u ) ∂_x J(t, x) + ½ g(t, x)² ∂²_x J(t, x) )    (8)

Eq. 8 is a partial differential equation, known as the Hamilton-Jacobi-Bellman (HJB) equation, that describes the evolution of J as a function of x and t and must be solved with boundary condition J(T, x) = Φ(x). Since u appears linearly and quadratically in Eq. 8, we can solve the minimization with respect to u explicitly, which gives u*(t, x) = −g(t, x) ∂_x J(t, x). Defining ψ(t, x) = e^{−J(t,x)} (the log transform), the HJB equation becomes linear in ψ:

∂_t ψ = ( V − f ∂_x − ½ g² ∂²_x ) ψ    (9)

with boundary condition ψ(T, x) = e^{−Φ(x)}.

Proof of the Theorem
In this section we show that Eq. 9 has a solution in terms of a path integral (see [Thijssen and Kappen, 2015]). In order to prove this, we first derive the following Lemma. The derivation makes use of the so-called Itô calculus which we have summarised in the appendix.
Lemma 2. Define the stochastic processes Y(s), Z(s), t ≤ s ≤ T as functions of the stochastic process Eq. 1:

Y(s) = ∫_t^s dr ( V(r, X(r)) + ½ u(r, X(r))² ) + ∫_t^s u(r, X(r)) dW(r),    Z(s) = e^{−Y(s)} ψ(s, X(s)) / ψ(t, x)    (10)

so that S(t, x, u) = Y(T) + Φ(X(T)). When ψ is a solution of the linear Bellman equation Eq. 9 and u* is the optimal control, then

e^{−S(t,x,u)} = ψ(t, x) exp ( ∫_t^T ( u*(s, X(s)) − u(s, X(s)) ) dW(s) − ½ ∫_t^T ( u*(s, X(s)) − u(s, X(s)) )² ds )    (11)

Proof. Consider ψ(s, X(s)), t ≤ s ≤ T as a function of the stochastic process Eq. 1. Since X(s) evolves according to Eq. 1, ψ is also a stochastic process and we can use Itô's Lemma (Eq. 32) to derive a dynamics for ψ:

dψ = ( ∂_s ψ + (f + gu) ∂_x ψ + ½ g² ∂²_x ψ ) ds + g ∂_x ψ dW = ( V + u u* ) ψ ds + u* ψ dW

where the last equation follows because ψ satisfies the linear Bellman equation Eq. 9 and u* = g ∂_x log ψ. From the definition of Y we obtain dY = V ds + ½ u² ds + u dW. Using again Itô's Lemma (Eq. 32), d e^{−Y} = e^{−Y} ( −V ds − u dW ). Using the product rule (Eq. 31) we get

dZ = Z ( u* − u ) dW

so that Z is an exponential martingale: Z(T) = exp ( ∫_t^T (u* − u) dW − ½ ∫_t^T (u* − u)² ds ). Since Z(t) = 1 and ψ(T, X(T)) = exp(−Φ(X(T))), we have Z(T) = e^{−Y(T)−Φ(X(T))} / ψ(t, x) = e^{−S(t,x,u)} / ψ(t, x). This proves Eq. 11.
With the Lemma, it is easy to prove Theorem 1. Taking the expectation value in Eq. 11 proves Eq. 4: the exponential on the right hand side is a martingale with expectation one, so that ψ(t, x) = E[e^{−S(t,x,u)}] and J(t, x) = − log ψ(t, x). This is a closed form expression for the optimal cost-to-go as a path integral.
To prove Eq. 5, we multiply Eq. 11 with W(s) = ∫_t^s dW, which is an increment of the Wiener process, and take the expectation value:

E [ e^{−S(t,x,u)} W(s) ] = ψ(t, x) E [ ∫_t^s Z(r) ( u*(r, X(r)) − u(r, X(r)) ) dr ]

where in the first step we used E[W(s)] = 0 and in the last step we used the Itô isometry (Eq. 35). To get u*, we divide by the time increment s − t and take the limit of the time increment to zero. This yields the integrand of the right hand side, ψ(t, x) ( u*(t, x) − u(t, x) ); the expectation value disappears and we get

u*(t, x) = u(t, x) + lim_{s→t} (1/(s−t)) E [ e^{−S(t,x,u)} W(s) ] / ψ(t, x)

which is Eq. 5.

Monte Carlo sampling
Theorem 1 gives an explicit expression for the optimal control u*(t, x) and the optimal cost-to-go J(t, x) in terms of an expectation value over trajectories that start at x at time t and run until the horizon time T. One can estimate the expectation value by Monte Carlo sampling. One generates N trajectories X_i(s), i = 1, . . . , N, starting at x, t, that evolve according to the dynamics Eq. 1. Then, ψ(t, x) and u*(t, x) are estimated as

ψ̂(t, x) = (1/N) Σ_{i=1}^N e^{−S_i(t,x,u)},    û*(t, x) = u(t, x) + (1/(s−t)) Σ_{i=1}^N w_i ∫_t^s dW_i(r),    w_i = e^{−S_i(t,x,u)} / Σ_{j=1}^N e^{−S_j(t,x,u)}

The optimal control estimate involves a limit, which we must handle numerically by setting s − t = ε > 0. Although in theory the result holds in the limit ε → 0, in practice ε should be given a finite value because of numerical instability, at the expense of theoretical correctness. The estimate involves a control u, which we refer to as the sampling control. Theorem 1 shows that one can use any sampling control to compute these expectation values. The choice of u affects the efficiency of the sampling. The efficiency of the sampler depends on the variance of the weights w_i, which can be easily understood: if the weight of one sample dominates all other weights, the weighted sum over N terms is effectively only one term. The optimal weight distribution for sampling is obtained when all samples contribute equally, which means that all weights are equal. It can easily be seen from Lemma 2 that this is obtained when u = u*. In that case, the stochastic terms on the right hand side of Eq. 11 vanish and thus S(t, x, u*) is a deterministic quantity. This means that for all trajectories X_i the value S_i(t, x, u*) is the same (and equal to the optimal cost-to-go J(t, x)). Thus, sampling with u* has zero variance, meaning that all samples yield the same result and therefore only one sample is required.
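The weighted estimators and the weight-degeneracy diagnostic discussed here can be written compactly. A sketch with our own helper names, taking precomputed path costs S_i and first noise increments as inputs:

```python
import numpy as np

def psi_and_control(S, dW0, dt):
    """Weighted estimators from N sampled trajectories:
    psi_hat = (1/N) sum_i e^{-S_i}, the finite-epsilon control correction
    from the weighted first noise increments, and the effective sample
    size 1 / sum_i w_i^2 with normalised weights w_i."""
    w = np.exp(-(S - S.min()))           # subtract min(S) for stability
    psi_hat = w.mean() * np.exp(-S.min())
    w_norm = w / w.sum()
    du = float(np.dot(w_norm, dW0)) / dt  # u* - u correction at epsilon = dt
    ess = 1.0 / np.sum(w_norm ** 2)
    return psi_hat, du, ess
```

When all S_i are equal (the case u = u*), every weight is 1/N and the effective sample size equals N; when one S_i dominates, the effective sample size collapses to 1.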
One can view the choice of u as implementing a type of importance sampling, and the optimal control u* is the optimal importance sampler. One can also deduce from Lemma 2 that when u is close to u*, the variance in the right hand side of Eq. 11 as a result of the different trajectories is small, and thus the variance in the weights w_i ∝ e^{−S_i(t,x,u)} is small. Thus, the closer u is to u*, the more effective the importance sampler [Thijssen and Kappen, 2015].
Since it is in general not feasible to compute u * exactly, the key question is how to compute a good approximation to u * . In order to address this question, we propose the so-called cross-entropy method.

The cross-entropy method
The cross-entropy method [De Boer et al., 2005] is an adaptive approach to importance sampling. Let X be a random variable taking values in the space X. Let f_v(x) be a family of probability density functions on X parametrized by v, and let h(x) be a positive function. Suppose that we are interested in the expectation value

l = E_u [ h(X) ] = ∫ dx f_u(x) h(x)    (14)

where E_u denotes expectation with respect to the pdf f_v for the particular value v = u. A crude estimate of l is by naive Monte Carlo sampling from f_u: draw N samples X_i, i = 1, . . . , N from f_u and construct the estimator

l̂ = (1/N) Σ_{i=1}^N h(X_i)    (15)

The estimator is a stochastic variable and is unbiased, which means that its expectation value is the quantity of interest: E_u[l̂] = l. The variance of l̂ quantifies the accuracy of the sampler. The accuracy is high when many samples give a significant contribution to the sum. However, when the supports of f_u and h have only a small overlap, most samples X_i from f_u will have h(X_i) ≈ 0 and only few samples effectively contribute to the sum. In this case the estimator has high variance and is inaccurate. A better estimate is obtained by importance sampling. The idea is to define an importance sampling distribution g(x), sample N samples from g(x), and construct the estimator

l̂ = (1/N) Σ_{i=1}^N h(X_i) f_u(X_i) / g(X_i),    X_i ∼ g    (16)

It is easy to see that this estimator is also unbiased: E_g[ h(X) f_u(X)/g(X) ] = ∫ dx g(x) h(x) f_u(x)/g(x) = l. The question now is to find a g such that l̂ has low variance. When g = f_u, Eq. 16 reduces to Eq. 15.
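A minimal sketch of the two estimators Eqs. 15 and 16 for a rare-event example (our own illustration), with f_u = N(0, 1) and a Gaussian proposal g centered where h is non-negligible:

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def naive_estimate(h, N):
    """Eq. 15: crude Monte Carlo with X_i ~ f_u = N(0, 1)."""
    X = rng.normal(0.0, 1.0, N)
    return h(X).mean()

def importance_estimate(h, mu_g, N):
    """Eq. 16: sample X_i ~ g = N(mu_g, 1) and reweight by f_u / g."""
    X = rng.normal(mu_g, 1.0, N)
    return np.mean(h(X) * norm_pdf(X, 0.0, 1.0) / norm_pdf(X, mu_g, 1.0))
```

For h(x) = 1{x > 3}, the true value is l = P(X > 3) ≈ 1.35 · 10⁻³. The naive estimator wastes almost all samples (h ≈ 0), while shifting the proposal to g = N(3, 1) places the samples where h is non-zero and reduces the variance dramatically.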
Before we address this question, note that it is easy to construct the optimal importance sampler. It is given by

g*(x) = f_u(x) h(x) / l    (17)

where the denominator follows from normalization: 1 = ∫ dx g*(x). In this case the estimator Eq. 16 becomes l̂ = l for any set of samples. Thus, the optimal importance sampler has zero variance and l can be estimated with one sample only. Clearly g* cannot be used in practice since it requires l, which is the quantity that we want to compute! However, we may find an importance sampler that is close to g*. The cross entropy method suggests to find the distribution f_v in the parametrized family of distributions that minimises the KL divergence

KL(g* | f_v) = E_{g*} [ log g*(X) ] − E_{g*} [ log f_v(X) ] ∝ −E_u [ h(X) log f_v(X) ] ≡ −D(v)

where in the first step we have dropped the constant term E_{g*}[log g*(X)], and in the second step we have used the definition of g* and dropped the constant factor 1/l. The objective is to maximize D(v) with respect to v. For this we need to compute D(v), which involves an expectation with respect to the distribution f_u. We can use again importance sampling to compute this expectation value: instead of f_u we sample from f_w for some w. We thus obtain

D(v) = E_w [ h(X) ( f_u(X)/f_w(X) ) log f_v(X) ]

We estimate the expectation value by drawing N samples from f_w. If D is differentiable and concave in v, the optimal v is the solution of

Σ_{i=1}^N h(X_i) ( f_u(X_i)/f_w(X_i) ) ∇_v log f_v(X_i) = 0    (18)

The cross entropy method considers the following iteration scheme. Initialize w_0 = u. In iteration n = 0, 1, . . ., generate N samples from f_{w_n} and compute v by solving Eq. 18. Set w_{n+1} = v.

We illustrate the cross entropy method for a simple example. Consider X = R and the family of so-called tilted distributions f_v(x) = (1/N_v) p(x) e^{vx}, with p(x) a given distribution and N_v = ∫ dx p(x) e^{vx} the normalization constant. We assume that it is easy to sample from f_v for any value of v. Choose u = 0; then the objective Eq. 14 is to compute l = ∫ dx p(x) h(x). We wish to estimate l as efficiently as possible by optimizing v. Since ∇_v log f_v(x) = x − E_v[X], Eq. 18 becomes

E_v[X] = Σ_i h(X_i) ( p(X_i)/f_w(X_i) ) X_i / Σ_i h(X_i) ( p(X_i)/f_w(X_i) )

Note that the left hand side is equal to E_v[X] and the right hand side is the 'h-weighted' expected X under p. The cross entropy update is to find v such that the h-weighted expected X equals E_v[X]. This idea is known as moment matching: one finds v such that the moments of the left and right hand side, in this case only the first moment, are equal.
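The tilted-distribution example can be sketched concretely. With p = N(0, 1), the tilted family is f_v = N(v, 1), so sampling and the likelihood ratio p/f_w are explicit (this is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def ce_tilted_gaussian(h, n_iter=8, N=5000):
    """Cross-entropy iteration for the tilted family f_v(x) = p(x)e^{vx}/N_v
    with p = N(0, 1), for which f_v = N(v, 1) and E_v[X] = v. The update
    Eq. 18 is moment matching: set v to the h-weighted mean of X under p,
    estimated with samples drawn from the current tilt f_w."""
    w = 0.0
    for _ in range(n_iter):
        X = rng.normal(w, 1.0, N)              # X_i ~ f_w = N(w, 1)
        lik = np.exp(-w * X + 0.5 * w ** 2)    # likelihood ratio p(X)/f_w(X)
        hw = h(X) * lik                        # h times likelihood ratio
        w = float(np.dot(hw, X) / hw.sum())    # new v = h-weighted mean
    return w
```

For the smooth choice h(x) = exp(−(x − 4)²), the product p(x)h(x) is proportional to a Gaussian with mean 8/3, so the iteration should settle near v ≈ 2.67, i.e. the tilt moves the sampler toward the region where h is large.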

The Kullback-Leibler formulation of the path integral control problem
In order to apply the cross entropy method to the path integral control theory, we reformulate the control problem Eq. 3 in terms of a KL divergence. Let X denote the space of continuous trajectories on the interval [t, T]: τ = X(s), t ≤ s ≤ T, with fixed initial value X(t) = x. Denote by p^u(τ) the distribution over trajectories τ with control u. The distributions p^u for different u are related to each other by the Girsanov theorem. We derive this relation by simply discretising time as before. In the limit ds → 0, the conditional probability of X_{s+ds} given X_s is Gaussian with mean µ_s = X_s + f(s, X_s)ds + g(s, X_s)u(s, X_s)ds and variance Ξ_s ds = g(s, X_s)² ds. Therefore, the conditional probability of a trajectory τ = X_{t:T}, with initial state X_t = x, is 2

p^u(τ) = Π_s N( X_{s+ds} | µ_s, Ξ_s ds ) = p^0(τ) exp ( ∫_t^T ( ½ u(s, X_s)² ds + u(s, X_s) dW(s) ) )    (19)

where in the last step we used the dynamics Eq. 1. p^0(τ) is the distribution over trajectories in the absence of control, which we call the uncontrolled dynamics. Using Eq. 19 one immediately sees that

KL( p^u | p^0 ) = E_{p^u} [ log p^u(τ)/p^0(τ) ] = E_{p^u} [ ½ ∫_t^T u(s, X_s)² ds ]

In other words, the quadratic control cost in the path integral control problem Eq. 3 can be expressed as a KL divergence between the distribution over trajectories under control u and the distribution over trajectories under the uncontrolled dynamics. Eq. 3 can thus be written as

J(t, x) = min_{p^u} ( KL( p^u | p^0 ) + E_{p^u} [ V(τ) ] )    (20)

with V(τ) = Φ(X_T) + ∫_t^T ds V(s, X(s)).

Since there is a one-to-one correspondence between u and p^u, one can replace the minimization with respect to the functions u in Eq. 20 by a minimisation with respect to the distribution p, subject to the normalization constraint ∫ dτ p(τ) = 1. The optimal solution is given by

p*(τ) = p^0(τ) e^{−V(τ)} / ψ(t, x)    (21)

where ψ(t, x) = E_{p^0} [ e^{−V(τ)} ] is the normalization, which is identical to Eq. 4. Substituting p* in Eq. 20 yields the familiar result J(t, x) = − log ψ(t, x). Note that we have two expressions for p*(τ). Eq. 21 expresses p* in terms of the uncontrolled dynamics p^0 and the state cost. Eq. 19 expresses p*, for u = u*, in terms of the uncontrolled dynamics p^0 and the optimal control u*. Combining Eqs. 21 and 19 we obtain

p*(τ) = p^u(τ) e^{−S(t,x,u)} / ψ(t, x)    (22)

2 In the multi-dimensional case of Eq. 7 this generalizes as follows: the variance is g(s, X_s) ν g(s, X_s)′ ds = λ Ξ_s ds with Ξ_s = g(s, X_s) R^{−1} g(s, X_s)′.

The cross entropy method for path integral control
We are now in a similar situation as the cross entropy method. We cannot compute the optimal control u* that parametrizes the optimal distribution p* = p^{u*}, and instead wish to compute a near optimal control û such that p^û is close to p*. Following the CE argument, we minimise

KL( p* | p^û ) = −E_{p*} [ log p^û(τ)/p^0(τ) ] + const.
             = −E_{p*} [ ∫_t^T ( û(s, X(s)) u(s, X(s)) ds + û(s, X(s)) dW(s) − ½ û(s, X(s))² ds ) ] + const.
             = −(1/ψ(t, x)) E_p [ e^{−S(t,x,u)} ∫_t^T ( û(s, X(s)) u(s, X(s)) ds + û(s, X(s)) dW(s) − ½ û(s, X(s))² ds ) ] + const.    (23)

where in the second line we used Eq. 19 and discarded the constant term E_{p*}[log p^0], and in the third line we used Eq. 22 to express the expectation with respect to the optimal distribution p*, controlled by u*, in terms of a weighted expectation with respect to an arbitrary distribution p, controlled by u. We further used that X_{s+ds} = X_s + f(s, X_s)ds + g(s, X_s)( u(s, X_s)ds + dW(s) ). The expectation of dW_s in Eq. 23 is non-zero due to the weighting by e^{−S(t,x,u)}. 3 The KL divergence Eq. 23 must be optimized with respect to the functions û_{t:T} = {û(s, X_s), t ≤ s ≤ T}. In addition, the KL divergence involves an expectation value that uses a sampling control u_{t:T} = {u(s, X_s), t ≤ s ≤ T}. We are free to choose any sampling control, as they all yield unbiased estimators, but the more the sampling control resembles the optimal control, the more efficiently these expectation values can be estimated.
We now assume that û is a parametrized function with parameters θ. In the time-dependent case, we consider a separate θ_s for each of the functions û(s, x|θ_s). In this case the gradient of the KL divergence Eq. 23 is given by:

∂KL/∂θ_s = (1/ψ(t, x)) E_p [ e^{−S(t,x,u)} ( û(s, X(s)) − u(s, X(s)) − dW_s/ds ) ∂û(s, X(s))/∂θ_s ] ds    (24)

In the case that û(s, x) and u(s, x) are linear combinations of a set of K basis functions h_{sk}(x) with parameters θ_{sk} and θ⁰_{sk}, respectively, i.e. û(s, x) = Σ_{k=1}^K θ_{sk} h_{sk}(x) and similarly for u(s, x), we can set the gradient equal to zero and obtain the set of equations:

Σ_{l=1}^K ⟨ h_{sl} h_{sk} ⟩ θ_{sl} = Σ_{l=1}^K ⟨ h_{sl} h_{sk} ⟩ θ⁰_{sl} + ⟨ (dW_s/ds) h_{sk} ⟩    (25)

where we defined ⟨F⟩ = (1/ψ(t, x)) E_p [ e^{−S(t,x,u)} F ], with p the distribution over trajectories under the control u that is linearly parametrized by θ⁰. Eq. 25 is, for each s, a system of K linear equations in the K unknowns θ_{sk}, k = 1, . . . , K. The statistics ⟨h_{sl} h_{sk}⟩ and ⟨(dW_s/ds) h_{sk}⟩ can be estimated for all times t ≤ s ≤ T simultaneously from a single Monte Carlo sampling run using the control u parametrized by θ⁰. The fixed point equations Eq. 25 were derived in [Thijssen and Kappen, 2015] using a different reasoning.
Although in principle the optimal control explicitly depends on time, there may be reasons to compute a control functionû(x) that does not explicitly depend on time. For instance, consider a stabilizing task such as an inverted pendulum. The optimal control solution u * (t, x) assumes an optimal timing of the execution of the swing-up. If for some reason this is not the case and the timing is off, an inappropriate controlû(t, x) is used at time t. Another situation where a time-independent solution is preferred is when the horizon time is very large, and the dynamics and the cost are also not explicit functions of time. The advantage of a time-independent control solution is clearly that it requires less storage.
We thus consider û(X_s) and u(X_s), independent of time and parametrised by θ and θ⁰, respectively. In this case the gradient of the KL divergence Eq. 23 is given by:

∂KL/∂θ = (1/ψ(t, x)) E_p [ e^{−S(t,x,u)} ( ∫_t^T ( û(X_s) − u(X_s) ) ( ∂û(X_s)/∂θ ) ds − ∫_t^T ( ∂û(X_s)/∂θ ) dW(s) ) ]    (26)

Note the extra integral over s, due to the fact that a single control function is active at all times. In the last term, the integration over s has resulted in an Itô stochastic integral. This has removed the awkward numerical estimation of E[dW(s)/ds]. In the case that û(x) and u(x) are linear combinations of a set of K basis functions h_k(x) with parameters θ_k and θ⁰_k, respectively, we can again set the gradient equal to zero and obtain the set of equations:

Σ_{l=1}^K ⟨ ∫_t^T h_l(X_s) h_k(X_s) ds ⟩ θ_l = Σ_{l=1}^K ⟨ ∫_t^T h_l(X_s) h_k(X_s) ds ⟩ θ⁰_l + ⟨ ∫_t^T h_k(X_s) dW(s) ⟩    (27)

Eq. 27 is a system of K linear equations in the K unknowns θ_k, k = 1, . . . , K.
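Per iteration, Eq. 27 is just a weighted K × K linear solve. A sketch of the estimation step with hypothetical array shapes (not the authors' code), returning the parameter increment θ − θ⁰:

```python
import numpy as np

def solve_theta_increment(hk_paths, dW_paths, weights, dt):
    """Solve the K linear equations of Eq. 27 for a linear controller
    u_hat(x) = sum_k theta_k h_k(x), given a sampling control with
    parameters theta0. Since the Gram matrix multiplies both theta and
    theta0, the solve directly yields the increment theta - theta0.

    hk_paths : (N, n_steps, K) basis functions along N trajectories
    dW_paths : (N, n_steps)    noise realisations of those trajectories
    weights  : (N,)            normalised weights w_i proportional to e^{-S_i}
    """
    # < int h_l h_k ds > : weighted, time-integrated Gram matrix
    G = np.einsum('i,itl,itk->lk', weights, hk_paths, hk_paths) * dt
    # < int h_k dW >     : weighted Ito integrals of the basis functions
    b = np.einsum('i,itk,it->k', weights, hk_paths, dW_paths)
    return np.linalg.solve(G, b)
```

All the required statistics come from a single batch of rollouts under the current sampling control, which is what makes the fixed point iteration cheap.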
If required, the estimations of θ in Eqs. 25 and 27 can be repeated several times, each time with an improved θ, u, implementing an adaptive importance sampling algorithm. In iteration n, θ = θ n+1 is computed using a sampling control parametrized by θ 0 = θ n .
In the case that û does not depend linearly on θ, one cannot directly solve ∂KL(p*|p)/∂θ = 0 and must resort to a gradient descent procedure. Here one can also include the idea of adaptive importance sampling. Remember that the KL divergence Eq. 23 must be minimized with respect to θ, but also involves a sampling control, parametrized by θ⁰. Since the gradient descent procedure presumably monotonically improves the control, it is best to use the most recent control estimate as sampling control. Setting u = û in the gradients for the time-dependent and time-independent cases Eqs. 24 and 26 significantly simplifies them, and the gradient descent updates become

θ_s ← θ_s + η ⟨ (dW_s/ds) ∂û(s, X(s))/∂θ_s ⟩ ds    (28)

θ ← θ + η ⟨ ∫_t^T ( ∂û(X_s)/∂θ ) dW(s) ⟩    (29)

respectively, with η > 0 a small parameter. Since Eqs. 28 and 29 are (minus) the gradients of the KL divergence, their convergence is guaranteed using standard arguments. We refer to this gradient method as path integral learning (PIL).
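A sketch of one path integral learning update Eq. 29 for scalar dynamics dX = u ds + dW (our own simplification; the weighted average ⟨·⟩ is estimated from N rollouts under the current controller):

```python
import numpy as np

rng = np.random.default_rng(3)

def pil_step(theta, u_hat, du_dtheta, V, Phi, x0, T, dt, nu, N, eta):
    """One path integral learning update, Eq. 29 with u = u_hat:
    theta <- theta + eta * < int (d u_hat / d theta) dW >, where <.> is
    the e^{-S}-weighted average over N rollouts from x0."""
    n_steps = int(round(T / dt))
    X = np.full(N, x0)
    S = np.zeros(N)
    grad = np.zeros((N, len(theta)))          # Ito integrals int du/dtheta dW
    for _ in range(n_steps):
        u = u_hat(theta, X)
        dW = rng.normal(0.0, np.sqrt(nu * dt), N)
        S += (V(X) + 0.5 * u ** 2) * dt + u * dW   # running path cost S
        grad += du_dtheta(theta, X) * dW[:, None]
        X = X + u * dt + dW
    S += Phi(X)
    w = np.exp(-(S - S.min()))
    w /= w.sum()
    return theta + eta * (w @ grad)
```

With the linear feedback û(x) = θ₁x and a quadratic state cost, repeated calls drive θ₁ negative, toward a stabilising gain, mirroring the experiment of section 4.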

Numerical illustration
In this section, we illustrate path integral learning for two simple problems: a linear quadratic control problem, where we compare the result with the optimal solution, and an inverted pendulum control task, where we compute the non-linear state feedback controller.

Figure 1: Illustration of path integral learning Eq. 29 for a 1-dimensional linear quadratic control problem with Q = 2, R = 1, ν = 0.1, T = 5. We used time discretization ds = 0.05 and generated 50 sample trajectories for each gradient computation, all starting from x = 2, and η = 0.1.
Consider the finite horizon 1-dimensional linear quadratic control problem with dynamics and cost as above, with E[dW(s)^2] = ν ds. The optimal control solution can be shown to be a linear feedback controller. For finite horizon, the optimal control explicitly depends on time, but for large T the optimal control becomes independent of t: u*(x) = −√(Q/R) x. We estimate a time-independent feedback controller of the form û(x) = θ_0 + θ_1 x using the path integral learning rule Eq. 29. The result is shown in fig. 1. The top left plot shows θ_0, θ_1 as a function of gradient descent step. Note that θ_0, θ_1 rapidly approach their optimal values 0 and −1.41 (red and blue lines). The under-estimation of |θ_1| is due to the finite horizon and the transient behavior induced by the initial value of X_0, as can be checked by initializing X_0 from the stationary optimally controlled distribution around zero (results not shown). The top right plot shows the effective sample size as a function of gradient descent step, which increases due to the improved sampling control. The bottom row shows 50 sample trajectories in the first and last gradient descent iteration.

Figure 2: Illustration of gradient descent learning Eq. 29 for a second order inverted pendulum problem with Q_1 = 2/T, Q_2 = 0.02/T, R = 0.5/T, ν = 0.3, T = 10. We used time discretization ds = 0.005 and generated 500 sample trajectories for each gradient computation, all starting from (x_1, x_2) = (−π/2, 0) ± (0, 0.1), with η = 1, K_1 = 40, K_2 = 80. Left: fraction of effective sample size (Σ_i w_i^2)^{−1}, with w_i ∝ e^{−S_i}, Σ_i w_i = 1 the normalised trajectory weights, versus importance sampling iteration. Right: optimal control solution û(x_1, x_2) versus x_1, x_2 with 0 ≤ x_1 ≤ 2π and −2 ≤ x_2 ≤ 2.
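The stationary gain θ_1 ≈ −1.41 = −√(Q/R) of the LQ example can be sanity-checked with a short Monte Carlo sketch. Assuming the setup dx = u ds + dW with running cost (Q x² + R u²)/2 (our reading of the example; all code names are illustrative), we scan constant feedback gains u(x) = k x and pick the one with the lowest simulated cost:

```python
import numpy as np

# Scan constant feedback gains for the LQ example (illustrative sketch);
# the minimizing gain should lie near -sqrt(Q/R) = -sqrt(2) ≈ -1.41.
rng = np.random.default_rng(0)
Q, R, nu, T, ds, x0 = 2.0, 1.0, 0.1, 5.0, 0.05, 2.0

def avg_cost(k, n=2000):
    """Average path cost under the fixed feedback u(x) = k x."""
    x = np.full(n, x0)
    cost = np.zeros(n)
    for _ in range(int(T / ds)):
        u = k * x
        cost += 0.5 * (Q * x**2 + R * u**2) * ds      # running cost
        x += u * ds + np.sqrt(nu * ds) * rng.standard_normal(n)
    return cost.mean()

gains = np.linspace(-2.5, -0.6, 20)
best = gains[np.argmin([avg_cost(k) for k in gains])]
```

With these parameters the minimizing gain lands close to the value that path integral learning converges to in fig. 1.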
As a second illustration we consider a simple inverted pendulum that satisfies the dynamics d²α/dt² = −cos α + u, where α is the angle that the pendulum makes with the horizontal, α = 3π/2 is the initial 'down' position, α = π/2 is the target 'up' position, and −cos α is the force acting on the pendulum due to gravity. Introducing x_1 = α, x_2 = dα/dt and adding noise, we write this system as a first order system with E[dW_s^2] = ν ds and ν the noise variance. We estimate a time-independent feedback controller on a grid k_1 = 1, . . . , K_1, k_2 = 1, . . . , K_2, with x_i^± the maximum and minimum values of x_i. The results of the path integral learning rule Eq. 29 are shown in fig. 2. Fig. 2 (left) shows that the effective sample size for this problem increases with learning to approximately 7% on average after 1000 importance sampling iterations, but with large fluctuations. Fig. 2 (right) shows the solution after 1000 importance sampling iterations in the (x_1, x_2) plane. The white star is the initial location (3π/2, 0) (pendulum pointing down, zero velocity) and the red star is the target state x = (π/2, 0) (pendulum pointing up, zero velocity). Two example trajectories are shown: the red trajectory forces the particle with positive velocity towards the top, and the blue one forces the particle with negative velocity towards the top. Note the green NE-SW ridge in the control solution around the top. These are states where the position deviates from the top position, but with a velocity directed towards the top, so in these states no control is required. In the orthogonal NW-SE direction, control is needed to balance the particle. This example shows that the learned state feedback controller is able to swing up and stabilize the inverted pendulum.
It should be noted that the use of the path integral method for stabilizing stochastic control tasks is challenging, as is evident from the low effective sample size. There are two reasons for this:
• The weights of the trajectories are proportional to e^{−S}, with S ∝ 1/λ from Eq. 7 and λ = Rν playing the role of temperature. Small λ has the effect that the effective sample size is small (close to one sample), because the weight of one trajectory dominates all others. Thus, in order to have a large effective number of samples one cannot choose ν too small, meaning that the stochastic disturbances will be relatively large, which makes the problem harder to control. In order to control these disturbances, the control should be sufficiently large, meaning that R should be small. But R cannot be chosen too small either, since it affects the effective sample size in the same way as ν. This problem is due to the log transform that is used to linearize the Bellman equation.
• No matter how complex or unstable the problem, if the control solution approaches the optimal control sufficiently closely, the effective sample size should reach 100%. The grid-like parametrization provides maximal flexibility to represent any control function. We experimented with other parametrizations, û(x_1, x_2) = θ_0 + θ_1 sin(x_1) + θ_2 cos(x_1) + θ_3 x_2 sin(x_1) + θ_4 x_2 cos(x_1) (results not shown), but could only find solutions with an effective sample size close to one sample. Representing the correct control using the grid parametrization requires in principle an infinitely fine grid, which in turn requires infinitely many samples to avoid overfitting. The low effective sample size is thus also due to a too coarse grid.
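The effective-sample-size diagnostic used throughout this section, ESS = (Σ_i w_i²)^{−1} with normalised weights w_i ∝ e^{−S_i}, can be computed robustly in a few lines (a minimal sketch; the function name is ours):

```python
import numpy as np

def effective_sample_size(S):
    """ESS = 1 / sum_i w_i^2 for normalised weights w_i ∝ exp(-S_i).

    Subtracting min(S) prevents numerical underflow for large path
    costs without changing the normalised weights.
    """
    w = np.exp(-(np.asarray(S, dtype=float) - np.min(S)))
    w /= w.sum()
    return 1.0 / np.sum(w**2)
```

Equal costs give ESS = N, while one dominating trajectory gives ESS close to one sample, which is the regime described in the first bullet above.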
This suggests that the key issue for the successful application of the path integral method is the parametrization that is used to represent û. This representation should balance the two conflicting requirements of any learning problem: 1) the parametrization should be sufficiently flexible to represent an arbitrary function, and 2) the number of parameters should not be too large, so that the function can be learned without too many samples. The inverted pendulum can of course also be controlled using other methods, for instance iterative LQG. One then first solves the deterministic control problem in the absence of noise and computes a linear feedback controller around this solution. In that case the solution is 'unimodal', representing one of the two possible swing-up solutions, and time-dependent. The point of the simulation is to illustrate that it is in principle possible to learn any state feedback controller, such as the 'multimodal' control solution that represents both swing-up solutions simultaneously.

Bayesian system identification: potential for neuroscience data analysis
We have shown that the path integral control problem is equivalent to a statistical estimation problem. We can use this equivalence to solve large stochastic optimal control problems by Monte Carlo sampling. We can accelerate this computation by importance sampling, and have shown that the optimal control coincides with the optimal importance sampler. In this section, we consider the reverse connection between control and sampling: we consider the problem of computing the posterior distribution of a latent state model, which we wish to approximate using Monte Carlo sampling, and use optimal controls to accelerate this sampling problem.
In neuroscience, there is great interest in scalable inference methods, e.g. to estimate network connectivity from data or to decode neural recordings. It is common to assume that there is an underlying physical process of hidden states that evolves over time, which is observed through noisy measurements. In order to extract information about the processes giving rise to these observations, or to estimate model parameters, one needs knowledge of the posterior distributions over these processes given the observations. For instance, in the case of calcium imaging, one can indirectly observe the network activity of a large population of neurons. Here, the hidden states represent the activity of individual neurons, and the observations are the calcium measurements [Mishchenko et al., 2011].
The state-of-the-art is to use one of many variations of particle filtering-smoothing methods to estimate the state distributions conditioned on the observations, see [Briers et al., 2010, Doucet and Johansen, 2011, Lindsten and Schön, 2013]. A fundamental shortcoming of these methods is that the estimated smoothing distribution relies heavily on the filtering distribution, which is computed using particle filtering. For high dimensional problems these distributions may differ significantly, which yields poor estimation accuracy, as seen in the following example.
One can easily see that the path integral control computation is mathematically equivalent to a Bayesian inference problem in a time series model, with p_0(τ) the distribution over trajectories under the forward model Eq. 1 with u = 0, and where one interprets e^{−V(τ)} = ∏_{s=t}^{T} p(y_s|x_s) as the likelihood of the trajectory τ = x_{t:T} under some fictitious observation model p(y_s|x_s) = e^{−V(x_s)} with given observations y_{t:T}. The posterior is then given by p*(τ) in Eq. 21. One can generalize this by replacing the fixed initial state x by a prior distribution over the initial state. Therefore, the optimal control and importance sampling results of section 3.2 can be directly applied. The advantage of the PI method is that the computation scales linearly in the number of particles 4, compared to the state-of-the-art particle smoother, which scales quadratically in the number of particles, although in practice significant accelerations can be made, e.g. [Fearnhead et al., 2010, Lindsten and Schön, 2013].
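The u = 0 sampler implied by this mapping simply draws trajectories from the forward model and weights each by the likelihood of the observations, e^{−V(τ)}. The following minimal sketch illustrates this baseline that PI importance sampling improves upon; the OU prior, the per-step Gaussian observation model and all names are illustrative assumptions, not the model of the example below:

```python
import numpy as np

# Weighting prior trajectories by the observation likelihood (u = 0 sampler).
rng = np.random.default_rng(1)
ds, n_steps, n_part, nu, s_obs = 0.05, 40, 1000, 0.2, 0.2
y = np.sin(ds * np.arange(n_steps))     # fictitious observations y_s
x = np.zeros(n_part)                    # particles at the initial state
logw = np.zeros(n_part)                 # accumulated log-likelihoods
for t in range(n_steps):
    # forward model with u = 0: dx = -x ds + dW, E[dW^2] = nu ds
    x += -x * ds + np.sqrt(nu * ds) * rng.standard_normal(n_part)
    # log p(y_s | x_s) up to a constant, Gaussian observation model
    logw += -0.5 * (y[t] - x) ** 2 / s_obs**2
w = np.exp(logw - logw.max())           # stable normalised weights
w /= w.sum()
posterior_mean = w @ x                  # smoothed estimate of the final state
```

With many observations the weights of this uncontrolled sampler degenerate quickly, which is exactly what the learned feedback control of the next example counteracts.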
To illustrate this, we estimate the posterior distribution of a noisy 2-dimensional firing rate model given 12 noisy observations of a single neuron, say ν_1 (green diamonds in fig. 3). In the model, J is a 2-dimensional antisymmetric matrix and θ is a 2-dimensional vector, both with random entries from Gaussian distributions with mean zero and standard deviations 25 and 0.75, respectively, and σ²_dyn = 0.2. We assume a Gaussian observation model N(y_i|ν_{1,t_i}, σ²_obs) with σ_obs = 0.2. We generate the 12 1-dimensional observations y_i, i = 1, . . . , 12, with ν_{1,t_i} the firing rate of neuron 1 at time t_i during one particular run of the model.
We parametrized the control as u(x, t) = A(t)x + b(t) and estimated the 2×2 matrix A(t) and the 2-dimensional vector b(t) as described in [Thijssen and Kappen, 2015] and Eq. 25.
The path integral control solution (RPIIS) is shown in fig. 3 and was computed using 22 importance sampling iterations with 6000 particles per iteration. As a comparison, the forward filtering backward simulation particle smoother (FFBSi) was computed using N = 6000 forward and M = 3600 backward particles. In blue, we see the FFBSi estimates and in red the RPIIS estimates of the posterior distribution p(ν_{0:T}|y_{0:T}). The computation times were 35.1 s and 638 s, respectively. Figure 4 shows the estimated control parameters used for the RPIIS method. The open loop controller b_1(t) steers the particles to the observations. The feedback controller A_11(t) 'stabilizes' the particles around the observations (blue lines). Due to the coupling between the neurons, the non-observed neuron is also controlled in a non-trivial way. To appreciate the effect of using a feedback controller, we compared these results with an open-loop controller u(x, t) = b(t). This reduces the ESS from 60% for the feedback controller to around 29% for the open loop controller. The lower sampling efficiency increases the error of the estimates, especially the variance of the posterior marginal (not shown). When choosing the importance sampling controller, there is in general a trade-off between accuracy and the computational effort involved in the update rules Eqs. 25 and 27.
The example shows the potential of adaptive importance sampling for posterior estimation in continuous state-space models. A publication with the analysis of this approach for high dimensional problems is in preparation. The approach can be used to accelerate maximum likelihood based methods to estimate, for instance, connectivity, to decode neural populations, to estimate spike rate functions and, in general, for any inference problem in the context of state-space models; see [Oweiss, 2010] and references therein for a treatment of state-space models in the context of neuroscience and neuro-engineering.

Summary and discussion
The original path integral control result of Theorem 1 expresses the optimal control u*(t, x) for a specific t, x as a Feynman-Kac path integral. The important advantage of the path integral control setting is that, asymptotically, the result of the sampling procedure does not depend on the choice of sampling control. The reason is that the control used during exploration acts as an importance sampler in the sense of Monte Carlo sampling, and any importance sampling strategy gives the same result asymptotically. Clearly, the efficiency of the sampling depends critically on the sampling control. Theorem 1 can be used very effectively for high dimensional stochastic control problems in the Model Predictive Control setting [Gómez et al., 2015]. However, Theorem 1 is of limited use when we wish to compute a parametrized control function for all t, x. We have therefore proposed here to use the cross entropy argument, originally formulated to optimize importance sampling distributions, to find a control function whose distribution over trajectories is closest to the optimally controlled distribution. In essence, this optimization replaces the original KL divergence KL(p|p*) Eq. 20 by the reverse KL divergence KL(p*|p) and optimizes for p. The resulting path integral learning method provides a flexible framework for learning a large class of non-linear stochastic optimal control problems with a control that is an arbitrary function of state and parameters. The idea to optimize this reverse KL divergence was explored earlier for the time-dependent case and linear feedback control in [Gomez et al., 2014].
We have restricted our numerical examples to parametrizations that are linear in the parameters. Generalization to non-linear parametrizations, such as for instance (deep) neural networks, Gaussian processes or other machine learning methods can be readily considered, at no significant extra computational cost.
The path integral learning rule Eq. 29 has some similarity with the so-called policy gradient method for average reward reinforcement learning [Sutton et al., 1999], Δθ = η E_π[Σ_a ∂π(a|s)/∂θ Q^π(s, a)], where s, a are discrete states and actions, π(a|s, θ) is the policy, i.e. the probability to choose action a in state s, and θ parametrizes the policy. E_π denotes the expectation with respect to the invariant distribution over states when using policy π, and Q^π is the state-action value function (cost-to-go) using policy π. The convergence of the policy gradient rule is proven when the policy is an arbitrary function of the parameters. The similarities between policy gradient and path integral learning are that the policy takes the role of the sampling control, and that the policy gradient involves an expectation with respect to the invariant distribution under the current policy, similar to the time integral in Eq. 29 for large T when the system is ergodic. The differences are: 1) the expectation value in the policy gradient is weighted by Q^π, which must be estimated independently, whereas the brackets in Eq. 29 involve a weighting with e^{−S}, which is readily available; 2) Eq. 29 involves an Itô stochastic integral, whereas the policy gradient does not; 3) the policy gradient method is for discrete states and actions, whereas path integral learning is for controlled non-linear diffusion processes; 4) the policy gradient expectation value is not independent of π, unlike the path integral gradients Eqs. 24 and 26.

Acknowledgement
I would like to thank Vicenç Gómez for helpful comments and careful reading of the manuscript.

A Itô calculus
Given two diffusion processes, Itô calculus expands their product as follows. The term in the last line is known as the quadratic covariation.
Let F(Y) be a function of the stochastic process Y. Itô's lemma is a type of chain rule that gives the evolution of F. Writing a process of the form Eq. 30 in integral notation and taking the expected value yields the following. The Itô isometry states that
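The Itô isometry, E[(∫ f dW)²] = E[∫ f² ds], can be verified numerically. The sketch below uses the textbook choice f = W_s (our choice, not from the text), for which both sides equal ∫_0^T s ds = T²/2:

```python
import numpy as np

# Monte Carlo check of the Ito isometry for f(s) = W_s on [0, T]:
# E[(∫ W dW)^2] should match E[∫ W^2 ds] = ∫_0^T s ds = T^2 / 2.
rng = np.random.default_rng(2)
T, n_steps, n_paths = 1.0, 400, 10000
ds = T / n_steps
dW = np.sqrt(ds) * rng.standard_normal((n_paths, n_steps))
W = np.cumsum(dW, axis=1) - dW          # W at the left endpoint (Ito convention)
ito_integral = np.sum(W * dW, axis=1)   # ∫_0^T W_s dW_s per path
lhs = np.mean(ito_integral**2)          # E[(∫ W dW)^2]
rhs = T**2 / 2                          # E[∫ W^2 ds] = ∫_0^T s ds
```

Note that W is evaluated at the left endpoint of each interval, as the Itô convention requires; evaluating it at the right endpoint would converge to a different (Stratonovich-type) answer.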