Convergence results for an averaged LQR problem with applications to reinforcement learning

In this paper, we will deal with a Linear Quadratic Optimal Control problem with unknown dynamics. As a modeling assumption, we will suppose that the knowledge that an agent has of the current system is represented by a probability distribution $\pi$ on the space of matrices. Furthermore, we will assume that such a probability measure is suitably updated to take into account the increased experience that the agent obtains while exploring the environment, approximating with increasing accuracy the underlying dynamics. Under these assumptions, we will show that the optimal control obtained by solving the "averaged" Linear Quadratic Optimal Control problem with respect to a certain $\pi$ converges to the optimal control of the Linear Quadratic Optimal Control problem governed by the actual, underlying dynamics. This approach is closely related to model-based Reinforcement Learning algorithms, where prior and posterior probability distributions describing the knowledge of the uncertain system are recursively updated. In the last section, we will show a numerical test that confirms the theoretical results.


Introduction
Reinforcement learning (RL) is one of the three basic machine learning paradigms, together with supervised learning and unsupervised learning. In RL, an agent interacts with a partially unknown environment, aiming at finding a policy that optimizes a certain long-term performance measure [25]. The connection between RL and optimal control theory was already identified in the past [26]. Nowadays, several research questions and tools coming from the RL literature are influencing the optimal control field and vice versa [23].
A natural setting in RL consists in considering a Markov decision process over state/action pairs varying on a discrete-time set. The discrete-time problem setting provides an excellent framework to develop methods and algorithms, which, however, often rest on an underlying continuous-time structure. For this reason, in particular in the control system engineering field, significant attention has recently been given to continuous-time RL [8,14,15,17].
Both in discrete- and continuous-time problem settings, one can consider two main RL philosophies. The first one, called model-based, usually concerns the reconstruction of a model from the data, trying to mimic the unknown environment. That model is then used to plan and to compute a suboptimal policy. The second RL philosophy, called model-free, employs a direct approximation of the value function and/or of a policy based on a dynamic-programming-like algorithm, without using a model to simulate the unknown environment. An excellent overview of the two approaches can be found in [25].
About ten years ago, PILCO was introduced [6,7], an innovative and disruptive method from which many subsequent model-based RL methods have taken inspiration. Rather than exploiting the data to construct a single dynamics approximating the partially known environment, PILCO makes use of them to construct a probability distribution (more precisely, a Gaussian process) on a class of dynamical systems. At each iteration, this distribution is updated to fit the data set. After the model update, PILCO takes the policy improvement step, which boils down to solving an averaged optimal control problem, where the averaging distribution is the one extrapolated from the data of the previous experiments. That approach has the advantage of considerably reducing the model bias, one of the main shortcomings of model-based RL [1]. A general, rigorous framework capturing PILCO as well as other Bayesian model-based RL approaches (see, e.g., [3,4,10–12,29]) has been developed in [18,19]. In particular, it is important to mention that the framework developed in [18] is closely related to the averaging control framework and Riemann–Stieltjes optimal control [2,16,21,24,30].
The aim of this paper is to provide a stricter link between PILCO [7] and the framework introduced in [18]. We will concentrate on a specific physical system, driven by a linear quadratic regulator (LQR) optimal control problem, namely
$\min_{u} \int_0^T \big( x(t)^\top Q x(t) + u(t)^\top R u(t) \big)\, dt + x(T)^\top Q_f x(T)$, subject to $\dot x(t) = \hat A x(t) + B u(t)$, $x(0) = x_0$, (1)
where $B, Q, R, Q_f$ are given, known matrices and $x_0 \in \mathbb R^n$ is a given vector. However, in this context we will assume that the matrix $\hat A$ describing the physical system is unknown, and the knowledge we have about $\hat A$ is merely represented by a probability distribution $\pi$ constructed over a set of matrices $\mathcal A$ (with $\hat A \in \mathcal A$) by using the data available from the physical system. This situation is in accordance with the PILCO setting, which is "not focusing on a single dynamics model", but makes use of "a probabilistic dynamics model, a distribution over all plausible dynamics models that could have generated the observed experience" ([6], p. 34). Such a modeling setting allows us to define the averaged optimal control problem
$\min_{u} E_{\pi}\Big[ \int_0^T \big( x_A(t)^\top Q x_A(t) + u(t)^\top R u(t) \big)\, dt + x_A(T)^\top Q_f x_A(T) \Big]$, subject to $\dot x_A(t) = A x_A(t) + B u(t)$, $x_A(0) = x_0$, for every $A \in \mathcal A$. (2)
If indeed the real physical system is driven by the equation $\dot x(t) = \hat A x(t) + B u(t)$ for a certain matrix $\hat A$, then it is reasonable to expect that an increase of the experience will produce a more accurate distribution $\pi$ over $\mathcal A$. This fact can be translated into the assumption that the probability distribution is "close" (in a precise sense that will be specified in the sequel) to $\delta_{\hat A}$ when enough experience of the environment (here represented by $\hat A$) is gained. We would like to stress that our goal is not to propose a new algorithm to find an optimal policy, but to consider a class of existing algorithms and motivate their good performance. In particular, the paper aims to provide an insight into the convergence of Bayesian-like RL algorithms in which a recursive construction of probability measures is carried out. (Further considerations on the connection with RL are given in Remarks 3.4 and 5.5.)
Here, by "convergence", we mean convergence of the optimal policy obtained by estimating the underlying dynamics using data from the real system (the one constructed in the so-called policy improvement step) toward the optimal policy obtained by solving problem (1). More precisely, the questions we will tackle in this paper are: (1) Is the value function related to the optimal control (1) close to the value function associated with (2) when π is close to δÂ w.r.t. the Wasserstein distance (see (9) for the formal definition)? (2) Under the same assumptions over π , is the optimal control of (2) close to the optimal control of (1)? For both question (1) and question (2), we will provide positive answers. It is worth noticing that, in control theory, it is very uncommon to have a positive answer to question (2), even when one has a positive answer to question (1).
The paper is organized as follows. Section 2 introduces the basic notations that we will use throughout the paper. In Sect. 3, we state the problem formulation and we study the basic properties of the LQ optimal control problem (2). Then, in Sect. 4, we also derive a Pontryagin's maximum principle for problem (2), refining some results in [2]. In Sect. 5, we state and prove the main results of the paper, providing positive answers to question (1) and question (2). In Sect. 6, we strengthen the results presented in Sect. 5 in the case in which one is dealing with a discrete probability measure π . That result is further stressed in Sect. 7, where we present and analyze a numerical example. Finally, we present the conclusion and discuss some future directions and open questions.

Preliminaries and notations
In this section, we will recall some useful notations and concepts which will be used throughout the paper. For vectors $v \in \mathbb R^n$, $|v|$ denotes the Euclidean norm; we use $B_n(x, r)$ to denote the open ball in $\mathbb R^n$ centered at $x \in \mathbb R^n$ and of radius $r > 0$. We will use $\mathcal L$ to denote the Lebesgue σ-algebra and $\mathcal B_X$ to denote the collection of all Borel sets of a given topological space $X$.
In the following, $T$ will be our fixed time horizon. Given a time $s \in [0, T]$, we denote by $C([s, T]; \mathbb R^n)$ the space of continuous functions $x : [s, T] \to \mathbb R^n$. For functions $x(\cdot) \in C([s, T]; \mathbb R^n)$, $\|x(\cdot)\|_\infty$ or $\|x\|_\infty$ denotes the sup norm. For continuous functions $y(\cdot) \in C(\mathbb R^n; \mathbb R)$ and for a compact set $K \subset \mathbb R^n$, we also define the sup norm restricted to the compact set $K$ as $\|y(\cdot)\|_{\infty,K} := \sup_{x \in K} |y(x)|$.
Moreover, given $p \in [1, \infty)$, we define the spaces of a.e. defined functions $L^p([s, T]; \mathbb R^n)$ and $W^{1,p}([s, T]; \mathbb R^n)$. For $g \in L^p([s, T]; \mathbb R^n)$, the $L^p$-norm $\|g(\cdot)\|_p$ or $\|g\|_p$ is defined by
$\|g\|_p := \Big( \int_s^T |g(t)|^p\, dt \Big)^{1/p},$
and for $g \in W^{1,p}([s, T]; \mathbb R^n)$ we define $\|g\|_{1,p} := \|g\|_p + \|\dot g\|_p$. Let us denote by $M^{m \times n}$ the space of real matrices with $m$ rows and $n$ columns. For square matrices $A \in M^{n \times n}$, we consider the 2-norm $\|A\|_2 := \sup\{ x^\top A y : x, y \in \mathbb R^n,\ |x| = |y| = 1 \}$.
Given two matrices $A, A' \in M^{n \times n}$,
$d_2(A, A') := \|A - A'\|_2$ (7)
denotes the distance between the two matrices induced by the 2-norm. For a generic metric space $(X, d)$, $\mathcal M(X)$ will denote the space of measures on $X$, equipped with the weak-* topology, according to which $\pi_N \rightharpoonup^* \pi$ if and only if $\int_X f\, d\pi_N \to \int_X f\, d\pi$ for every $f \in C_b(X)$, where $C_b(X)$ denotes the set of all bounded continuous functions $f : X \to \mathbb R$. When the space $X$ is compact, the weak-* topology is metrized by the Wasserstein distance (see, e.g., [27])
$W_1(\pi, \pi') := \inf_{\gamma \in \Gamma(\pi, \pi')} \int_{X \times X} d(x, y)\, d\gamma(x, y),$ (9)
where $\Gamma(\pi, \pi')$ is the collection of all probability measures on $X \times X$ with marginals $\pi$ and $\pi'$ on the first and second factors, respectively. Given a probability space $(\Omega, \mathcal F, \pi)$ and a random variable $Y$ on $\Omega$, $E_\pi[Y]$ denotes the expected value of $Y$ with respect to $\pi$.
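When $\pi$ is discrete and the second measure is a Dirac delta $\delta_{\hat A}$, every coupling in $\Gamma(\pi, \delta_{\hat A})$ must transport each atom to $\hat A$, so the infimum above reduces to the expected distance: $W_1(\pi, \delta_{\hat A}) = \sum_i \alpha_i\, d_2(A_i, \hat A)$. A minimal numerical sketch of this special case (the matrices and weights below are illustrative, not taken from the paper):

```python
import numpy as np

def d2(A, Aprime):
    """Distance induced by the spectral (2-)norm on square matrices."""
    return np.linalg.norm(A - Aprime, ord=2)

def w1_against_dirac(alphas, mats, A_hat):
    """W_1(pi, delta_{A_hat}) for a discrete pi = sum_i alphas[i] * delta_{mats[i]}.

    Against a Dirac delta every coupling transports each atom to A_hat,
    so the infimum over couplings is just the expected distance to A_hat."""
    return sum(a * d2(M, A_hat) for a, M in zip(alphas, mats))

# Illustrative data: two candidate dynamics, the "true" one weighted 3/4.
A_hat = np.array([[0.0, 1.0], [-1.0, 0.0]])
A_other = np.array([[0.0, 1.0], [-2.0, 0.0]])
dist = w1_against_dirac([0.75, 0.25], [A_hat, A_other], A_hat)
# dist = 0.25 * ||A_other - A_hat||_2 = 0.25
```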

Problem statements and preliminary results
We begin our discussion by considering two LQ optimal control problems. We will see in the sequel how the two problems are connected.
Problem A: the LQR problem.
Let us consider the classical LQR problem with finite horizon, which we will refer to as Problem A:
$\min_{u \in \mathcal U} J_s[u],$ (10)
where
$J_s[u] := \int_s^T \big( x(t)^\top Q x(t) + u(t)^\top R u(t) \big)\, dt + x(T)^\top Q_f x(T)$ (11)
and $x(\cdot)$ is the solution of the Cauchy problem
$\dot x(t) = \hat A x(t) + B u(t), \qquad x(s) = x_0.$ (12)
The pair $(x, u)(\cdot)$ such that $u \in \mathcal U$ and $x(\cdot)$ is the solution of the Cauchy problem (12) is called an admissible process for Problem A. Let us define the value function $V : [0, T] \times \mathbb R^n \to \mathbb R$ for Problem A as
$V(s, x_0) := \inf_{u \in \mathcal U} J_s[u].$ (13)
We shall say that $(\bar x, \bar u)(\cdot)$ is an optimal process for Problem A if
$J_s[\bar u] \le J_s[u]$ (14)
for any other admissible process $(x, u)(\cdot)$ of Problem A. In this case, $\bar u$ will be called an optimal control for Problem A.
Problem B: an LQR problem with unknown dynamics. Let us now introduce an optimal control problem that does not require the exact knowledge of the matrix $\hat A$, but merely a probability distribution defined on a compact space of matrices $\mathcal A$ containing $\hat A$. For each $s \in [0, T]$, $x_0 \in \mathbb R^n$ and $\pi \in \mathcal M(\mathcal A)$, consider the following optimal control problem, which we will refer to as Problem B:
$\min_{u \in \mathcal U} J_{s,\pi}[u],$ (15)
where $\mathcal U := \{ u : [s, T] \to \mathbb R^m \text{ Lebesgue measurable} \}$ and
$J_{s,\pi}[u] := E_\pi\Big[ \int_s^T \big( x_A(t)^\top Q x_A(t) + u(t)^\top R u(t) \big)\, dt + x_A(T)^\top Q_f x_A(T) \Big].$ (16)
Remark 3.1 Sometimes we will denote this problem as Problem B$_\pi$, to stress its dependency on the probability distribution $\pi$.
The definition of the value function $V_\pi : [0, T] \times \mathbb R^n \to \mathbb R$ for Problem B is analogous:
$V_\pi(s, x_0) := \inf_{u \in \mathcal U} J_{s,\pi}[u].$ (17)
The collection $\{(x_A, u)(\cdot) : A \in \mathcal A\}$ such that $u \in \mathcal U$ and, for each $A \in \mathcal A$, $x_A(\cdot)$ is the solution of the Cauchy problem
$\dot x_A(t) = A x_A(t) + B u(t), \qquad x_A(s) = x_0,$ (18)
is called an admissible process for Problem B. Note that the initial condition is the same for every $A$. The admissible process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is called an optimal process for Problem B if
$J_{s,\pi}[\bar u] \le J_{s,\pi}[u]$ (19)
for any other admissible process $\{(x_A, u)(\cdot) : A \in \mathcal A\}$ of Problem B. In this case, $\bar u$ will be called an optimal control for Problem B.
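For a discrete distribution $\pi = \sum_i \alpha_i \delta_{A_i}$, the averaged cost (16) is a weighted sum of standard LQR costs evaluated along the trajectories $x_{A_i}(\cdot)$ driven by one and the same control. The following Python sketch evaluates it with an explicit Euler scheme; the scheme, the matrices and the control are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def averaged_cost(alphas, mats, B, Q, R, Qf, x0, u, T, steps=2000):
    """Approximate J_{s,pi}[u] for pi = sum_i alphas[i] * delta_{mats[i]}:
    sum_i alpha_i [ int_0^T (x_i'Qx_i + u'Ru) dt + x_i(T)' Qf x_i(T) ]
    with dot x_i = A_i x_i + B u, x_i(0) = x0, via explicit Euler.
    `u` is a function of time returning a control vector."""
    dt = T / steps
    cost = 0.0
    for a, A in zip(alphas, mats):
        x = x0.copy()
        for k in range(steps):
            uk = u(k * dt)
            cost += a * (x @ Q @ x + uk @ R @ uk) * dt   # running cost
            x = x + dt * (A @ x + B @ uk)                # Euler step
        cost += a * (x @ Qf @ x)                         # final cost
    return cost
```

With $A = 0$, $u \equiv 0$, $Q = Q_f = I$ and $x_0 = (1, 0)$, the trajectory is constant and the cost over $[0, 1]$ equals $|x_0|^2 \cdot 1 + |x_0|^2 = 2$, which gives a simple sanity check.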
Throughout the whole paper, the following Standing Hypothesis will be imposed both for Problem A and for Problem B: (SH): $Q, Q_f \in M^{n \times n}$ are symmetric and positive semidefinite, and $R \in M^{m \times m}$ is symmetric and positive definite.

Remark 3.2
As we will show in Lemma 3.6, Problem B admits a unique optimal process when the assumption (SH) holds true. Furthermore, it is interesting to observe that Problem A can be regarded as a particular case of Problem B when one chooses $\pi = \delta_{\hat A}$, namely when $\pi$ is a Dirac delta concentrated at $\hat A$.

Remark 3.3

In our framework, we consider a probability distribution merely on the matrix A of the dynamics and not on the matrix B. It is not hard to check that the arguments proposed and the results achieved in the next sections hold as well in an extended framework where the matrix B is possibly unknown. However, we preferred to consider the simpler case to keep the overall presentation as clear as possible. Moreover, it seems reasonable to assume that the agent does not know how the environment works (matrix A), while being aware of how the control affects the system (matrix B).

Remark 3.4 (Nonlinear systems and connection with RL)
We defined $\pi$ as a probability distribution on a space of matrices $\mathcal A$. From a higher perspective, we can identify $\mathcal A$ with a class of linear dynamical systems and see $\pi$ as a probability distribution on this class as well. More generally, one could consider a nonlinear dynamical system, assuming that the dynamics $f$ is unknown and only a probability distribution $\pi$ on a space $X$ of possible dynamics is available. In a similar way to what we did, given a cost functional, one could define a Problem B for the nonlinear problem (see [18,22]). A concrete example of a nonlinear Problem B related to RL is presented in PILCO, by Deisenroth and Rasmussen [6,7]. PILCO aims at approximating the actual dynamics with a Gaussian process (GP); the parameters of the GP are tuned according to the data gathered by an agent while interacting with the environment. Providing a probability distribution over the dynamics makes it possible to obtain a confidence interval for the actual dynamics at each point, rather than a pointwise estimate of the actual dynamics. We stress that a GP is a probability distribution on the set $X$ of bounded, continuous functions, and thus it constitutes a perfect candidate to play the role of $\pi$ in Problem B. In the policy improvement step, the optimal policy is then computed by minimizing the averaged cost over all possible realizations of the GP, in a similar way to how we defined the cost functional in (16).
The framework presented in this paper incorporates the approach presented in PILCO, as well as other probabilistic model-based RL methods [29], when the dynamics is linear and the cost is quadratic. We will make further considerations about the link with PILCO in Remark 5.5.

Preliminary results for Problem B
Let us start by showing a series of basic results on the existence and regularity of trajectories and optimal controls for the system under consideration.
In this section, we assume that $\pi$ is a probability distribution on a compact set of matrices $\mathcal A \subset M^{n \times n}$. Being $\mathcal A$ bounded, there exists a constant $C_{\mathcal A}$ such that $\|A\|_2 \le C_{\mathcal A}$ for every $A \in \mathcal A$. For a given matrix $A \in \mathcal A$ and an admissible control $u \in \mathcal U$, the notation $x_A(t; u)$ will denote the solution of (18) relative to $A$ and $u$; sometimes, when it is not ambiguous, we will omit the dependency on the control $u$ and write only $x_A(t)$.

Lemma 3.5 (Boundedness and continuity of trajectories)
Let us consider the dynamical system in (18). The following results hold:
(i) there exists a constant $C^u_x$, depending on $u$ and on $x_0$, such that $|x_A(t)| \le C^u_x$ for every $A \in \mathcal A$ and $t \in [s, T]$;
(ii) the map $A \mapsto x_A(\cdot)$ is continuous from $(\mathcal A, d_2)$ to $C([s, T]; \mathbb R^n)$;
(iii) for each fixed $A \in \mathcal A$, the map $u \mapsto x_A(\cdot\,; u)$ is continuous from $L^1([s, T]; \mathbb R^m)$ to $C([s, T]; \mathbb R^n)$.
Proof (i) Fix a control $u \in \mathcal U$. From (18), one has, for each $A \in \mathcal A$ and $t \in [s, T]$,
$|x_A(t)| \le |x_0| + \int_s^t \big( C_{\mathcal A}\, |x_A(\tau)| + \|B\|_2\, |u(\tau)| \big)\, d\tau,$
so by the Grönwall lemma (see, e.g., Lemma 2.4.4 in [28]) we get $|x_A(t)| \le C^u_x$ for every $A \in \mathcal A$ and $t \in [s, T]$. This shows condition (i).
(ii) Fix a control $u \in L^1([s, T]; \mathbb R^m)$ and consider the trajectories solving (18) relative to two different matrices $A$ and $A'$. If we define $z(t) := x_A(t) - x_{A'}(t)$, then $z$ solves the differential system $\dot z(t) = A x_A(t) - A' x_{A'}(t)$, $z(s) = 0$. Noticing that the right-hand side can be rewritten as $A z(t) + (A - A') x_{A'}(t)$, we can give the estimate
$|z(t)| \le \int_s^t \big( C_{\mathcal A}\, |z(\tau)| + C^u_x\, d_2(A, A') \big)\, d\tau.$
Applying again the Grönwall lemma to $z$, we get $\|z\|_\infty \le C\, d_2(A, A')$ for a suitable constant $C$, which implies the continuity of the map $A \mapsto x_A(\cdot)$.
(iii) Fix a matrix $A \in \mathcal A$ and consider the trajectories relative to two different controls $u, u' \in \mathcal U$. In a similar way as in (ii), we define $z(t) := x_A(t; u) - x_A(t; u')$, which solves the ODE system $\dot z(t) = A z(t) + B (u(t) - u'(t))$, $z(s) = 0$, and, by Grönwall's lemma, we get $\|z\|_\infty \le C\, \|u - u'\|_1$. The last inequality gives the continuity with respect to $u \in L^1([s, T]; \mathbb R^m)$.

The following result guarantees that the minimization problem (15) is well posed.

Lemma 3.6 (Existence and uniqueness of optimal processes) Let (SH) hold. Then Problem B (15) admits a unique optimal process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$, and the optimal control satisfies the bound
$\|\bar u\|_2 \le C_u,$ (23)
where $C_u$ does not depend on $\pi$ and is determined by the data of the problem and by $r_1$, the smallest eigenvalue of the matrix $R$.
Proof Consider a minimizing sequence $u_k \in \mathcal U$ for the cost functional $J_{s,\pi}$ defined in (16), and let us consider the system (18) when the control is $u_0 \equiv 0$; the corresponding process is admissible. Clearly, the cost achieved by the control $u_0$ can be estimated uniformly with respect to $\pi$, thanks to point (i) of Lemma 3.5. Furthermore, it follows from the construction of the minimizing sequence that, for $k$ large enough, $J_{s,\pi}[u_k] \le J_{s,\pi}[u_0]$. On the other hand, since the matrix $R > 0$, one has $r_1 \|u_k\|_2^2 \le J_{s,\pi}[u_k]$, where $r_1$ is the smallest eigenvalue of the matrix $R$.
Hence, one obtains a bound on the minimizing sequence which results in a uniformly bounded norm: $\|u_k\|_2^2 \le J_{s,\pi}[u_0]/r_1$, and the right-hand side is bounded uniformly with respect to $\pi$. In view of the previous relation, it follows from standard compactness arguments that $u_k \rightharpoonup \bar u$ weakly in $L^2([s, T]; \mathbb R^m)$. Since $u_k$ is uniformly bounded in $L^2$, using in turn the relation (22), the Hölder inequality and the relation (25), one obtains that there exists a constant $C_x > 0$ bounding the corresponding trajectories, and, in view of the linearity of the control system, the process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is the solution of the linear system (18) for each $A \in \mathcal A$. So the process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is a minimizer for Problem B and, in view of (24), $\bar u$ satisfies the bound (23). Since the functional $u \mapsto J_{s,\pi}[u]$ is strictly convex, the uniqueness of the minimizer follows from standard arguments. This completes the proof.

Remark 3.7
In view of the previous results, if $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is an optimal process for Problem B, then there exists a constant $C_x$ such that $|\bar x_A(t)| \le C_x$ for every $A \in \mathcal A$ and $t \in [s, T]$; the constant $C_x$ depends only on $x_0$, $T$, $C_{\mathcal A}$ and the bound (23), and does not depend on $\pi$.

Optimality conditions
Let us consider the optimal control problem
$\min \big\{ J_{s,\pi}[u] : u \in \mathcal U,\ u(t) \in U(t) \text{ a.e. } t \in [s, T] \big\},$ (27)
where $U : [s, T] \rightrightarrows \mathbb R^m$ is an $\mathcal L \times \mathcal B_{\mathbb R^m}$-measurable multifunction taking compact values and $J_{s,\pi}$ is the cost functional defined in (16) for a given, fixed $\pi \in \mathcal M(\mathcal A)$. For a given reference process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$, we assume that the following condition (TH) holds true. It follows from standard ODE theory that, when condition (TH) holds, every admissible process is well defined. We recall the following result, due to Bettiol and Khalil, which is a special case of Theorem 3.3 in [2]; that theorem is stated under stronger hypotheses, but a scrutiny of the proof given there reveals that the result still holds true under the relaxed condition (TH). Furthermore, Theorem 3.3 in [2] is derived for a Mayer optimal control problem, i.e., with only a final cost, but an analogous theorem for Bolza optimal control problems can easily be obtained by using a standard state augmentation argument.
We are now ready to prove the necessary optimality conditions for Problem B.

Theorem 4.4 Let the hypothesis (SH) hold. Then the minimizer $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ of Problem B (15) satisfies the optimality conditions (i)–(iv), expressed in terms of a family of multipliers $\{p_A(\cdot) : A \in \mathcal A\}$.

Proof Let us first observe that the optimal process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ exists and is unique, due to Lemma 3.6. Consider now the optimal control problem (28) obtained from (27) with the choice $U(t) = B_m(\bar u(t), 1)$. Clearly, since $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is a minimizer for Problem B, it is also a minimizer for problem (28). Furthermore, since any element $u \in U(t)$ can be written as $u = \bar u(t) + v$ for some $v \in B_m(0, 1)$, and in view of Remark 3.7, one can easily find $\delta > 0$ and a function $c \in L^2([s, T]; \mathbb R)$ such that the hypothesis (TH) is satisfied: the required bound holds for all $x \in B_n(\bar x_A(t), \delta)$, all $A \in \mathcal A$, all $u \in B_m(\bar u(t), 1)$ and a.e. $t \in [s, T]$, where $C_x$ is the constant appearing in Remark 3.7. So the process $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is a $W^{1,1}$ minimizer (see Definition 4.1) for the optimal control problem (28), and the hypothesis (SH) is satisfied. Then, one can invoke Theorem 4.2, which provides conditions (i), (iii) and (iv) of the statement. In order to obtain condition (ii), it is enough to observe that $\bar u(t)$ lies in the interior of $U(t)$ for a.e. $t \in [s, T]$, and that $\{(\bar x_A, \bar u)(\cdot) : A \in \mathcal A\}$ is the unique minimizer of Problem B.
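Although the paper works with the abstract maximum principle, the role of the multipliers can be made concrete: for the quadratic cost (16) one obtains, at least formally, the $L^2$ gradient $\nabla J(u)(t) = 2 R u(t) + B^\top E_\pi[p_A(t)]$, where $p_A$ solves the adjoint equation $\dot p_A = -A^\top p_A - 2 Q x_A$, $p_A(T) = 2 Q_f x_A(T)$ (the factor 2 depends on how the cost is normalized). The sketch below is an illustration of ours, not the paper's computation: it implements the exact adjoint of an Euler discretization for a discrete $\pi$, so the gradient can be verified against a finite difference.

```python
import numpy as np

def traj_and_adjoint(A, B, Q, Qf, x0, U, dt):
    """Explicit-Euler rollout of dot x = A x + B u and the exact adjoint
    (backpropagation) of the discretized cost; U has shape (N, m)."""
    n, N = x0.size, U.shape[0]
    Phi = np.eye(n) + dt * A                       # one Euler step map
    X = np.zeros((N + 1, n)); X[0] = x0
    for k in range(N):
        X[k + 1] = Phi @ X[k] + dt * (B @ U[k])
    Lam = np.zeros((N + 1, n)); Lam[N] = 2 * Qf @ X[N]
    for k in range(N - 1, -1, -1):
        Lam[k] = 2 * dt * (Q @ X[k]) + Phi.T @ Lam[k + 1]
    return X, Lam

def cost(alphas, mats, B, Q, R, Qf, x0, U, dt):
    """Discretized J_{0,pi}[u] for pi = sum_i alphas[i] * delta_{mats[i]}."""
    J = dt * sum(u @ R @ u for u in U)
    for a, A in zip(alphas, mats):
        X, _ = traj_and_adjoint(A, B, Q, Qf, x0, U, dt)
        J += a * (dt * sum(x @ Q @ x for x in X[:-1]) + X[-1] @ Qf @ X[-1])
    return J

def grad(alphas, mats, B, Q, R, Qf, x0, U, dt):
    """L^2 gradient: grad J(u)(t_k) = 2 R u_k + B^T E_pi[lambda(t_{k+1})]."""
    G = 2 * U @ R
    for a, A in zip(alphas, mats):
        _, Lam = traj_and_adjoint(A, B, Q, Qf, x0, U, dt)
        G = G + a * (Lam[1:] @ B)
    return G
```

Since the discretized cost is quadratic in the control, a central finite difference matches the adjoint-based directional derivative up to roundoff, which makes the formula easy to sanity-check.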
The following result guarantees that the extension defined in Remark 4.5 is continuous with respect to A:

Lemma 4.6 (Boundedness and continuity of multipliers) Let us consider the multipliers defined in (30) for each $A \in \mathcal A$. They have the following properties:
(i) there exists a positive constant $C_p$, independent of $\pi$, which bounds all multipliers uniformly, i.e., $\|p_A(\cdot)\|_\infty \le C_p$ for every $A \in \mathcal A$;
(ii) the map $A \mapsto p_A(\cdot)$ is continuous from $(\mathcal A, d_2)$ to $C([s, T]; \mathbb R^n)$.

Proof The proof is similar to that of Lemma 3.5, with the only difference that here we need to apply the Grönwall lemma backward instead of forward. Notice that the final condition is not the same for all $A \in \mathcal A$, whereas the initial condition was the same in Lemma 3.5. However, the final condition is still continuous with respect to $A$ by Lemma 3.5, so the same arguments can be applied to prove (ii).

Main convergence results
In this section, we will present the main results, which are, respectively, the convergence of the value functions (Corollary 5.2) and the convergence of the optimal controls (Theorem 5.3). Given a sequence of probability distributions $\{\pi_N\} \subset \mathcal M(\mathcal A)$, for each $N \in \mathbb N$ we consider Problem B$_{\pi_N}$, namely problem (15) relative to the distribution $\pi_N$. Recalling the definition of the value function in (17), we set $V_N := V_{\pi_N}$ and denote by $u_N$ the corresponding optimal control. Assuming that $\pi_N \rightharpoonup^* \pi_\infty$, what can be said about the convergence of the value functions $V_N$ to $V_\infty$ and of the optimal controls $u_N$ to $u_\infty$ as $N \to \infty$? In this section, we give an answer to those questions.
Theorem 5.1 (Lipschitz estimate for the value function w.r.t. π) Let the assumption (SH) be satisfied. Given $\pi, \pi' \in \mathcal M(\mathcal A)$ and $(s, x_0) \in [0, T] \times K$, with $K \subset \mathbb R^n$ a compact set, let us consider the two value functions $V_\pi$ and $V_{\pi'}$ as defined in (17). Then, the distance between $V_\pi$ and $V_{\pi'}$ can be bounded uniformly for $(s, x_0) \in [0, T] \times K$, that is,
$|V_\pi(s, x_0) - V_{\pi'}(s, x_0)| \le C_K\, W_1(\pi, \pi'),$
where $C_K = C_K(T, C_{\mathcal A}, \|Q\|_2, \|Q_f\|_2, r_1, K)$ is a constant which does not depend on the distributions $\pi$ and $\pi'$, but merely on the compact set $\mathcal A$.
Proof We divide the proof into three steps.

STEP 1: Fix two matrices $A, A' \in \mathcal A$, a point $(s, x_0) \in [0, T] \times \mathbb R^n$ and a control $u \in \mathcal U$. Using Grönwall's lemma as we did for point (ii) of Lemma 3.5, we get the estimate $\|x_A(\cdot) - x_{A'}(\cdot)\|_\infty \le C^u_x\, d_2(A, A')$, with $C^u_x$ given by Lemma 3.5. Let us denote by $\ell(x, u) := x^\top Q x + u^\top R u$ the running cost and by $h(x) := x^\top Q_f x$ the final cost, so that
$J_{s,\pi}[u] = \int_{\mathcal A} \Big[ \int_s^T \ell(x_A(t), u(t))\, dt + h(x_A(T)) \Big] d\pi(A).$
Since both $\ell$ and $h$ are locally Lipschitz continuous, one has
$|\ell(x_A(t), u(t)) - \ell(x_{A'}(t), u(t))| \le L^u_\ell\, |x_A(t) - x_{A'}(t)|$
and, similarly,
$|h(x_A(T)) - h(x_{A'}(T))| \le L^u_h\, |x_A(T) - x_{A'}(T)|,$
where we defined the Lipschitz constants $L^u_\ell := \|Q\|_2 C^u_x$ and $L^u_h := \|Q_f\|_2 C^u_x$; these two constants inherit from $C^u_x$ the dependency on $x_0$ and $u$. Finally, the cost difference between two single trajectories can easily be bounded by
$\Big| \int_s^T \ell(x_A(t), u(t))\, dt + h(x_A(T)) - \int_s^T \ell(x_{A'}(t), u(t))\, dt - h(x_{A'}(T)) \Big| \le \big( T L^u_\ell + L^u_h \big)\, C^u_x\, d_2(A, A').$ (33)

STEP 2: Fix an initial condition $x(s) = x_0 \in \mathbb R^n$ with $s \in [0, T]$ and a control $u \in \mathcal U$. We want to prove a bound for the distance between $J_{s,\pi}[u]$ and $J_{s,\pi'}[u]$. By a well-known property of $W_1$, there exists (see Theorem 4.1 in [27]) a probability distribution $\gamma^* \in \Gamma(\pi, \pi')$ on $\mathcal A \times \mathcal A$, with marginal distributions $\pi$ and $\pi'$, such that
$W_1(\pi, \pi') = \int_{\mathcal A \times \mathcal A} d_2(A, A')\, d\gamma^*(A, A'),$
where $d_2$ is the distance introduced in (7). We can thus write the difference $J_{s,\pi}[u] - J_{s,\pi'}[u]$ as an integral with respect to $\gamma^*$, where we have used that
$\int_{\mathcal A} \varphi\, d\pi - \int_{\mathcal A} \psi\, d\pi' = \int_{\mathcal A \times \mathcal A} \big( \varphi(A) - \psi(A') \big)\, d\gamma^*(A, A')$ (34)
for all measurable functions $\varphi, \psi$ on $\mathcal A$, since $\gamma^*$ admits $\pi$ and $\pi'$ as marginals. Using the bound (33) from STEP 1 and formula (34), we get
$|J_{s,\pi}[u] - J_{s,\pi'}[u]| \le \big( T L^u_\ell + L^u_h \big)\, C^u_x\, W_1(\pi, \pi').$ (35)
Note that the constant $C^u_x$ which appears here depends merely on $x_0$ and $u$.

STEP 3: We will now show that an estimate similar to (35) holds true for the value functions $V_\pi$ and $V_{\pi'}$.
Fix any point $(s, x_0) \in [0, T] \times K$. In view of Lemma 3.6, there exist controls $\bar u, \bar u' \in \mathcal U$ such that $V_\pi(s, x_0) = J_{s,\pi}[\bar u]$ and $V_{\pi'}(s, x_0) = J_{s,\pi'}[\bar u']$. Then, one has
$V_\pi(s, x_0) - V_{\pi'}(s, x_0) \le J_{s,\pi}[\bar u'] - J_{s,\pi'}[\bar u'] \le |J_{s,\pi}[\bar u'] - J_{s,\pi'}[\bar u']|$
and, in the same way,
$V_{\pi'}(s, x_0) - V_\pi(s, x_0) \le |J_{s,\pi'}[\bar u] - J_{s,\pi}[\bar u]|.$
Hence, $|V_\pi(s, x_0) - V_{\pi'}(s, x_0)|$ is bounded by the larger of the two right-hand sides. Moreover, being both $\bar u$ and $\bar u'$ optimal controls for some distribution on $\mathcal A$, we can use the uniform constant $C_x$ given by Remark 3.7 and define analogous uniform constants for the Lipschitz continuity of $\ell$ and $h$: $L_\ell := \|Q\|_2 C_x$ and $L_h := \|Q_f\|_2 C_x$. In this way, the estimate becomes independent of $\bar u$ and $\bar u'$, and thus of $\pi$ and $\pi'$:
$|V_\pi(s, x_0) - V_{\pi'}(s, x_0)| \le C_K\, W_1(\pi, \pi').$
Finally, noting that the estimate depends on $x_0$ only through its norm, we get that the bound is uniform on compact sets $K \subset \mathbb R^n$, letting $C_K := (T L_\ell + L_h)\, C_x$. This concludes the proof.

A straightforward consequence of the previous theorem is the following:

Corollary 5.2 (Convergence of the value functions) Let (SH) hold and let $\{\pi_N\} \subset \mathcal M(\mathcal A)$ be such that $\pi_N \rightharpoonup^* \pi_\infty$. Then $V_N \to V_\infty$ uniformly on $[0, T] \times K$, for every compact set $K \subset \mathbb R^n$.

In what follows, we will use $\bar u_N(\cdot)$ to denote the optimal control of Problem B$_{\pi_N}$. Furthermore, $\bar x^N_A(\cdot)$ and $p^N_A(\cdot)$ denote, respectively, the optimal trajectories and the multipliers relative to Problem B$_{\pi_N}$, that is, the solutions of the differential systems (18) and (30).
The following theorem provides the strong convergence of $\bar u_N(\cdot)$ to the optimal control $\bar u_\infty(\cdot)$ of the limit problem B$_{\pi_\infty}$, assuming that $\pi_N \rightharpoonup^* \pi_\infty$.
Proof Without loss of generality, one can take $s = 0$, all the other cases being similar.
Lemma 3.6 assures that, for each $\pi_N \in \mathcal M(\mathcal A)$, there exists a unique optimal process $\{(\bar x^N_A, \bar u_N)(\cdot) : A \in \mathcal A\}$. Taking into account Theorem 4.4 and Remark 4.5, that process satisfies the following necessary conditions: for each $N \in \mathbb N$, there exists a continuous function $p^N$ satisfying conditions (i)–(iv). For each $A \in \mathcal A$, the family of functions $\{(\bar x^N_A, p^N_A)(\cdot)\}_N$ is uniformly bounded, due to Lemmas 3.5 and 4.6. Moreover, one can find uniform bounds (36) on the derivatives $\dot{\bar x}^N_A$ and $\dot p^N_A$, for all $A \in \mathcal A$ and a.e. $t \in [0, T]$. The second bound in (36) and relation (ii) imply that the map $t \mapsto \bar u_N(t)$ is also equibounded and equicontinuous in $N$. For each fixed $A \in \mathcal A$, one can then apply Theorem 2.5.3 in [28], implying the existence of limit functions $(x^\infty_A, p^\infty_A, u^\infty)(\cdot)$; the limiting triple is a solution of the boundary value problem (37). Furthermore, since $\pi_N \rightharpoonup^* \pi_\infty$, $u^\infty$ satisfies relation (ii) relative to $\pi_\infty$: in fact, this follows from an estimate holding for all $t \in [0, T]$, which implies the desired convergence. Notice that the convergence is guaranteed along a subsequence, but one can say that the whole sequence converges, since the limit does not depend on the chosen subsequence (it solves (37)). It remains to show that the limiting process $\{(x^\infty_A, u^\infty)(\cdot) : A \in \mathcal A\}$ is actually optimal for Problem B (15) relative to $\pi_\infty$. To this aim, let us stress the following properties of the cost functional $J_{0,\pi}$ in (16), using the lighter notation $J_N := J_{0,\pi_N}$ and $J_\infty := J_{0,\pi_\infty}$:
1) $J_N[u] \to J_\infty[u]$ for every fixed $u$, since $\pi_N \rightharpoonup^* \pi_\infty$ and the map $A \mapsto x_A(\cdot\,; u)$ is continuous by Lemma 3.5;
2) each $J_N$ is continuous with respect to $u$, since for each $A \in \mathcal A$ the map $u \mapsto x_A(\cdot\,; u)$ is continuous, again by Lemma 3.5.
Since $\bar u_N$ is the optimal control of Problem B$_{\pi_N}$ and $u$ is an admissible control for the same problem, we get $J_N[\bar u_N] \le J_N[u]$ for all $N \in \mathbb N$; so, letting $N \to \infty$, it easily follows from the previous relation and properties 1) and 2) that $J_\infty[u^\infty] \le J_\infty[u]$ for every admissible $u$. In view of the uniqueness of the optimal control $\bar u_\infty$ for Problem B$_{\pi_\infty}$, one can conclude that $u^\infty \equiv \bar u_\infty$. Hence, the process $\{(x^\infty_A, u^\infty)(\cdot) : A \in \mathcal A\}$ is optimal for the given problem. This concludes the proof.

Remark 5.4 (Special case: π = δ_Â)

The particular case in which $\pi$ is a Dirac delta $\delta_{\hat A}$ for a given matrix $\hat A \in M^{n \times n}$ deserves special attention. Indeed, when $\pi = \delta_{\hat A}$, the cost functional $J_{s,\pi}$ in (16) becomes the cost functional $J_s$ in (11), and Problem B in (15) coincides with a standard Problem A (see (10)). Furthermore, the definitions of the value function and of the optimal control we gave in (17) and (19) agree, in this special case, with the classic definitions in control theory (see (13) and (14)). If we apply Corollary 5.2 and Theorem 5.3 to a sequence $\pi_N$ converging to $\delta_{\hat A}$, then we obtain that $V_N \to V$ uniformly on compact sets and $\bar u_N \to \bar u$ uniformly, where $V_N$ and $\bar u_N$ are, respectively, the value function in (17) and the optimal control in (19) relative to $\pi_N$.

Remark 5.5 (Additional remarks on the connection with RL)
Let us comment on the results of this section bearing in mind the PILCO algorithm, presented in Remark 3.4. Recall that PILCO uses the agent's experience of the environment to tune the parameters of a Gaussian process (GP), building up a stochastic model of the dynamics. The GP is then updated as the agent collects more data on the environment. This procedure generates a sequence of probability distributions $\pi_N$ on the space of continuous functions, which should get sufficiently close to the Dirac delta representing the actual dynamics. Now, assume that we are applying PILCO to the linear system (12), in which the agent does not know the matrix $\hat A$. Then, one can think of $\{\pi_N\}_{N \in \mathbb N}$ as a sequence of probability distributions on the space of matrices $M^{n \times n}$ converging to a Dirac delta concentrated at the true matrix $\hat A$. If, at each step, the agent picks the control $\bar u_N$, namely the one minimizing the expected cost with respect to $\pi_N$, and $\pi_N \rightharpoonup^* \pi_\infty$, then one can apply Theorem 5.3 to the special case presented in Remark 5.4 and obtain the uniform convergence $\bar u_N \to \bar u_\infty$, where $\bar u_\infty$ is the optimal control for the limit problem.
This suggests that even if the distribution $\pi_N$ never exactly reaches $\delta_{\hat A}$ but only gets sufficiently close to it, the control $\bar u_N$ is a good suboptimal control for the actual LQR problem.

A case study: finite support measures converging to δ_Â
In this section, we will assume that $\mathcal A$ is a finite set, namely $\mathcal A := \{A_1, \ldots, A_M\}$ for some integer $M \in \mathbb N$. Let us consider a sequence of probability distributions $\{\pi_N\} \subset \mathcal M(\mathcal A)$, which can be written as
$\pi_N = \sum_{i=1}^M \alpha^N_i\, \delta_{A_i}, \qquad \alpha^N_i \ge 0, \qquad \sum_{i=1}^M \alpha^N_i = 1.$
For a given $s \in [0, T]$ and $x_0 \in \mathbb R^n$, suppose that the underlying dynamics governing the optimal control problem we are interested in is a standard Problem A (see (10)):
$\min_{u \in \mathcal U} J_s[u],$ (42)
where $\hat A \in \mathcal A$ and the cost functional $J_s[u]$ is defined as in (11). Without loss of generality, one can set $\hat A \equiv A_1$. For each $s \in [0, T]$ and $x_0 \in \mathbb R^n$, the value function $V(s, x_0)$ for this problem has been defined in (13). One can then expect that, after some interactions with the system, it is possible to construct a sequence of probability distributions $\{\pi_N\} \subset \mathcal M(\mathcal A)$ capturing the current belief that one has about the real system (42) and such that, when $N$ is sufficiently large, $\pi_N$ gets closer and closer to $\delta_{A_1}$ and eventually $\pi_N \rightharpoonup^* \delta_{A_1}$.
For each fixed $N \in \mathbb N$, one can reformulate Problem B associated with $\pi_N$ as a classical LQR problem on an augmented system of dimension $nM$, namely
$\dot X(t) = \mathbb A X(t) + \mathbb B u(t), \qquad \mathbb A := \mathrm{diag}(A_1, \ldots, A_M), \quad \mathbb B := (B, \ldots, B),$ (44)
with cost functional obtained by weighting the state-cost blocks by the coefficients $\alpha^N_i$, where we have used the compact notation $X(t) := (x_{A_1}(t), \ldots, x_{A_M}(t))$. In this section, we will use $V_N(s, X_0)$ to denote the value function related to the LQR problem (44). Since the optimal control problem (44) can be regarded as a classic LQR problem, one has the following relation between the value function and the optimal control in feedback form (see, e.g., [13], Theorem 3.4): there exists $P_N$ such that
$V_N(s, X_0) = X_0^\top P_N(s) X_0, \qquad \bar u_N(t) = -R^{-1} \mathbb B^\top P_N(t) X(t),$ (48)
where $[s, T] \ni t \mapsto P_N(t) \in M^{nM \times nM}$ solves the Riccati equation (49). For $x_0 \in \mathbb R^n$, we will use the notation $V_N(s, x_0)$ to denote the value function $V_N(s, (x_0, \ldots, x_0))$. We can summarize the results of the previous sections applied to problem (44) as follows (Corollary 6.1):
(i) problems (44) and (42) admit a unique optimal process $\{(\bar x^N_A, \bar u_N)(\cdot) : A \in \mathcal A\}$ and $(\bar x, \bar u)(\cdot)$, respectively;
(ii) for each compact $K \subset \mathbb R^n$, there exists $C_K$ such that $|V_N(s, x_0) - V(s, x_0)| \le C_K\, W_1(\pi_N, \delta_{\hat A})$ for all $(s, x_0) \in [0, T] \times K$;
(iii) if, moreover, $\{\pi_N\}$ is such that $\pi_N \rightharpoonup^* \delta_{\hat A}$, then the optimal control $\bar u_N \to \bar u$ uniformly for $t \in [s, T]$.
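For a discrete $\pi_N$, the augmented reformulation amounts to stacking the $M$ dynamics into one block-diagonal system and weighting the state-cost blocks by the $\alpha^N_i$. A sketch of the construction (in Python rather than the paper's MATLAB; the helper names are ours):

```python
import numpy as np

def block_diag(blocks):
    """Block-diagonal matrix from a list of square blocks (zeros elsewhere)."""
    sizes = [b.shape[0] for b in blocks]
    out = np.zeros((sum(sizes), sum(sizes)))
    pos = 0
    for b, s in zip(blocks, sizes):
        out[pos:pos + s, pos:pos + s] = b
        pos += s
    return out

def augmented_lqr(alphas, mats, B, Q, Qf):
    """LQR data of the nM-dimensional reformulation for a discrete pi.

    State X = (x_{A_1}, ..., x_{A_M}); the running state cost
    sum_i alpha_i x_i' Q x_i becomes X' Q_aug X with
    Q_aug = diag(alpha_1 Q, ..., alpha_M Q), and similarly for Qf."""
    A_aug = block_diag(mats)
    B_aug = np.vstack([B] * len(mats))
    Q_aug = np.kron(np.diag(alphas), Q)
    Qf_aug = np.kron(np.diag(alphas), Qf)
    return A_aug, B_aug, Q_aug, Qf_aug
```

The control-cost matrix $R$ is unchanged, since the same control acts on every block.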
For each $N \in \mathbb N$, the solution of (49) is related to the matrices $X_N, Y_N : [s, T] \to M^{nM \times nM}$ such that the pair $(X_N, Y_N)(\cdot)$ is the solution of the backward Hamiltonian differential equation (51), where $H_N$ is the Hamiltonian matrix in (52). The relation between the solutions of (49) and (51) is $P_N(t) = Y_N(t) X_N(t)^{-1}$, and $P_N(t) \to \bar P(t)$ uniformly on $t \in [s, T]$, where $\bar P$ is built from the matrix $P(t) \in M^{n \times n}$, the solution of the Riccati equation (54) related to the optimal control problem (42) with state matrix $A_1$.

Proof Consider the Hamiltonian systems in (51) related to the different distributions $\pi_N$. Notice that the norm of $H_N$ in (52) can be bounded uniformly in $N$. Using Grönwall's lemma, one can easily show that the pair of matrices $(X_N, Y_N)$ solving (51) is uniformly bounded and that, using again (51), $(\dot X_N, \dot Y_N)$ is uniformly integrally bounded, for all $N \in \mathbb N$. So it is possible to apply Theorem 2.5.3 in [28] to show that the pair $(X_N, Y_N)$ converges to a pair of matrices $(X_\infty, Y_\infty)$ solving the limit system (55). Then, in view of Theorem 6.2, the matrix $X_\infty(t)$ is nonsingular for each $t \in [s, T]$, and the matrix $P_\infty(t) := Y_\infty(t) X_\infty(t)^{-1}$ is the solution of the Riccati equation (57). Furthermore, since each $X_N(t)^{-1}$ is continuous and well defined for each $N \in \mathbb N$ and $X_\infty(t)^{-1}$ is uniformly continuous on $[s, T]$, we get $P_N(t) \to P_\infty(t)$ uniformly on $t \in [s, T]$. Finally, consider the matrix $\bar P(t) \in M^{nM \times nM}$ built from $P(t) \in M^{n \times n}$, the unique solution of the Riccati equation (54). A direct verification shows that $\bar P(t) \in M^{nM \times nM}$ also satisfies the Riccati equation (57). However, problem (57) admits a unique solution, implying that $P_\infty(t) \equiv \bar P(t)$ for all $t \in [s, T]$. This concludes the proof.
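The identity $P_N(t) = Y_N(t) X_N(t)^{-1}$ can be checked numerically: integrating the linear Hamiltonian system $\dot X = AX - BR^{-1}B^\top Y$, $\dot Y = -QX - A^\top Y$ backward from $X(T) = I$, $Y(T) = Q_f$ and forming $Y X^{-1}$ reproduces the solution of the Riccati equation $\dot P = -(A^\top P + P A - P B R^{-1} B^\top P + Q)$, $P(T) = Q_f$. A sketch of ours with a fixed-step RK4 integrator (the data are illustrative):

```python
import numpy as np

def rk4_backward(f, ZT, T, steps):
    """Integrate dZ/dt = f(Z) backward from Z(T) = ZT to t = 0, fixed-step RK4."""
    h = -T / steps
    Z = ZT
    for _ in range(steps):
        k1 = f(Z); k2 = f(Z + 0.5 * h * k1)
        k3 = f(Z + 0.5 * h * k2); k4 = f(Z + h * k3)
        Z = Z + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return Z

def riccati_vs_hamiltonian(A, B, Q, R, Qf, T=1.0, steps=400):
    """Return P(0) computed (a) from the Riccati ODE directly and
    (b) as Y(0) X(0)^{-1} from the Hamiltonian flow; they should agree."""
    n = A.shape[0]
    S = B @ np.linalg.solve(R, B.T)          # B R^{-1} B'
    P0 = rk4_backward(lambda P: -(A.T @ P + P @ A - P @ S @ P + Q), Qf, T, steps)
    def ham(Z):
        X, Y = Z[:n], Z[n:]
        return np.vstack([A @ X - S @ Y, -Q @ X - A.T @ Y])
    Z0 = rk4_backward(ham, np.vstack([np.eye(n), Qf]), T, steps)
    return P0, Z0[n:] @ np.linalg.inv(Z0[:n])
```

Both computations are fourth-order approximations of the same exact $P(0)$, so their discrepancy is of the order of the integration error.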
As a consequence, $\bar u_N$ tends to $\bar u$ as $N \to +\infty$; namely, the optimal control of problem (44), which satisfies formula (48), evaluated at $X_0 = (x_0, \ldots, x_0)$, converges pointwise to the optimal control of problem (42). Furthermore, the convergence is uniform on each compact set $K \subset \mathbb R^n$. It is important to point out that, whereas Theorem 5.3 proves the convergence for the class of optimal open-loop controls, Theorem 6.3 deals with the convergence of optimal controls in feedback form.
It remains an open question whether Theorem 6.3 can be proved for a generic sequence of probability measures π N N ∈N converging weakly-* to a generic probability measure π . Such an issue is delicate and will be studied in a forthcoming paper.

A numerical example
The aim of this section is to verify that the results summarized in Corollary 6.1 hold in a concrete example.
The model is the one presented in Sect. 6. The true dynamics is a controlled harmonic oscillator, described by the matrix $\hat A$. Note that the Wasserstein distance, with respect to the Euclidean norm on $\mathbb{R}^{2 \times 2}$, between $\pi_N$ and $\delta_{\hat A}$ can be computed exactly. Both Problem A, that is, the LQR problem with the matrix $\hat A$ known, and Problem B, that is, the LQR problem with the matrix $\hat A$ unknown, can be solved by finding the solution of a Riccati equation (see §IV.5 of the monograph by Fleming and Rishel [9]). We solved the equation numerically for $N = 0, \dots, 9$. For each $N$, we computed the sup norm of the difference $V_N - V$ and of the difference $\bar u_N - \bar u$, where $\bar u_N$ and $\bar u$ are, respectively, the optimal controls for the two problems starting from $x_0 = (1, 0)$. The results are summarized in Table 1. All computations were performed in MATLAB on a 13-inch 2017 MacBook Air with a dual-core 1.8 GHz Intel Core i5 processor.
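Finite-horizon Riccati equations of this kind are straightforward to integrate numerically. The following sketch, in Python rather than the MATLAB used for the paper's experiments, integrates a backward matrix Riccati ODE for a controlled harmonic oscillator; the matrices $B$, $Q$, $R$, the horizon $T$, and the terminal cost are illustrative choices, not the paper's exact data.

```python
import numpy as np

# Hypothetical instance of the setup in Sect. 7: harmonic-oscillator state
# matrix A; B, Q, R, QT, and T are illustrative, not the paper's data.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # controlled harmonic oscillator
B = np.array([[0.0], [1.0]])
Q = np.eye(2)           # running state cost
R = np.array([[1.0]])   # running control cost
QT = np.zeros((2, 2))   # terminal cost P(T)
T, steps = 5.0, 5000
dt = T / steps

def riccati_rhs(P):
    # Matrix Riccati ODE:  dP/dt = -(A'P + P A - P B R^{-1} B' P + Q)
    return -(A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T) @ P + Q)

# Integrate backward from t = T to t = 0 with a fixed-step RK4 scheme
# (step -dt, so intermediate stages subtract the increments).
P = QT.copy()
for _ in range(steps):
    k1 = riccati_rhs(P)
    k2 = riccati_rhs(P - 0.5 * dt * k1)
    k3 = riccati_rhs(P - 0.5 * dt * k2)
    k4 = riccati_rhs(P - dt * k3)
    P = P - dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Optimal feedback at t = 0:  u(x) = -R^{-1} B' P(0) x
K = np.linalg.solve(R, B.T @ P)
x0 = np.array([1.0, 0.0])
u0 = -K @ x0
print(P)    # P(0): symmetric, positive semidefinite
print(u0)   # optimal control at t = 0 from x0
```

The same backward integration, applied to the block Riccati equation of Problem B for each $\pi_N$, produces the controls $\bar u_N$ compared in Table 1.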
Notice that each time we increase $N$ by one, the distance $W_1(\pi_N, \delta_{\hat A})$ is halved, and Table 1 shows that the error $\|V_N - V\|_{\infty, K}$, with $K := [-2, 2]^2$, is halved as well; this is consistent with the estimate given by Corollary 6.1. At the same time, we remark that the error $\|\bar u_N - \bar u\|_\infty$ is also halved, even though we did not have any estimate on the convergence rate of the optimal controls. In this example, therefore, both errors go to 0 with order 1.
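The halving of $W_1(\pi_N, \delta_{\hat A})$ is easy to reproduce: when the target measure is a single Dirac mass, every unit of mass of a discrete $\pi$ must be transported to the one atom $\hat A$, so the optimal plan is forced and the distance has a closed form. A minimal sketch (the weights and the alternative matrix `A_wrong` are illustrative, not the paper's data):

```python
import numpy as np

def w1_to_dirac(alphas, mats, A_hat):
    # For pi = sum_i alphas[i] * delta_{mats[i]}, all mass must be moved
    # to A_hat, so W1(pi, delta_{A_hat}) = sum_i alphas[i] * ||mats[i] - A_hat||
    # (Frobenius norm, i.e. the Euclidean norm on R^{2x2}).
    return sum(a * np.linalg.norm(M - A_hat) for a, M in zip(alphas, mats))

# Illustrative data: a belief concentrating geometrically on the true matrix.
A_hat = np.array([[0.0, 1.0], [-1.0, 0.0]])
A_wrong = np.array([[0.0, 1.0], [-2.0, 0.0]])
for N in range(4):
    alpha1 = 1.0 - 2.0 ** (-N - 1)   # mass on the true matrix A_hat
    d = w1_to_dirac([alpha1, 1.0 - alpha1], [A_hat, A_wrong], A_hat)
    print(N, d)   # distance halves at each step: 0.5, 0.25, 0.125, 0.0625
```

With two atoms as above, the distance reduces to $(1 - \alpha_1)\,\|A_2 - \hat A\|$, which makes the geometric decay of Table 1 immediate.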
The optimal trajectory of Problem A starting from $x_0 = (1, 0)$ is represented in Fig. 1. For Problem B, the optimal trajectory actually consists of 9 trajectories, whose costs are averaged with the weights of $\pi_N$ in order to compute the cost functional $J$. Two examples of optimal (multi-)trajectory, for $\pi_0$ and $\pi_2$ respectively, are represented in Figs. 2 and 3; note that the trajectory related to the true dynamics is $x_1(t)$, the darkest one. Finally, in Fig. 4 the optimal controls for Problem B with $N = 0, \dots, 4$ are compared with the true optimal control.

Fig. 4 Comparison of the optimal controls of Problem B relative to different probability distributions $\pi_N$, with $N = 0, \dots, 4$, and the true optimal control of Problem A (in blue). The legend reports, for each $N$, the probability $\alpha_1$ that the true matrix $A_1 \equiv \hat A$ has under the distribution $\pi_N$. As $\alpha_1 \to 1$, the optimal control of Problem B converges to the true one.

Conclusions
In this paper, we proved some convergence properties for the optimal policies of LQ optimal control problems with uncertainties (our Problem B), assuming that the current belief on the dynamics is represented by a generic probability distribution $\pi$ on the space of matrices. Under standard hypotheses on the system dynamics and the cost functional, we proved that the open-loop optimal control $\bar u_\pi$ of Problem B converges to the open-loop optimal control $\bar u$ of the actual system as soon as the distribution $\pi$ is sufficiently close (with respect to the Wasserstein distance (9)) to the Dirac delta $\delta_{\hat A}$ centered at the actual system matrix $\hat A$. We also showed that, when the probability distribution $\pi$ is a discrete measure, the closed-loop optimal control of Problem B likewise converges to the closed-loop optimal control of the actual system as $\pi$ approaches $\delta_{\hat A}$. The latter result was validated by the numerical example presented in Sect. 7.
It is worth stressing that the proposed approach has strong connections with several Bayesian-like RL algorithms (such as PILCO), providing a theoretical framework to obtain stability and convergence guarantees for such algorithms.
As a future direction, we would like to extend this approach to a nonlinear, control-affine optimal control problem with a convex functional, getting closer to the problem formulation studied in [7]. A first attempt in this direction has been recently proposed in [22]. Furthermore, we are interested in constructing new, efficient RL algorithms using well-established tools from control theory. In this context, we have already developed a new method for solving LQR problems with unknown dynamics [20].
Funding Open access funding provided by Universitá degli Studi di Roma La Sapienza within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.