Discrete-time inverse linear quadratic optimal control over finite time-horizon under noisy output measurements

In this paper, the problem of inverse quadratic optimal control over a finite time-horizon for discrete-time linear systems is considered. Our goal is to recover the corresponding quadratic objective function from noisy observations. First, the identifiability of the model structure for the inverse optimal control problem is analyzed under a relative degree assumption, and we show that the model structure is strictly globally identifiable. Next, we study the inverse optimal control problem in which the initial state distribution and the observation noise distribution are unknown, yet exact observations of the initial states are available. We formulate the problem as a risk minimization problem and approximate it by its empirical average. It is further shown that the solution to the approximated problem is statistically consistent under the assumption of relative degrees. We then study the case where exact observations of the initial states are not available, yet the observation noise is known to be white Gaussian and the distribution of the initial state is also Gaussian (with unknown mean and covariance). The EM algorithm is used to estimate the parameters in the objective function. The effectiveness of our results is demonstrated by numerical examples.


Introduction
Since it was first proposed in [1], there have been numerous applications of inverse optimal control (IOC) [2][3][4]. For example, IOC has been widely developed as a powerful tool for understanding the optimality criteria underlying biological motions, which can then be used for control synthesis of bionic robots. The goal of the "forward" optimal control problem is to find the optimal control input and the corresponding state trajectory given the cost function, the system dynamics and the initial value of the states. In contrast, for a given system dynamics, the IOC problem focuses on recovering the undetermined parameters in the cost function that correspond to the observed system states or output measurements (possibly noisy).
In this paper, the problem of IOC for discrete-time linear quadratic regulators (LQRs) over a finite time-horizon is considered. As an extension of our earlier work [5], the identifiability of the corresponding model structure is also fully studied. Specifically, our goal is to reconstruct the unknown parameters in the quadratic objective function using the given discrete-time linear system dynamics and noisy measurements of its output.
The inverse linear quadratic optimal control problem has been widely studied during the past decades, particularly in the continuous-time infinite time-horizon case, e.g., [6][7][8]. In most works, it is assumed that the optimal feedback gain K, which is a constant matrix in the infinite time-horizon case, is measured exactly. A classical method [9] is to formulate the problem as an optimization problem with linear matrix inequality (LMI) constraints and solve for the optimal Q and R when the feedback gain K is known. Nevertheless, in many scenarios, the feedback control gain K cannot be known exactly and we can only measure the optimal state sequence or the corresponding system output. In [10], given noisy state observations, the discrete-time infinite time-horizon case is studied, in which the optimal feedback gain K is time-invariant. Their approach is to first identify the feedback matrix K from noisy observations by a maximum-likelihood estimator. Then the matrices Q and R can be solved for in a similar way to the method proposed in [9].
However, in the finite time-horizon case the feedback gain K_t is time-varying. Since identifying a time-varying system is in general not an easy task, such a "two-step" approach, i.e., "first identify the feedback gain using classical system identification methods, then search for the corresponding parameters in the quadratic objective function", cannot be directly applied. Furthermore, one critical property ignored by the first identification step is that the gains K_t are actually generated by an LQR. Hence, it is difficult to guarantee the performance of such a method in the finite time-horizon case, and the optimality underlying the observations could have been utilized to further improve performance.
Since linear quadratic regulators can be seen as a special case of nonlinear optimal control problems, more generally speaking, the problem of IOC for quadratic regulators also lies in the scope of IOC problems for nonlinear systems. Usually, the considered objective function is composed of a weighted sum of base functions, to which various parameter identification methods or learning techniques are applied. For example, in [11,12], the optimality conditions for the forward optimal control problem are analyzed and the undetermined parameters in the cost function are estimated by minimizing the violation of these conditions. Nevertheless, [13] showed that such methods are not statistically consistent and are sensitive to observation noise. Instead, a statistically consistent formulation is proposed in [13], though the resulting optimization problem is difficult to solve. The discrete-time finite-horizon case is also considered in [14,15] based on Pontryagin's Maximum Principle (PMP) for the optimal control problem. An optimization problem is formulated whose constraints are two of the three conditions of PMP, and the residual of the third PMP condition is minimized. It is assumed therein that the optimal control input signal is known exactly, while in our case we only assume that the initial values and noisy system outputs are known. Moreover, though the authors claim that the problem of identifiability is considered, the "identifiability" considered therein does not coincide with the traditional definition of identifiability. In addition, the statistical consistency of the estimation is not established therein, hence the performance under noisy observations cannot be guaranteed. In [16], the authors consider discrete-time IOC for nonlinear systems when some segments of the trajectory and input observations are missing. However, their analysis of identifiability likewise does not accord with the traditional definition of identifiability, and statistical consistency is not guaranteed.
The aforementioned methods for estimating the parameters of a nonlinear optimal control problem do include inverse LQ optimal control as a special case. However, by taking advantage of the special structure of the LQR, we can further prove the identifiability of the model structure in Sect. 3 as well as the statistical consistency of the estimation in Sect. 4. Furthermore, under Gaussian assumptions, we use the EM algorithm to obtain a maximum-likelihood estimate of the objective function, as shown in Sect. 5. The performance of the algorithms is illustrated by numerical examples in Sect. 6.
Notations We denote the cone of n × n positive semidefinite matrices by S^n_+, while S^n_++ denotes the set of positive definite matrices. ‖ ⋅ ‖ denotes the l_2 norm of a vector and ‖ ⋅ ‖_F the Frobenius norm of a matrix. tr(⋅) denotes the trace of a matrix, so that tr(H_1^T H_1) = ‖H_1‖_F^2. ⊗ denotes the Kronecker product and [H]_i denotes the ith row of matrix H.

Problem formulation
The "forward" optimal LQ problem reads

min_{u_{1:N−1}} ∑_{t=1}^{N−1} (x_t^T Q x_t + u_t^T R u_t) + x_N^T S x_N,  s.t. x_{t+1} = A x_t + B u_t,

where S, Q ∈ S^n_+, R ∈ S^m_++, x_t ∈ ℝ^n and u_t ∈ ℝ^m. Suppose the noisy output y_t = C x_t + v_t is available, where v_t ∈ ℝ^l denotes the noise term, y_t ∈ ℝ^l and the matrices (A, B, C) are known. For the sake of simplicity, we only consider the case in which R = I and S = 0 in this paper.
The IOC problem considered in this paper aims to find Q in the objective function using: 1. the exact samples of initial value x 1 =x and the noisy output y 2∶N , or 2. the noisy output y 1∶N under Gaussian assumption.
We further assume that (A, B) is controllable and B has full column rank. Moreover, since a discrete-time system is usually sampled from a continuous-time linear system ẋ = Â x + B̂ u, we have A = e^{Â Δt} and B = ∫_0^{Δt} e^{Â τ} B̂ dτ. Since the matrix exponential is always invertible, it is reasonable to further suppose that A is invertible.
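To make this setup concrete, the sketch below discretizes an illustrative continuous-time pair (Â, B̂) via the standard augmented matrix-exponential trick and then computes the time-varying optimal gains K_t by the backward discrete Riccati recursion with R = I and S = 0. The matrices and function names here are our own placeholders, not objects defined in the paper.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative continuous-time system (A_c, B_c are arbitrary placeholders).
A_c = np.array([[0.0, 1.0], [-1.0, 0.5]])
B_c = np.array([[0.0], [1.0]])
dt = 0.1

# Zero-order-hold discretization: A = e^{A_c dt}, B = int_0^{dt} e^{A_c s} ds B_c,
# both obtained at once from the exponential of an augmented matrix.
n, m = B_c.shape
M = np.zeros((n + m, n + m))
M[:n, :n], M[:n, n:] = A_c, B_c
Md = expm(M * dt)
A, B = Md[:n, :n], Md[:n, n:]

def finite_horizon_gains(A, B, Q, N):
    """Backward discrete Riccati recursion with R = I and terminal weight S = 0.
    Returns the time-varying optimal feedback gains K_1, ..., K_{N-1}."""
    P = np.zeros_like(A)                      # P_N = S = 0
    gains = []
    for _ in range(N - 1):
        K = -np.linalg.solve(np.eye(B.shape[1]) + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)         # DRE backward step
        gains.append(K)
    return gains[::-1]                        # reorder as K_1, ..., K_{N-1}

K_seq = finite_horizon_gains(A, B, np.eye(2), N=40)
```

Note that since P_N = 0, the last gain K_{N−1} is zero, which is the expected boundary behavior of the finite-horizon recursion.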

Model identifiability
Before moving on, we would like to investigate a fundamental question: given the observation data, can we uniquely reconstruct the parameter Q in the objective function? This question actually involves two aspects: identifiability of the model structure and persistent excitation. Identifiability is a property of the model structure that determines whether the model structure can generate the same output using two different parameters, while persistent excitation of the input signal ensures that we can actually recover the undetermined parameters of an identifiable model from the measured output. Hence, before any further analysis, we need to define the model structure for the IOC problem.
A model structure is simply a parameterized candidate model that describes the input-output relationship [17]. In the IOC problem just formulated, we regard the initial value x̄ as the input signal and the system output y_t as the output signal. Therefore, the model structure M_{2:N−1}(Q) considered in this paper is defined as the following: where A_cl(k; Q) is the closed-loop system matrix generated by the "forward problem" (1) and (2).
We further make the following assumption for the system:

Assumption 1 The discrete-time linear system defined by (A, B, C) is a square system, i.e., l = m, and has relative degree (r_1, …, r_m). Namely,

c_i A^k B = 0 for k = 0, …, r_i − 2, and c_i A^{r_i−1} B ≠ 0, i = 1, …, m,

where c_i denotes the ith row of C.
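As a sketch, the relative degree of each output can be checked numerically as follows: r_i is the smallest r such that c_i A^{r−1} B ≠ 0 while c_i A^k B = 0 for all k ⩽ r − 2. The helper below is an illustrative utility of ours, not part of the paper.

```python
import numpy as np

def relative_degrees(A, B, C, tol=1e-12):
    """Return the relative degree r_i of each output row c_i of C:
    the smallest r with c_i A^{r-1} B != 0 (and c_i A^k B = 0 for k < r-1).
    Returns None for an output with no relative degree up to order n."""
    n = A.shape[0]
    degs = []
    for c in C:
        r, Ak = None, np.eye(n)
        for k in range(1, n + 1):
            if np.any(np.abs(c @ Ak @ B) > tol):
                r = k
                break
            Ak = Ak @ A
        degs.append(r)
    return degs
```

For instance, a discrete double integrator with output c = [1, 0] has relative degree 2 (c B = 0 but c A B ≠ 0), while the output c = [0, 1] has relative degree 1.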
We first look into more details of the optimal control sequences.

Theorem 1 Let Q_1, Q_2 ∈ S^n_+ with Q_1 ≠ Q_2, and let K_t(Q_1), K_t(Q_2) denote the corresponding optimal feedback gains. Then K_t(Q_1) = K_t(Q_2) cannot happen consecutively for more than n − 1 steps.
Proof Recall that the optimal solution to (2) is derived by the discrete-time Riccati Equation (DRE) whose solution sequence is denoted by DRE(Q, S). We then prove the theorem by contradiction.
Using (4) and the fact that A is invertible, B^T ΔP_{t+1} = 0 can be computed recursively for t = t_1 − n + 1 : t_1 and stacked into the following compact form: Since (A, B) is controllable, the coefficient matrix has full column rank. Hence, ΔP_{t_1+1} is the unique solution of this linear equation.
On the other hand, consider the following algebraic equation: We can apply the same techniques to (5) and arrive at the same linear equation, to which ΔP is a solution. Since that equation has a unique solution, it follows that ΔP_{t_1+1} = ΔP.
Using (5), define P̄_t := P′_t + ΔP; in particular, P̄_N = P′_N + ΔP = ΔP. Recall that on the time interval t = t_1 − n + 2 : t_1 + 1, P_t and P̄_t satisfy the same DRE with weighting matrix Q and boundary P̄_{t_1+1} = P_{t_1+1} = P′_{t_1+1} + ΔP. Then, based on the backward recursion of the DRE, it is obvious that P_t = P̄_t for all 1 ⩽ t ⩽ t_1 + 1. This means that P_t and P̄_t are generated by the same forward Riccati equation with initial condition P_1. We further show uniqueness of the solution of the forward DRE for t ∈ [1, N − 1] by considering its adjoint system (6), whose solution can be uniquely determined for all t ⩾ 1 using the fact that A is nonsingular. Recall that P_t = DRE(Q, 0) and let {X_t} be the regular matrix solution of its closed-loop system X_{t+1} = (A + BK_t)X_t with X_1 = I. Since A + BK_t is always nonsingular [18], the nonsingular X_t together with Y_t = P_t X_t compose the unique solution to (6). Therefore, for the given P_1, there exists a unique solution Y_t X_t^{−1} that satisfies the forward DRE for t = 1 : N, which contradicts the fact that P̄_N = ΔP ≠ P_N. Hence the claim follows. ◻

Now we are ready to present the theorem for the identifiability of the model structure M_{2:N−1}(Q).
Proof We prove the statement by contradiction. Suppose two different weighting matrices Q_1 ≠ Q_2 could generate the same model structure, namely, for the model structure M_{t,i}(Q) that corresponds to the ith system output at time instant t. By Theorem 1, there exists some time instant t′ at which K_{t′}(Q_1) ≠ K_{t′}(Q_2); canceling the product of closed-loop system matrices in the above equation, we get Since L is invertible by definition, it holds that K_{t′}(Q_1) = K_{t′}(Q_2), and we have a contradiction. Hence the model structure is globally identifiable at Q_1. Since Q_1 is arbitrarily chosen, the model structure is strictly globally identifiable. ◻

Remark 1
In the above theorem, it is easy to see that it is actually sufficient for the time-horizon length to satisfy N ⩾ n + max_i(r_i) for the model structure to be strictly globally identifiable. However, we state the theorem as it is to make it easier to comprehend. On the other hand, for the case l > m (the output dimension is greater than the input dimension), we just need to assume that there is a subset of outputs that makes the system have relative degree (r_1, …, r_m).
Here, we would like to remark that observability of the system is not assumed in our problem formulation. This is due to the fact that lack of observability can occur in various situations, such as sensor unavailability, data shared under privacy constraints, uncertain environments, or tasks with a limited number of descriptive features, and it is not necessary to impose reconstructability of the state from noisy observations. On the other hand, for the identifiability of the IOC problem, it is required that the influence of different cost functions be reflected sufficiently in the observation sequences. More specifically, from the proof of Theorem 2, we can see that the existence of a relative degree plays a key role in guaranteeing the distinguishability of the system output under different parameter specifications, namely, ȳ_t(Q, x̄) = ȳ_t(Q̄, x̄) for t = 2, …, N if and only if Q = Q̄. It means that under such an assumption, the LQR problems would generate exactly the same output sequences in the noiseless case only when the same weighting matrix Q is used. However, such a property cannot always be guaranteed for systems without a relative degree, as illustrated by the following example.
Example 1 Consider the following controllable square system which, however, does not have a relative degree. For the quadratic cost function, we take N = 5. We can show that the following weighting matrices Q and Q̄ generate the same output sequences ȳ_t. First, we check the first output. Straightforward computation gives By computing K_t from the DRE, one can easily show that ȳ_{t,1}(Q, x̄) = ȳ_{t,1}(Q̄, x̄) for any x̄ and t = 1, …, N, since all the coefficient matrices of x̄ on the right-hand side of the above equations are equal for K_t(Q) and K_t(Q̄). The same claim also holds for ȳ_{t,2}, and the statement follows. ◻

IOC using exact initial values and noisy output
We first consider the IOC problem in which exact samples of the initial value x 1 =x and noisy output y 2∶N are available.
Suppose that x̄ ∈ ℝ^n and {v_t ∈ ℝ^l}_{t=2}^N are independent random vectors with unknown distributions. We further make the following assumptions:

Assumption 2 It holds that N ⩾ 2n, l ⩾ m, and there exists a subset of outputs that makes the system have relative degree (r_1, …, r_m).

Assumption 3 (Persistent excitation)
is the open -ball centered at r .

Assumption 4
The "real" Q̄ belongs to a compact subset of S^n_+.

With the assumptions above, the forward LQR problem (1) and (2) can actually be expressed as where Ω is the sample space. Hence the optimal trajectory {x*_t} as well as the observed system output {y*_t = Cx*_t + v_t} are random vectors implicitly determined by the random variable x̄ and the parameter Q. Based on (7) we can define a risk function as where {x̄*_t} is the optimal trajectory to (7). We minimize the risk function to obtain an estimate of the Q matrix, namely, The main question now is whether the risk minimization problem has a unique solution. For a given closed-loop system matrix A + BK_t of a finite-horizon LQR problem, the uniqueness of the corresponding quadratic cost function has been studied for discrete-time systems [18] and continuous-time systems [19], respectively. However, additional difficulties arise when we can only measure the system output. The following lemma shows the uniqueness of the solution to (10) under the assumption of relative degrees.
Lemma 1 Under Assumptions 2-5, the risk minimization problem (10) has a unique optimizer Q̄, where Q̄ is the true value used in the forward problem (7).
Proof It is clear that Q = Q̄ minimizes the risk function R(Q). What remains is to show uniqueness.
Recall that by our assumption, there exists a subset of outputs that makes the system have relative degree (r_1, …, r_m). On the other hand, it holds that ȳ_t(Q, x̄) = M_t(Q)x̄. Note that By Theorem 2, we know that given N ⩾ 2n, if Q ≠ Q̄, then there exist some time instants t_i such that M_t(Q) ≠ M_t(Q̄). On the other hand, by Assumption 3, Hence, Q̄ is the unique minimizer to (8) and the statement follows. ◻

The risk minimization problem is expected to be solved numerically. To tackle the issue that the distributions of x̄ and v_{2:N} are unknown, we approximate (8) using the empirical mean We first consider the optimal control problem (7). Since what we consider is a forward LQR problem, the PMP conditions are also sufficient. Hence, we can express x*_{2:N} in (9) using (12). Thus, the approximated risk minimization problem reads The superscript "star" is omitted for simplicity. Note that the optimizer (Q*_M(ω), x^{(i)*}_{2:N}(ω), λ^{(i)*}_{2:N}(ω)) is stochastic and is optimal in the sense that it optimizes (11) for every ω ∈ Ω.
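As a concrete illustration of the empirical risk, the sketch below evaluates, for a candidate Q, the sample-average squared mismatch between the noiseless closed-loop output and the observed outputs. This is a simplified stand-in that simulates the forward LQR directly rather than going through the PMP-constrained reformulation; all function names are ours, under the standing assumptions R = I, S = 0.

```python
import numpy as np

def dre_gains(A, B, Q, N):
    """Backward Riccati recursion (R = I, S = 0): gains K_1, ..., K_{N-1}."""
    P = np.zeros_like(A)
    gains = []
    for _ in range(N - 1):
        K = -np.linalg.solve(np.eye(B.shape[1]) + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
        gains.append(K)
    return gains[::-1]

def empirical_risk(Q, A, B, C, x0s, ys):
    """Empirical counterpart of the risk function: for a candidate Q,
    simulate the noiseless optimal output and average the squared mismatch
    with the observations. ys[i][t] holds observation y_{t+2} of sample i."""
    N = len(ys[0]) + 1
    Ks = dre_gains(A, B, Q, N)
    loss = 0.0
    for x0, y in zip(x0s, ys):
        x = x0
        for t in range(N - 1):
            x = (A + B @ Ks[t]) @ x      # closed-loop optimal trajectory
            loss += np.sum((y[t] - C @ x) ** 2)
    return loss / len(x0s)
```

In practice, the candidate Q can be kept positive semidefinite by parametrizing Q = LL^T and the resulting function handed to a standard nonlinear solver (e.g. scipy.optimize.minimize), mirroring the role fmincon plays in Sect. 6.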
Based on Lemma 1, we could then show the statistical consistency of the approximated risk minimization problem (13).
where Q̄ is the true value used in the "forward" problem (7).

Proof Equation (12) can be written as the following implicit dynamics Therefore, we can rewrite (12) in the following compact matrix form as where Note that for an arbitrary Q ∈ S^n_+, since (14) is a sufficient and necessary condition for the optimality of the corresponding forward LQR problem, which in turn has a unique solution, F(Q) must be invertible for all Q ∈ S^n_+. Thus, it follows that Hence f(Q; ω) can be rewritten as where G = [I_{N−1} ⊗ C, 0_{l×n}] and Y = [y_2^T, …, y_N^T]^T. It can be shown in a similar way to [5] that the uniform law of large numbers [20] applies, namely, In addition to (15), by Lemma 1 we have also shown that Q̄ is the unique optimizer to (10); hence Q*_M →^p Q̄ follows directly from Theorem 5.7 in [21]. ◻

IOC under Gaussian assumption
In this section, we consider the case in which exact samples of the initial values are not available and we can only observe the noisy measurement y_1 = Cx_1 + v_1. Recall that in Sect. 3 we regarded the initial value as the "input signal" of the model structure and the system output observation as the "output signal". This means that in this scenario the problem is actually an errors-in-variables system identification problem, which is hard in general.
To tackle this problem, we have the following assumption in the remainder of this section.
Assumption 6 x_1 ∼ N(m_1, Σ_1), v_t ∼ N(0, Σ_v) and v_{1:N} are white. Σ_v is known, but m_1 and Σ_1 are not a priori known.
Recall that in the proof of Theorem 3, we are able to write Namely, it holds that where G_t = [G]_{l(t−1)+1:lt}. We can regard y_{1:N} as the output of the following system: This means that the IOC problem can be viewed as a system identification problem for a linear Gaussian model under Assumption 6, which can be solved using the maximum log-likelihood method. If M sets of output sequences are available, we have M i.i.d. samples; since x^(i)_t depends only on the initial value and Q, y^(i)_t is independent of y^(j)_t for all i ≠ j. The EM algorithm will be used to obtain a maximum log-likelihood estimate of Q. Based on (16), x_1 is chosen as the latent variable and we use θ to parametrize Q, m_1 and Σ_1. Now we are ready to compute the auxiliary function Q(θ, θ_k) for the EM algorithm.

Proof By Bayes' rule, it follows that
Hence by the definition of the auxiliary function Q(θ, θ_k), we have Note that y^(i) is Gaussian with noise covariance Σ_v, which has nothing to do with θ. By using the cyclic permutation property of the matrix trace and the fact that where the constant and irrelevant terms are ignored. ◻

We can obtain the mean and covariance of the initial value, x̂^(i)_{1|N} and P^(i)_{1|N}, by fixed-point smoothing [22]. Each iteration of the EM algorithm involves maximizing Q(θ, θ_k) with respect to θ. Since m_1 and Σ_1 only appear in Q_1(θ, θ_k) while Q appears only in Q_2(θ, θ_k), they can be optimized separately. The first-order optimality conditions for maximizing Q_1(θ, θ_k) with respect to m_1 and Σ_1 read: Solving (19) and (20), we get Since (21) and (22) give the unique stationary point of Q_1(θ, θ_k), it is also the global maximizer of Q_1(θ, θ_k). In contrast, it is not possible to obtain an analytic solution for the optimizer of Q_2(θ, θ_k), so we have to solve for it numerically. The problem of maximizing Q_2(θ, θ_k) can be rewritten as where Using the function f proposed in [23] and applying the same "trick" mentioned at the end of Sect. 4, we can use standard nonlinear optimization solvers to solve (23).
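As an illustration of the separable M-step, closed-form updates of m_1 and Σ_1 in the spirit of (21) and (22) take the standard Gaussian-EM form below, assuming the smoothed posterior means and covariances of x_1 for each trajectory have already been obtained by fixed-point smoothing. The function name is ours, not the paper's.

```python
import numpy as np

def m_step_initial_state(x_smooth, P_smooth):
    """Closed-form M-step updates for the initial-state mean and covariance:
    m_1 is the average of the smoothed means, and Sigma_1 averages the
    smoothed covariances plus the outer products of the deviations."""
    M = len(x_smooth)
    m1 = sum(x_smooth) / M
    Sigma1 = sum(P + np.outer(x - m1, x - m1)
                 for x, P in zip(x_smooth, P_smooth)) / M
    return m1, Sigma1
```

When the smoothed covariances are zero, the updates reduce to the ordinary sample mean and (biased) sample covariance of the smoothed initial states, which is the expected sanity check.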

Numerical examples
In our numerical simulation, a group of discrete-time linear systems sampled from continuous-time linear systems ẋ = Â x + B̂ u with sampling period Δt = 0.1 are generated, where and a_1, a_2, c_1, c_2 are randomly generated from N(0, 3), respectively. However, we need to screen for satisfaction of the relative degree condition. The "real" Q̄ is generated by letting Q̄ = Q̃Q̃^T, where each element of Q̃ is sampled from the uniform distribution on [−1, 1], and the time-horizon length N is set to 40. The feasible compact set for Q is set as S̄^n_+(5), and any randomly generated Q̄ that does not belong to S̄^n_+(5) is discarded and re-generated until we have 200 groups of randomly generated "forward" problems. Each element of the initial conditions x̄^(1:M) is sampled from a uniform distribution supported on [−5, 5]. For each forward problem, 200 trajectories are generated, i.e., M = 200. We add white Gaussian noise at 20 dB SNR to Cx^(1:M)_{2:N} to obtain y^(1:M)_{2:N}. The MATLAB function fmincon is used to solve the risk minimization problem. We use Q = I as the initial value when solving the optimization problem in MATLAB for all IOC estimations. Denote the estimate of Q by Q_est.
From Fig. 1, we see that the relative error ‖Q_est − Q̄‖_F / ‖Q̄‖_F decreases markedly as M increases. This empirically supports the statistical consistency established in Theorem 3.
Next, the results of IOC estimation under Gaussian assumptions are illustrated. Here, 100 groups of forward problems are generated, with 50 trajectories for each group, i.e., M = 50. The initial values are randomly generated by sampling from N(m_1 = 0, Σ_1 = 10I). White Gaussian noise with v_t ∼ N(0, Σ_v = 0.01I) is added to Cx^(1:M)_{1:N} to get y^(1:M)_{1:N}. We use m_1 = 1, Σ_1 = 20I and Q = I as the initial guesses for the EM algorithm for all groups. The stopping criterion for the iteration is ‖Q_{k+1} − Q_k‖_F ⩽ 10^{−4}. The results are shown in Fig. 2.
As illustrated in Fig. 2, the relative error ‖Q_est − Q̄‖_F/‖Q̄‖_F is small in most cases, but a few estimates have large relative errors. This is due to the fact that the EM algorithm only guarantees that the likelihood does not decrease; it guarantees neither finding the global maximizer of the log-likelihood function nor the convergence of Q_k to a local maximum. Mitigating these issues will be our future work.

Conclusions
In this paper, IOC for finite-horizon discrete-time LQRs is studied. We first investigate the identifiability of the model structure, and conditions for identifiability are given. Next, the IOC problem is considered under two scenarios. We first consider the case where the distributions of the initial state and the observation noise are completely unknown, yet the initial states and noisy system outputs are available. The problem is formulated as a risk minimization problem, and by utilizing the identifiability property we prove the statistical consistency of the estimation method. We then consider the case where the initial states are not available, but the distributions of the initial state and the observation noise are Gaussian. A maximum-likelihood estimate is obtained via the EM algorithm. The performance of our methods is illustrated by numerical examples.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.