Kernel methods for the approximation of discrete-time linear autonomous and control systems

Methods from learning theory are used in the state space of linear dynamical and control systems in order to estimate relevant matrices and some relevant quantities such as the topological entropy. An application to stabilization via algebraic Riccati equations is included by viewing a control system as an autonomous system in an extended space of states and control inputs. Kernel methods are the main techniques used in this paper and the approach is illustrated via a series of numerical examples. The advantage of using kernel methods is that they allow to perform function approximation from data and, as illustrated in this paper, allow to approximate linear discrete-time autonomous and control systems from data.


Introduction
This paper discusses several problems in dynamical systems and control, where methods from learning theory are used in the state space of linear systems. This is in contrast to previous approaches in the frequency domain [10,11]. We refer to [11] for a general survey on applications of machine learning to system identification where similar problems have been treated using different techniques.
Basically, learning theory allows to deal with problems when only data from a given system are given. Reproducing Kernel Hilbert spaces (RKHS) allow to work in a very large dimensional space in order to simplify the underlying problem. We will discuss this in the simple case when the matrix A describing a linear discrete-time system is unknown, but a time series from the underlying linear dynamical system is given. We propose a method to estimate the underlying matrix using kernel methods. Applications are given in the stable and unstable case and for estimating the eigenvalues and topological entropy for a linear map. Furthermore, in the control case, estimation of the relevant matrices for a linear control system is done by viewing a linear control system as a dynamical system in the extended space of states and control inputs. Stabilization via linear-quadratic optimal control is discussed.
The emphasis of the present paper is on the formulation of a number of problems in dynamical systems and control and to illustrate the applicability of our approach via a series of numerical examples. This paper should be viewed as a preliminary step to extend these results to nonlinear discrete-time systems within the spirit of [3,4] where the authors showed that RKHSs act as "linearizing spaces" and that this approach offers tools for a data-based theory for nonlinear (continuous-time) dynamical systems. The approach used in these papers is based on embedding a nonlinear system in a high (
The contents are as follows: In Sect. 2, the problem is stated formally and an algorithm based on kernel methods is given for the stable case. In Sect. 3, the algorithm is extended to the unstable case. In particular, the topological entropy of linear maps is computed (which boils down to computing unstable eigenvalues). In Sect. 4, identification of linear control systems is considered and Sect. 5 discusses their stabilization. Every section contains several numerically computed examples (via MATLAB) illustrating the approach. Section 6 draws some conclusions from the numerical experiments. For the reader's convenience we have collected in the appendix basic concepts from learning theory as well as some hints to the relevant literature. A preliminary version of this article appeared in Hamzi and Colonius [9] 2 Statement of the problem Consider the linear discrete-time system where A = [a i,j ] ∈ ℝ n×n . We want to estimate A from the time series x(1) + 1 , … , x(N) + N where the initial condition x(0) is known and i are distributed according to a probability measure x that satisfies the following condition (this is the Special Assumption in [18]).
Assumption The measure x is the marginal on X = ℝ n of a Borel measure on X × ℝ with zero mean supported One obtains from (1) for the components of the time series that For every i we want to estimate the coefficients a ij , j = 1, … , n . They are determined by the linear maps f * i ∶ ℝ n → ℝ given by This problem can be reformulated as a learning problem as described in the "Appendix" where f * i in (3) plays the role of the unknown function (74) and (x(k), x i (k + 1) + i ) are the samples in (76).
We note that in [18], the authors do not consider time series and that we apply their results to time series.
In order to approximate f * i , we minimize the criterion in (79). For a positive definite kernel K, let f i be the kernel expansion of f * i in the corresponding RKHS H K . Then f i = ∑ ∞ j=1 c i,j j with certain coefficients c ij ∈ ℝ and where ( j , j ) are the eigenvalues and eigenfunctions of the integral operator L K ∶ L 2 (X) → C(X) given by with a Borel measure on X . Thus L K j = j j for j ∈ ℕ * and the eigenvalues j ≥ 0.
Then we consider the problem of minimizing over Since we are dealing with a linear problem, it is natural to choose the linear kernel k(x, y) = ⟨x, y⟩ . Then the solution of the above optimization problem is given by the kernel expansion of where the c ij satisfy the following set of equations: with This is a consequence of Theorem 2.
From (2), we have Then an estimate of the entries of A is given by This discussion leads us to the following basic algorithm. Algorithm A If the eigenvalues of A are all within the unit circle, one proceeds as follows in order to estimate A. Given the time series x(1), … , x(N) solve the system of Eq. (7) to find the numbers c ij and then compute â i from (9).
Before we present numerical examples and modifications and applications of this algorithm, it is worthwhile to note the following preliminary remarks indicating what may be expected.
The stability assumption in algorithm A is imposed, since otherwise the time series will diverge exponentially. Then, already for a moderately sized number of data points ( N ≈ 10 2 ) Eq. (7) will be ill conditioned. Hence for unstable A, modifications of algorithm A are required.
While for test examples one can compare the entries of the matrix A and its approximation Â , it may appear more realistic to compare the values x(1), … , x(N) of the data series and the values x(1), … ,x(N) generated by the iteration of the matrix Â .
In general, one should not expect that increasing the number of data points will lead to better approximations of the matrix A. If the matrix A is diagonalizable, for generic initial points x(0) ∈ ℝ n the data points x(k) will approach, for N → ∞ , the eigenspace for the eigenvalue with maximal modulus. For general A and generic initial points x(0) ∈ ℝ n , the data points x(N) will approach for, N → ∞ , the largest Lyapunov space (i.e., the sum of the real generalized eigenspaces for eigenvalues with maximal modulus). Thus in the limit for N → ∞ , only part of the matrix can be approximated. A detailed discussion of this (well known) limit behavior is, e.g., given in Colonius and Kliemann [6]. A consequence is that a medium length of the time series should be adequate.
This problem can be overcome by choosing the regularization parameter in (5) and (7) using the method of cross validation described in [12]. Briefly, in order to choose , we consider a set of values of regularization parameters: we run the learning algorithm over a subset of the samples for each value of the regularization parameter and choose the one that performs the best on the remaining data set. Cross validation helps also in the presence of noise and to improve the results beyond the training set.
A theoretical justification of our algorithm is guaranteed by the error estimates in Theorem 5. In fact, for the (9) a i = N ∑ j=1 c i,j x (j).
linear dynamical system (1), we have that f * in (74) is the linear map f * (x) = f i (x) in (3) and the samples in (76) are (x(k), x i (k + 1) + i ) . Moreover, by choosing the linear kernel k(x, y) = ⟨x, y⟩ we get that f * ∈ H K . In this case, (84) has the form See [3] for more details about error estimates in the general nonlinear case. The first term in the right hand side of inequality (10) represents the error due to the noise (sampling error) and the second term represents the error due to regularization (regularization error) and the finite-number of samples (integration error).
Next, we discuss several numerical examples, beginning with the following scalar equation.
. Applying algorithm A with the regularization parameter = 10 −6 we compute ̂= 0.4997 . Using cross validation, we get that ̂= 0.5 with regularization parameter = 1.5259 × 10 −5 . When we introduce an i.i.d perturbation signal i ∈ [− 0.1, 0.1] , the algorithm does not behave well when we fix the regularization parameter. With cross validation, the algorithm works quite well and the regularization parameter adapts to the realization of the signal i . Here, for e(k) = x(k) −x(k) with x(k + 1) = x(k) and x(k + 1) =̂x(k) , we get that We observe an analogous behavior of the algorithm when the data are generated from x(k + 1) = x(k) + x(k) 2 where the algorithm works well in the presence of noise and structural perturbations when using cross validation. When = 0.1 and with an i.i.d perturbation signal i ∈ [− 0.1, 0.1] , ̂ varies between 0.38 and 0.58 depending on the realization of i but i=100 e 2 (i) = 2.8098 × 10 −30 which shows that the error e decreases exponentially and the generalization properties of the algorithm are quite good.

Unstable case
Consider where some of the eigenvalues of A are outside the unit circle. Again, we want to estimate A when the following data are given, which are generated by system (16), thus x(k) = A k−1 x(1).
As remarked above, a direct application of the algorithm A will not work, since the time series diverges fast. Instead, we construct a new time series from (17) associated to an auxiliary stable system.
For a constant > 0 we define the auxiliary system by Thus and with y(1) = x(1) one finds If we choose > 0 such that the eigenvalues of A are in the unit circle, we can apply algorithm A to this stable matrix and hence we would obtain an estimate of A and hence of A. However, since the eigenvalues of the matrix A are unknown, we will be content with a somewhat weaker condition than stability of A . The data (17) for system (16) yield the following data for system (18): We propose to choose as follows: Define Clearly, the inequality ≤ ‖A‖ holds. We apply algorithm A to the time series y(k). This yields an estimate of A and hence an estimate Â of A.
For general A, this choice of certainly does not guarantee that the eigenvalues of A are within the unit circle. However, as mentioned above, a generic data sequence x(k), k ∈ ℕ , will converge to the eigenspace of the eigenvalue with maximal modulus. Hence ‖x(k+1)‖ ‖x(k)‖ will approach the maximal modulus of an eigenvalue, thus this choice of will lead to a matrix A which is not "too unstable".
We observe the same behavior in higher dimensional systems where the eigenvalues are of the same order of magnitude. The algorithm fails in the presence of quadratic structural perturbations. This is due to the choice of a linear kernel. A polynomial kernel, for example, would allow for nonlinear perturbations but this would require a complete reformulation of our algorithm. We leave the extension of our algorithm to the nonlinear case for future work.
The next example is an unstable system with a large gap between the eigenvalues.
With the initial condition x(0) = [− 1.9, 1] , we generate the time series x(1), … , x(100) . The algorithm above yields the (excellent) estimate In the presence of noise of maximal amplitude 10 −4 , the algorithm approximates well only the large entry a 11 = 20 : For a first realization of i and with cross validation, we get with 1 = 1.5259 × 10 −5 and 2 = 2 20 . However another realization of i leads to with 1 = 3.0518 × 10 −5 and 2 = 2.8147 × 10 14 . This is due to the fact that the data converge to the eigenspace generated by the largest eigenvalue = 20 . However, the eigenvalues of A −Â are within the unit disk with small amplitude which guarantees that the error dynamics of e(k) = x(k) −x(k) converges to the origin quite quickly. We observe the same phenomenon with Here, in the absence of noise, we obtain the estimate with 1 = 2 = 0.9313 × 10 −9 . In the presence of noise i with amplitude 10 −4 , the data converge to the eigenspace corresponding to the largest eigenvalue = 25 : for some realization of i one obtains the estimate The regularization parameters 1 and 2 adapt to the realization of the noise.
As already remarked in the end of Sect. 2, we see that "more data" does not always necessarily lead to better results, since the data sequence converges to the eigenspace generated by the largest eigenvalue. However, whether with or without noise, the approximations of A are good enough to reduce the error between x(k + 1) = Ax(k) and x(k + 1) =Âx(k) outside of the training examples, since cross-validation determines a good regularization parameter that balances between good fitting and good prediction properties.
The next example has an eigenvalue on the unit circle.     Here we used i = 10 −12 , i = 1, … , 4 . Moreover, the eigenvalues of A −Â are quite small and such that the error dynamics converges quickly to the origin. In the presence of noise , the algorithm approximates the largest eigenvalues of A but does not approximate the smaller (stable) ones. For example, for a particular realization of noise with amplitude 10 −4 , we get the estimate The algorithm introduced above also allows us to compute the topological entropy of linear systems, since it is determined by the unstable eigenvalues. Recall that the topological entropy of a linear map on ℝ n is defined in the following way: Fix a compact subset K ⊂ ℝ n , a time ∈ ℕ and a constant > 0 . Then a set R ⊂ ℝ n is called ( , )-spanning for K if for every y ∈ K there is x ∈ R with By compactness of K, there are finite ( , )-spanning sets. Let R be a ( , )-spanning set of minimal cardinality #R = r min ( , , K ) . Then (the limits exist). Finally, the topological entropy of A is where the supremum is taken over all compact subsets K of ℝ n .
A classical result due to Bowen (cf. [21,Theorem 8.14]) shows that the topological entropy is determined by the sum of the unstable eigenvalues, i.e., where summation is over all eigenvalues of A counted according to their algebraic multiplicity.
Hence, when we approximate the unstable eigenvalues of A by those of the matrix Â , we also get an approximation of the topological entropy.

Identification of linear control systems
Consider the linear control system with A ∈ ℝ n×n and B ∈ ℝ n×1 . We want to estimate the matrices A and B from the time series x(1) + 1 , … , x(N) + N where satisfies the Assumption in Sect. 2. The initial condition x(0) and the control sequence u(0), … , u(N) are assumed to be known. In order to estimate A and B, we will extend algorithm A . The ith component of system (50) is given by For every i we want to estimate the coefficients b i and a ij , j = 1, … , n . Thus the linear map f i ∶ ℝ n → ℝ given by is unknown. To extend algorithm A , we will view system (51) as a system of the form (2) where the state x is the extended state x = (x, u) ∈ ℝ n × ℝ for (50). Hence, the kernel expansion (6) becomes where x n+1 = u and the c ij satisfy the following set of equations: with Let us emphasize that u = x n+1 does not appear on the left hand side of (53)-(54).
In reference to the case when A has eigenvalues outside the unit circle, we adopt the same method as in Sect. 3 and define (50) x(k + 1) = Ax(k) + Bu(k), Example 11 (Three dimensional unstable case) Consider control system (50) with The input u(k) = sin(k) + cos(k) and 100 points give the estimates (60) Here cross validation yields the regularization parameters These results show that algorithm A works quite well in these cases.

Stabilization via linear-quadratic optimal control
A basic problem for linear control systems is stabilization by state feedback. A standard method is to use linear quadratic optimal control, where the feedback is computed using the solution of an algebraic Riccati equation. In this section, we propose to replace in the algebraic Riccati equation the system matrix A by the estimate Â obtained by learning theory. The linear quadratic optimal control problem has the following form: Minimize over all (continuous) inputs u with x(⋅) given by here Q ∈ ℝ n×n is positive semidefinite and R ∈ ℝ m×m is positive definite, and A ∈ ℝ n×n , B ∈ ℝ n×m . Consider the discrete algebraic Riccati equation DARE Obviously, every solution P is positive semi-definite. We cite the following theorem from [1]. Theorem Suppose that for every x 0 ∈ ℝ n there is an input u, such that J(x 0 , u) < ∞ . Then the following holds: (i) There is a unique solution P of the DARE. (ii) For every x 0 ∈ ℝ n one has J * (x 0 ) ∶= inf{J(x 0 , u)| u an input} = x ⊤ 0 Px 0 and there is a unique optimal input u * with J * (x 0 ) = J(x 0 , u * ) . This optimal input is generated by the feedback F = (R + B T PB) − (66) u(k) = −Fx(k), k ≥ 0.
In particular, the feedback F stabilizes the system, i.e., x(k + 1) = (A − BF)x(k) is stable. Now we use an estimate Â and B (obtained by kernel methods) instead of A and B in the algebraic Riccati equation and obtain the solution P . Will the corresponding feedback u =Fx ∶= −B ⊤P x also stabilize the system, i.e., is the following system stable: Example 12 Consider the one-dimensional system x(k + 1) = − 0.9x(k) + 3.5u in Example 9. In the absence of noise, we get Â = − 0.9 and B = 3.5 . We have that In this example the feedback based on the estimate also stabilizes the original system.

Example 14 Consider control system (50) with
As Example 11 illustrates, without noise we get excellent approximations of A and B. For the feedback system one finds When there is noise of amplitude 10 −4 , one computes the estimates This are bad approximations for A and B. Furthermore, for the feedback system one finds Thus the stabilizing controller for the approximate system does not stabilize the true system.

Conclusions
This paper has introduced the algorithm A based on kernel methods to identify a stable linear dynamical system from a time series. The numerical experiments give excellent results in the absence of noise and structural perturbations. In the presence of noise and structural perturbations the algorithm works well in the stable case. In the unstable case, a modified algorithm works quite well in the presence of noise but cannot handle structural perturbations. Then we have extended algorithm A to identify linear control systems. In particular, we have used estimates obtained by kernel methods to stabilize linear systems using linear-quadratic control and the algebraic Riccati equation. Here the numerical experiments seem to indicate that the same conclusions on applicability of the algorithm apply.
Extensions of the considered algorithms to nonlinear systems appear feasible and are left to future work.
Acknowledgements BH thanks the European Commission for financial support received through Marie Curie Fellowships.

Appendix: Elements of learning theory
In this section, we give a brief overview of Reproducing Kernel Hilbert Spaces (RKHS) as used in statistical learning theory. The discussion here borrows heavily from Cucker and Smale [7], Wahba [20], and Schölkopf and Smola [16]. Early work developing the theory of RKHS was undertaken by Schoenberg [13][14][15] and then Aronszajn [2]. Historically, RKHS came from the question, when it is possible to embed a metric space into a Hilbert space.

Definition 1
Let H be a Hilbert space of functions on a set X which is a closed subset of ℝ n . Denote by ⟨f , g⟩ the inner product on H and let ‖f ‖ = ⟨f , f ⟩ 1∕2 be the norm in H , for f and g ∈ H . We say that H is a reproducing kernel Hilbert space (RKHS) if there exists K ∶ X × X → ℝ such that (i) K has the reproducing proper t y, i.e., f (x) = ⟨f (⋅), K (⋅, x)⟩ for all f ∈ H. (ii) K spans H , i.e., H = span {K (x, ⋅)|x ∈ X}.
K will be called a reproducing kernel of H and H K will denote the RKHS H with reproducing kernel K. Definition 2 Given a kernel K ∶ X × X → ℝ and inputs x 1 , … , x n ∈ X , the n × n matrix is called the Gram Matrix of k with respect to x 1 , … , x n . If for all n ∈ ℕ and distinct x i ∈ X the kernel Kgives rise to a strictly positive definite Gram matrix, it is called strictly positive definite.

Definition 3 (Mercer kernel map) A function
K ∶ X × X → ℝ is called a Mercer kernel if it is continuous, symmetric and positive definite.

Remark 1
(i) The dimension of the RKHS can be infinite and corresponds to the dimension of the eigenspace of the integral operator L K ∶ L 2 (X) → C(X) defined as if K is a Mercer kernel, for f ∈ L 2 (X) and a Borel measure on X. (ii) In Theorem 1, and using property [iv.] in Proposition 1, we can take (x) ∶= K x ∶= K (x, ⋅) in which case F = H-the "feature space" is the RKHS. This is called the canonical feature map. (iii) The fact that Mercer kernels are positive definite and symmetric shows that kernels can be viewed as generalized Gramians and covariance matrices. (iv) In practice, we choose a Mercer kernel, such as the ones in [v.] in Proposition 1, and Theorem 1, that guarantees the existence of a Hilbert space admitting such a function as a reproducing kernel.
RKHS play an important role in learning theory whose objective is to find an unknown function from random samples In the following we review results from [18] (for a more general setting, cf. [7]) in the special case when the data samples are such that the following assumption holds.