Abstract
Methods from learning theory are used in the state space of linear dynamical and control systems in order to estimate relevant matrices and some relevant quantities such as the topological entropy. An application to stabilization via algebraic Riccati equations is included by viewing a control system as an autonomous system in an extended space of states and control inputs. Kernel methods are the main techniques used in this paper and the approach is illustrated via a series of numerical examples. The advantage of using kernel methods is that they make it possible to approximate functions from data and, as illustrated in this paper, to approximate linear discrete-time autonomous and control systems from data.
1 Introduction
This paper discusses several problems in dynamical systems and control, where methods from learning theory are used in the state space of linear systems. This is in contrast to previous approaches in the frequency domain [10, 11]. We refer to [11] for a general survey on applications of machine learning to system identification where similar problems have been treated using different techniques.
Basically, learning theory makes it possible to deal with problems in which only data from a given system are available. Reproducing kernel Hilbert spaces (RKHS) make it possible to work in a very high-dimensional space in order to simplify the underlying problem. We will discuss this in the simple case when the matrix A describing a linear discrete-time system is unknown, but a time series from the underlying linear dynamical system is given. We propose a method to estimate the underlying matrix using kernel methods. Applications are given in the stable and unstable case and for estimating the eigenvalues and topological entropy for a linear map. Furthermore, in the control case, estimation of the relevant matrices for a linear control system is done by viewing a linear control system as a dynamical system in the extended space of states and control inputs. Stabilization via linear-quadratic optimal control is discussed.
The emphasis of the present paper is on the formulation of a number of problems in dynamical systems and control and on illustrating the applicability of our approach via a series of numerical examples. This paper should be viewed as a preliminary step to extend these results to nonlinear discrete-time systems within the spirit of [3, 4] where the authors showed that RKHSs act as “linearizing spaces” and that this approach offers tools for a data-based theory for nonlinear (continuous-time) dynamical systems. The approach used in these papers is based on embedding a nonlinear system in a high (or infinite) dimensional reproducing kernel Hilbert space (RKHS) where linear theory is applied. To illustrate this approach, consider a polynomial in \(\mathbb {R}\), \(p(x)=\alpha + \beta x +\gamma x^2\) where \(\alpha , \beta , \gamma\) are real numbers. If we consider the map \(\phi {:}\, \mathbb {R}\rightarrow \mathbb {R}^3\) defined as \(\phi (x)=[1 \; x \; x^2]^{T}\) then \(p(x) = [\alpha \; \beta \; \gamma ] \, [1 \; x \; x^2]^{T}= [\alpha \; \beta \; \gamma ] \, \phi (x)\) is an affine polynomial in the variable \(\phi (x)\). Similarly, consider the nonlinear discrete-time system \(x(k+1)=x(k)+x^2(k)\). By rewriting it as \(x(k+1)=[1 \; 1] \left[ \begin{array}{c}x(k) \\ x(k)^2 \end{array} \right]\), the nonlinear system becomes linear in the variable \([x(k) \; x(k)^2]^{T}\).
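The lifting idea in the two examples above can be checked numerically. The following minimal Python sketch evaluates the polynomial as an inner product with \(\phi (x)\) and performs one step of the quadratic system through its lifted form. The function names and the coefficient values are illustrative only and are not taken from the paper.

```python
def phi(x):
    """Feature map phi: R -> R^3 with phi(x) = [1, x, x^2]."""
    return [1.0, x, x * x]


def p_direct(x, alpha, beta, gamma):
    """Evaluate p(x) = alpha + beta*x + gamma*x^2 directly."""
    return alpha + beta * x + gamma * x * x


def p_lifted(x, coeffs):
    """Evaluate p(x) as the inner product [alpha, beta, gamma] . phi(x)."""
    return sum(c * f for c, f in zip(coeffs, phi(x)))


def step_lifted(x):
    """One step of x(k+1) = x(k) + x(k)^2 written as [1 1] [x, x^2]^T."""
    lifted_state = [x, x * x]
    weights = [1.0, 1.0]
    return sum(w * z for w, z in zip(weights, lifted_state))


# Illustrative coefficients (hypothetical values).
coeffs = [2.0, -1.0, 0.5]
print(p_direct(3.0, *coeffs), p_lifted(3.0, coeffs))
print(step_lifted(0.5))
```

Both evaluations of the polynomial agree, and the lifted step returns the same value as the original nonlinear recursion.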
The contents are as follows: In Sect. 2, the problem is stated formally and an algorithm based on kernel methods is given for the stable case. In Sect. 3, the algorithm is extended to the unstable case. In particular, the topological entropy of linear maps is computed (which boils down to computing unstable eigenvalues). In Sect. 4, identification of linear control systems is considered and Sect. 5 discusses their stabilization. Every section contains several numerically computed examples (via MATLAB) illustrating the approach. Section 6 draws some conclusions from the numerical experiments. For the reader’s convenience we have collected in the appendix basic concepts from learning theory as well as some hints to the relevant literature. A preliminary version of this article appeared in Hamzi and Colonius [9].
2 Statement of the problem
Consider the linear discrete-time system
\[x(k+1)=Ax(k), \tag{1}\]
where \(A=[a_{i,j}]\in {\mathbb {R}}^{n\times n}\). We want to estimate A from the time series \(x(1)+\eta _{1}\), \(\ldots\), \(x(N)+\eta _{N}\) where the initial condition x(0) is known and \(\eta _{i}\) are distributed according to a probability measure \(\rho _{x}\) that satisfies the following condition (this is the Special Assumption in [18]).
Assumption The measure \(\rho _{x}\) is the marginal on \(X={\mathbb {R}} ^{n}\) of a Borel measure \(\rho\) on \(X\times {\mathbb {R}}\) with zero mean supported on \([-\,M_{x},M_{x}],M_{x}>0\).
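One obtains from (1) for the components of the time series that
\[x_{i}(k+1)=\sum _{j=1}^{n}a_{ij}x_{j}(k),\quad i=1,\ldots ,n. \tag{2}\]
For every i we want to estimate the coefficients \(a_{ij},j=1,\ldots ,n\). They are determined by the linear maps \(f^{*}_{i}{:}\,{\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) given by
\[f^{*}_{i}(x)=\sum _{j=1}^{n}a_{ij}x_{j}. \tag{3}\]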
This problem can be reformulated as a learning problem as described in the “Appendix” where \(f^{*}_{i}\) in (3) plays the role of the unknown function (74) and \((x(k),x_{i}(k+1)+\eta _{i})\) are the samples in (76).
We note that in [18], the authors do not consider time series and that we apply their results to time series.
In order to approximate \(f^{*}_{i}\), we minimize the criterion in (79). For a positive definite kernel K, let \(f_{i}\) be the kernel expansion of \(f^{*}_{i}\) in the corresponding RKHS \({\mathcal {H}}_{K}\). Then \(f_{i}=\sum _{j=1}^{\infty }c_{i,j}\phi _{j}\) with certain coefficients \(c_{ij}\in {\mathbb {R}}\) and
where \((\lambda _{j},\phi _{j})\) are the eigenvalues and eigenfunctions of the integral operator \(L_{K}{:}\,{\mathcal {L}}_{\nu }^{2}({\mathcal {X}})\rightarrow {\mathcal {C}}({\mathcal {X}})\) given by \((L_{K}f)(x)=\int K(x,t)f(t)d\nu (t)\) with a Borel measure \(\nu\) on \({\mathcal {X}}\). Thus \(L_{K}\phi _{j}=\lambda _{j}\phi _{j}\) for \(j\in {\mathbb {N}}^{*}\) and the eigenvalues \(\lambda _{j}\ge 0\).
Then we consider the problem of minimizing over \((c_{i,1},\) \(\ldots ,c_{i,N})\) the functional
where \(y_{i}(k):=x_{i}(k+1)+\eta _{i}=f^{*}_{i}(x(k))+\eta _{i}\) and \(\gamma _{i}\) is a regularization parameter.
Since we are dealing with a linear problem, it is natural to choose the linear kernel \(k(x,y)=\langle x,y\rangle\). Then the solution of the above optimization problem is given by the kernel expansion of \(x_{i}(k+1)\), \(i=1,\ldots ,n\),
where the \(c_{ij}\) satisfy the following set of equations:
with
This is a consequence of Theorem 2.
From (2), we have
Then an estimate of the entries of A is given by
This discussion leads us to the following basic algorithm.
Algorithm \({\mathcal {A}}\) If the eigenvalues of A are all within the unit circle, one proceeds as follows in order to estimate A. Given the time series \(x(1),\ldots ,x(N)\), solve the system of equations (7) to find the numbers \(c_{ij}\) and then compute \(\hat{a}_{i\ell }\) from (9).
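A minimal Python sketch of algorithm \({\mathcal {A}}\) with the linear kernel is given next. It assumes that the linear system for the coefficients is the usual regularized least-squares system \((K+N\gamma I)c_{i}=y_{i}\) with Gram matrix \(K_{jk}=\langle x(j),x(k)\rangle\), and that \(\hat{a}_{i\ell }=\sum _{j}c_{ij}x_{\ell }(j)\). The scaling by N, the function names and the test matrix are assumptions made for illustration, not the authors' MATLAB implementation.

```python
import numpy as np


def estimate_matrix(X, Y, gamma):
    """Estimate A from pairs (x(k), x(k+1)) with the linear kernel.

    X: array (N, n), row k is the sample x(k).
    Y: array (N, n), row k is the (possibly noisy) successor x(k+1).
    gamma: regularization parameter (same value for every component here).
    """
    N = X.shape[0]
    K = X @ X.T  # Gram matrix K[j, k] = <x(j), x(k)>
    # Assumed normal equations: (K + N*gamma*I) c_i = y_i for each component i.
    C = np.linalg.solve(K + N * gamma * np.eye(N), Y)  # column i holds c_{i,1..N}
    # a_hat[i, l] = sum_j c_{ij} x_l(j)
    return C.T @ X


def simulate(A, x0, steps):
    """Return the array with rows x(0), ..., x(steps) of x(k+1) = A x(k)."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        xs.append(A @ xs[-1])
    return np.array(xs)


# Hypothetical stable test matrix and initial condition.
A_true = np.array([[0.5, 0.1], [0.0, 0.8]])
traj = simulate(A_true, [1.0, -1.0], 20)
A_hat = estimate_matrix(traj[:-1], traj[1:], gamma=1e-10)
print(np.round(A_hat, 6))
```

For noise-free data and a small regularization parameter the estimate agrees with the test matrix up to a small regularization bias.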
Before we present numerical examples and modifications and applications of this algorithm, it is worthwhile to note the following preliminary remarks indicating what may be expected.
The stability assumption in algorithm \({\mathcal {A}}\) is imposed, since otherwise the time series will diverge exponentially. Then, already for a moderately sized number of data points (\(N\approx 10^{2}\)) Eq. (7) will be ill conditioned. Hence for unstable A, modifications of algorithm \({\mathcal {A}}\) are required.
While for test examples one can compare the entries of the matrix A and its approximation \(\hat{A}\), it may appear more realistic to compare the values \(x(1),\ldots ,x(N)\) of the data series and the values \(\hat{x}(1),\ldots ,\hat{x}(N)\) generated by the iteration of the matrix \(\hat{A}\).
In general, one should not expect that increasing the number of data points will lead to better approximations of the matrix A. If the matrix A is diagonalizable, for generic initial points \(x(0)\in {\mathbb {R}}^{n}\) the data points x(k) will approach, for \(k\rightarrow \infty\), the eigenspace for the eigenvalue with maximal modulus. For general A and generic initial points \(x(0)\in {\mathbb {R}}^{n}\), the data points x(N) will approach, for \(N\rightarrow \infty\), the largest Lyapunov space (i.e., the sum of the real generalized eigenspaces for eigenvalues with maximal modulus). Thus in the limit for \(N\rightarrow \infty\), only part of the matrix can be approximated. A detailed discussion of this (well known) limit behavior is given, e.g., in Colonius and Kliemann [6]. A consequence is that a medium length of the time series should be adequate.
This problem can be overcome by choosing the regularization parameter \(\gamma\) in (5) and (7) using the method of cross validation described in [12]. Briefly, in order to choose \(\gamma\), we consider a set of values of regularization parameters: we run the learning algorithm over a subset of the samples for each value of the regularization parameter and choose the one that performs the best on the remaining data set. Cross validation helps also in the presence of noise and to improve the results beyond the training set.
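The selection of the regularization parameter can be sketched as follows for the scalar case. This is a simple hold-out variant, not necessarily the exact procedure of [12]: the fit uses the even-indexed samples and the validation error is measured on the odd-indexed ones. The grid of powers of two is an assumption, suggested by reported values such as \(1.5259\,\times \,10^{-5}\approx 2^{-16}\) and \(0.9313\,\times \,10^{-9}\approx 2^{-30}\). The scaling of \(\gamma\) by the number of samples, the noise model and all names are likewise illustrative.

```python
import random


def fit_scalar(xs, ys, gamma):
    """Scalar regularized least squares with the linear kernel k(x, y) = x*y.

    Closed form of c = (K + m*gamma*I)^{-1} y followed by a_hat = sum_j c_j x_j.
    """
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + m * gamma)


def holdout_cv(xs, ys, gammas):
    """Pick gamma by fitting on even-indexed samples, validating on odd ones."""
    x_tr, y_tr = xs[0::2], ys[0::2]
    x_va, y_va = xs[1::2], ys[1::2]
    best = None
    for gamma in gammas:
        a_hat = fit_scalar(x_tr, y_tr, gamma)
        err = sum((y - a_hat * x) ** 2 for x, y in zip(x_va, y_va))
        if best is None or err < best[0]:
            best = (err, gamma, a_hat)
    return best[1], best[2]


def make_data(alpha, x0, n, noise, seed=0):
    """Samples (x(k), x(k+1) + eta_k) of x(k+1) = alpha*x(k), eta_k uniform."""
    rng = random.Random(seed)
    xs, ys, x = [], [], x0
    for _ in range(n):
        nxt = alpha * x
        xs.append(x)
        ys.append(nxt + rng.uniform(-noise, noise))
        x = nxt
    return xs, ys


gammas = [2.0 ** (-p) for p in range(0, 31)]  # assumed grid of powers of two
xs, ys = make_data(alpha=0.5, x0=-0.5, n=100, noise=0.0)
gamma_clean, alpha_clean = holdout_cv(xs, ys, gammas)
xs_n, ys_n = make_data(alpha=0.5, x0=-0.5, n=100, noise=0.1)
gamma_noisy, alpha_noisy = holdout_cv(xs_n, ys_n, gammas)
print(gamma_clean, alpha_clean, gamma_noisy, alpha_noisy)
```

Without noise the smallest value on the grid is selected, since every amount of regularization only biases the estimate. With noise the selected value depends on the realization.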
A theoretical justification of our algorithm is guaranteed by the error estimates in Theorem 5. In fact, for the linear dynamical system (1), we have that \(f^{*}\) in (74) is the linear map \(f^{*}(x)=f_{i}(x)\) in (3) and the samples \({\mathbf {s}}\) in (76) are \((x(k),x_{i}(k+1)+\eta _{i})\). Moreover, by choosing the linear kernel \(k(x,y)=\langle x,y\rangle\) we get that \(f^{*}\in {\mathcal {H}}_{K}\). In this case, (84) has the form
where \(\Vert x_{i}(k+1)\Vert _{{\mathcal {H}}_{K}}=\sum _{j=1}^{\infty }\frac{c_{i,j}^{2} }{\lambda _{j}}\). See [3] for more details about error estimates in the general nonlinear case.
The first term on the right-hand side of inequality (10) represents the error due to the noise (sampling error) and the second term represents the error due to regularization (regularization error) and the finite number of samples (integration error).
Next, we discuss several numerical examples, beginning with the following scalar equation.
Example 1
Consider \(x(k+1)=\alpha x(k)\) with \(\alpha =0.5\). With the initial condition \(x(0)=-\,0.5\), we generate the time series \(x(1),\ldots ,x(100)\). Applying algorithm \({\mathcal {A}}\) with the regularization parameter \(\gamma =10^{-6}\) we compute \(\hat{\alpha }=0.4997\). Using cross validation, we get that \(\hat{\alpha }=0.5\) with regularization parameter \(\gamma =1.5259\,\times \,10^{-5}\). When we introduce an i.i.d perturbation signal \(\eta _{i}\in [-\,0.1,0.1]\), the algorithm does not behave well when we fix the regularization parameter. With cross validation, the algorithm works quite well and the regularization parameter adapts to the realization of the signal \(\eta _{i}\). Here, for \(e(k)=x(k)-\hat{x}(k)\) with \(x(k+1)=\alpha x(k)\) and \(\hat{x}(k+1)=\hat{\alpha }\hat{x}(k)\), we get that \(\Vert e(300)\Vert =\sqrt{\sum _{i=1}^{300}e^{2}(i)}=0.0914\) and \(\sqrt{\sum _{i=100}^{300}e^{2} (i)}=1.8218\,\times \,10^{-30}\).
We observe an analogous behavior of the algorithm when the data are generated from \(x(k+1)=\alpha x(k)+\varepsilon x(k)^{2}\) where the algorithm works well in the presence of noise and structural perturbations when using cross validation. When \(\varepsilon =0.1\) and with an i.i.d perturbation signal \(\eta _{i}\in [-\,0.1,0.1]\), \(\hat{\alpha }\) varies between 0.38 and 0.58 depending on the realization of \(\eta _{i}\) but \(\Vert e(300)\Vert =\sqrt{\sum _{i=1}^{300}e^{2}(i)}=0.2290\) and \(\sqrt{\sum _{i=100}^{300}e^{2} (i)}=2.8098\,\times \,10^{-30}\) which shows that the error e decreases exponentially and the generalization properties of the algorithm are quite good.
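The two error norms used in Example 1 can be computed as in the following Python sketch. It takes the noise-free estimate \(\hat{\alpha }=0.4997\) quoted above as input and does not attempt to reproduce the numbers reported for the noisy case. Starting both recursions from \(x(0)=-\,0.5\) is an assumption, and the function names are illustrative.

```python
import math


def trajectory(alpha, x0, steps):
    """Return [x(0), ..., x(steps)] for x(k+1) = alpha * x(k)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(alpha * xs[-1])
    return xs


def error_norm(x_true, x_hat, start, stop):
    """sqrt(sum_{i=start}^{stop} e(i)^2) with e(i) = x(i) - x_hat(i)."""
    return math.sqrt(
        sum((x_true[i] - x_hat[i]) ** 2 for i in range(start, stop + 1))
    )


x_true = trajectory(0.5, -0.5, 300)
x_hat = trajectory(0.4997, -0.5, 300)
full_error = error_norm(x_true, x_hat, 1, 300)    # over k = 1, ..., 300
tail_error = error_norm(x_true, x_hat, 100, 300)  # beyond the training window
print(full_error, tail_error)
```

Because both recursions are stable, the tail error is many orders of magnitude smaller than the error accumulated over the first steps.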
Example 2
Consider \(x(k+1)=Ax(k)\) with matrix A given by
For the initial condition \(x=[-\,0.9,0.1,15,0.2]^{\prime }\) and with \(N=100\) data points, we get
We then simulate \(x(k+1)=Ax(k)\) and \(\hat{x}(k+1)=\hat{A}\hat{x}(k)\) for \(k=0,\ldots ,200\) to test the accuracy of our approximation beyond the interval \(k=0,\ldots ,100\). Then the norm of the error \(e_{j}(k)=x_{j}(k)-\hat{x} _{j}(k)\), for \(j=1,\ldots ,4\), \(\Vert e_{j}(300)\Vert =\sqrt{\sum _{i=1}^{300}e_{j} ^{2}(i)}\) is of the order of \(10^{-3}\) and \(\sqrt{\sum _{i=100}^{300}e_{j} ^{2}(i)}\) is of the order of \(10^{-11}\) which shows that the error e decreases exponentially and the generalization properties of the algorithm are quite good. The regularization parameters are \(\gamma _{i}=0.9313\,\times \,10^{-9}\) for \(i=1,\ldots ,4\).
Also in the presence of small noise \(\eta _{i}\in [-\,0.01,0.01]\), the algorithm behaves well and the regularization parameters adapt to the realization of \(\eta _{i}\). For example, for a certain realization of \(\eta _{i}\), we obtain the regularization parameters
and the error \(\Vert e_{j}(300)\Vert =\sqrt{\sum _{i=1}^{300}e_{j}^{2}(i)}\) is of the order of \(10^{-1}\) and \(\sqrt{\sum _{i=100}^{300}e_{j}^{2}(i)}\) is of the order of \(10^{-9}\).
Suppose that in addition to a small noise \(\eta _{i}\in [-0.01,0.01],\) there is a quadratic structural perturbation, i.e.,
Then with cross validation for \(\varepsilon =0.001\) the algorithm behaves well. For a particular realization of \(\eta\), the error \(\Vert e_{j}(300)\Vert =\sqrt{\sum _{i=1}^{300}e_{j}^{2}(i)}\) is between 5 and 15 but \(\sqrt{\sum _{i=100}^{300}e_{j}^{2}(i)}\) is of the order of \(10^{-9}\) and the regularization parameters are
These examples show a very good behavior of the algorithm.
3 Unstable case
Consider
\[x(k+1)=Ax(k), \tag{16}\]
where some of the eigenvalues of A are outside the unit circle. Again, we want to estimate A when the following data are given,
which are generated by system (16), thus \(x(k)=A^{k-1}x(1)\).
As remarked above, a direct application of the algorithm \({\mathcal {A}}\) will not work, since the time series diverges fast. Instead, we construct a new time series from (17) associated to an auxiliary stable system.
For a constant \(\sigma >0\) we define the auxiliary system by
\[y(k+1)=\frac{1}{\sigma }Ay(k). \tag{18}\]
Thus
and with \(y(1)=x(1)\) one finds
If we choose \(\sigma >0\) such that the eigenvalues of \(\frac{A}{\sigma }\) are inside the unit circle, we can apply algorithm \({\mathcal {A}}\) to this stable matrix and thereby obtain an estimate of \(\frac{A}{\sigma }\) and hence of A. However, since the eigenvalues of the matrix A are unknown, we will be content with a somewhat weaker condition than stability of \(\frac{A}{\sigma }\).
The data (17) for system (16) yield the following data for system (18):
We propose to choose \(\sigma\) as follows: Define
Clearly, the inequality \(\sigma \le \left\| A\right\|\) holds. We apply algorithm \({\mathcal {A}}\) to the time series y(k). This yields an estimate of \(\frac{A}{\sigma }\) and hence an estimate \(\hat{A}\) of A.
For general A, this choice of \(\sigma\) certainly does not guarantee that the eigenvalues of \(\frac{A}{\sigma }\) are within the unit circle. However, as mentioned above, a generic data sequence \(x(k),k\in {\mathbb {N}}\), will converge to the eigenspace of the eigenvalue with maximal modulus. Hence \(\frac{\left\| x(k+1)\right\| }{\left\| x(k)\right\| }\) will approach the maximal modulus of an eigenvalue, thus this choice of \(\sigma\) will lead to a matrix \(\frac{A}{\sigma }\) which is not “too unstable”.
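The rescaling procedure of this section can be sketched in Python as follows. The series \(y(k)=x(k)/\sigma ^{k-1}\) satisfies \(y(k+1)=\frac{A}{\sigma }y(k)\), so an estimate of \(\frac{A}{\sigma }\) multiplied by \(\sigma\) gives an estimate of A. The default choice of \(\sigma\) in the code, the largest observed ratio \(\Vert x(k+1)\Vert /\Vert x(k)\Vert\), is an assumption made for this illustration, as are the regularized least-squares system, the function names and the test matrix.

```python
import numpy as np


def estimate_matrix(X, Y, gamma):
    """Linear-kernel regularized least squares: A_hat with Y ~ X A_hat^T."""
    N = X.shape[0]
    C = np.linalg.solve(X @ X.T + N * gamma * np.eye(N), Y)
    return C.T @ X


def estimate_unstable(X, gamma, sigma=None):
    """Estimate A from rows x(1), ..., x(N) of a possibly unstable series."""
    norms = np.linalg.norm(X, axis=1)
    if sigma is None:
        # Assumed choice: largest observed growth ratio ||x(k+1)|| / ||x(k)||.
        sigma = np.max(norms[1:] / norms[:-1])
    powers = sigma ** np.arange(X.shape[0])  # sigma^(k-1) for k = 1, ..., N
    Y_aux = X / powers[:, None]              # y(k) = x(k) / sigma^(k-1)
    A_scaled = estimate_matrix(Y_aux[:-1], Y_aux[1:], gamma)  # estimate of A/sigma
    return sigma * A_scaled, sigma


# Hypothetical unstable matrix with eigenvalues 2 and 3.
A_true = np.array([[2.0, 1.0], [0.0, 3.0]])
rows = [np.array([1.0, -1.0])]
for _ in range(29):
    rows.append(A_true @ rows[-1])
X = np.array(rows)
A_hat, sigma = estimate_unstable(X, gamma=1e-10)
print(sigma, np.round(A_hat, 6))
```

In this example the computed \(\sigma\) exceeds the largest eigenvalue modulus, so the auxiliary system is stable. As noted above, this is not guaranteed in general.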
Example 3
Consider \(x(k+1)=\alpha x(k)\) with \(\alpha =11.46\). With the initial condition \(x(0)=-\,0.5\), we generate the time series \(x(1),\ldots ,x(100)\). The algorithm above with the regularization parameter \(\gamma =10^{-6}\) yields the estimate \(\hat{\alpha }=11.4086\). Cross validation leads to the regularization parameter \(\gamma =9.5367\,\times \,10^{-7}\) and the estimate \(\hat{\alpha }=11.4599\). In the presence of a small noise \(\eta \in [-\,0.1,0.1]\), cross validation yields the regularization parameter \(\gamma =0.002\) and the slightly worse estimate \(\hat{\alpha }=11.1319\).
We observe the same behavior in higher dimensional systems where the eigenvalues are of the same order of magnitude.
Example 4
Consider \(x(k+1)=Ax(k)\) with
Using cross validation, we get that
for \(\gamma _{i}=0.9313\,\times \,10^{-9}\), \(i=1,\ldots ,4\).
For different realizations of a noise \(\eta _{i}\) of magnitude \(0.5\,\times \,10^{-4}\), cross validation gives a good approximation of A and the eigenvalues of \(A-\hat{A}\) are all within the unit disk with amplitude of the order of \(10^{-3}\) showing that the dynamics of the error \(e(k)=x(k)-\hat{x}(k)\) is asymptotically stable. For example, for a particular realization of \(\eta _{i}\) of magnitude \(0.5\,\times \,10^{-4}\), we get
with regularization parameters
The algorithm fails in the presence of quadratic structural perturbations. This is due to the choice of a linear kernel. A polynomial kernel, for example, would allow for nonlinear perturbations but this would require a complete reformulation of our algorithm. We leave the extension of our algorithm to the nonlinear case for future work.
The next example is an unstable system with a large gap between the eigenvalues.
Example 5
Consider the system \(x(k+1)=Ax(k)\) with
With the initial condition \(x(0)=[-\,1.9,1]\), we generate the time series \(x(1),\ldots ,x(100)\). The algorithm above yields the (excellent) estimate
In the presence of noise of maximal amplitude \(10^{-4}\) , the algorithm approximates well only the large entry \(a_{11}=20\): For a first realization of \(\eta _{i}\) and with cross validation, we get
with \(\gamma _{1}=1.5259\,\times \,10^{-5}\) and \(\gamma _{2}=2^{20}\). However another realization of \(\eta _{i}\) leads to
with \(\gamma _{1}=3.0518\,\times \,10^{-5}\) and \(\gamma _{2}=2.8147\,\times \,10^{14}\). This is due to the fact that the data converge to the eigenspace generated by the largest eigenvalue \(\lambda =20\). However, the eigenvalues of \(A-\hat{A}\) are within the unit disk with small amplitude which guarantees that the error dynamics of \(e(k)=x(k)-\hat{x}(k)\) converges to the origin quite quickly. We observe the same phenomenon with
Here, in the absence of noise, we obtain the estimate
with \(\gamma _{1}=\gamma _{2}=0.9313\,\times \,10^{-9}\). In the presence of noise \(\eta _{i}\) with amplitude \(10^{-4}\), the data converge to the eigenspace corresponding to the largest eigenvalue \(\lambda =25\): for some realization of \(\eta _{i}\) one obtains the estimate
while for another realization of \(\eta\)
The regularization parameters \(\gamma _{1}\) and \(\gamma _{2}\) adapt to the realization of the noise.
As already remarked at the end of Sect. 2, we see that “more data” does not necessarily lead to better results, since the data sequence converges to the eigenspace generated by the largest eigenvalue. However, whether with or without noise, the approximations of A are good enough to reduce the error between \(x(k+1)=Ax(k)\) and \(\hat{x}(k+1)=\hat{A}\hat{x}(k)\) outside of the training examples, since cross-validation determines a good regularization parameter \(\gamma\) that balances between good fitting and good prediction properties.
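The convergence of the data to the dominant eigenspace can be observed directly. The following Python sketch uses a hypothetical \(2\times 2\) matrix with eigenvalues 2 and 0.5 (not one of the matrices of the examples) and tracks the component of the normalized data points outside the eigenspace of the larger eigenvalue.

```python
import math


def apply(A, x):
    """Multiply the 2x2 matrix A (nested lists) with the vector x."""
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]


def normalized_orbit(A, x0, steps):
    """Directions x(k)/||x(k)|| for k = 0..steps (normalizing avoids overflow)."""
    x = x0
    out = []
    for _ in range(steps + 1):
        norm = math.hypot(x[0], x[1])
        x = [x[0] / norm, x[1] / norm]
        out.append(x)
        x = apply(A, x)
    return out


# Hypothetical matrix with eigenvalues 2 and 0.5; the eigenvector for 2 is [1, 0].
A = [[2.0, 1.0], [0.0, 0.5]]
directions = normalized_orbit(A, [1.0, 1.0], 30)
# Component of each direction outside the dominant eigenspace span{[1, 0]}.
off_axis = [abs(d[1]) for d in directions]
print(off_axis[0], off_axis[10], off_axis[30])
```

After a few dozen steps the samples are numerically indistinguishable from multiples of the dominant eigenvector, so later samples carry almost no information about the remaining entries of the matrix.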
The next example has an eigenvalue on the unit circle.
Example 6
Consider \(x(k+1)=Ax(k)\) with
The set of eigenvalues of A is \(\text{ spec }(A)=\{-\,1.5000,1.0000,10.4000,-\,21.9000\}\). In the absence of noise and for the initial condition \(x=[-\,0.9,15,1.5,2.5]\) with \(N=100\) points, we compute the estimate
and regularization parameters \(\gamma _{1}=\gamma _{2}=0.9313\,\times \,10^{-9}\). In this case, the set of eigenvalues of \(\hat{A}\) is
For a given realization of \(\eta \in [-10^{-4},10^{-4}]\), we obtain the estimate
with \(\gamma _{1}=0.0745\,\times \,10^{-7}\) and \(\gamma _{2}=0.1490\,\times \,10^{-7}\). The eigenvalues of \(A-\hat{A}\) are of the order of \(10^{-4}\) which guarantees that the error dynamics converges quickly to the origin. However, the set of eigenvalues of \(\hat{A}\) is
Hence an additional unstable eigenvalue occurs.
Example 7
Consider \(x(k+1)=Ax(k)\) with
The eigenvalues of A are given by
For an initial condition \(x=[-\,0.9;15;1.5;2.5]\) and with \(N=100\) data points, we get
with eigenvalues given by
Here we used \(\gamma _{i}=10^{-12}\), \(i=1,\ldots ,4\). Moreover, the eigenvalues of \(A-\hat{A}\) are quite small and such that the error dynamics converges quickly to the origin. In the presence of noise \(\eta\), the algorithm approximates the largest eigenvalues of A but does not approximate the smaller (stable) ones. For example, for a particular realization of noise with amplitude \(10^{-4}\), we get the estimate
and \(\text{ spec }(\hat{A})=\{-\,40.0009,0.1620\pm 0.8438i,15.3008\}\).
For another realization of noise with amplitude \(10^{-2}\), we get the estimate
and \(\text{ spec }(\hat{A})=\{-\,40.1391,3.9326,0.9601,15.3002\}\).
The algorithm introduced above also allows us to compute the topological entropy of linear systems, since it is determined by the unstable eigenvalues. Recall that the topological entropy of a linear map on \({\mathbb {R}}^{n}\) is defined in the following way:
Fix a compact subset \(K\subset {\mathbb {R}}^{n}\), a time \(\tau \in {\mathbb {N}}\) and a constant \(\varepsilon >0\). Then a set \(R\subset {\mathbb {R}}^{n}\) is called \((\tau ,\varepsilon )\)-spanning for K if for every \(y\in K\) there is \(x\in R\) with
By compactness of K, there are finite \((\tau ,\varepsilon )\)-spanning sets. Let R be a \((\tau ,\varepsilon )\)-spanning set of minimal cardinality \(\#R=r_{\min }(\tau ,\varepsilon ,K)\). Then
(the limits exist). Finally, the topological entropy of A is
where the supremum is taken over all compact subsets K of \({\mathbb {R}}^{n}.\)
A classical result due to Bowen (cf. [21, Theorem 8.14]) shows that the topological entropy is determined by the sum of the unstable eigenvalues, i.e.,
where summation is over all eigenvalues of A counted according to their algebraic multiplicity.
Hence, when we approximate the unstable eigenvalues of A by those of the matrix \(\hat{A}\), we also get an approximation of the topological entropy.
Example 8
For Example 6, we get that \(h_{top}(A)=34.80\) while for the estimate \(\hat{A}\) one obtains \(h_{top}(\hat{A})=34.7999\). For Example 7, we get that \(h_{top}(A)=55.30\) and \(h_{top}(\hat{A})=55.3008\). These estimates appear reasonably good.
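The value reported for Example 6 coincides with the sum of the moduli of the eigenvalues of modulus at least one, \(1.5+1.0+10.4+21.9=34.80\). The usual statement of Bowen's theorem (cf. [21, Theorem 8.14]) sums the logarithms of the moduli of the eigenvalues outside the unit circle. The following Python sketch computes both quantities from a list of eigenvalues so that either convention can be checked. The function names are illustrative, and the natural logarithm is an assumption about the base.

```python
import math


def bowen_entropy(eigenvalues):
    """Sum of log|lambda| over eigenvalues with |lambda| > 1 (natural log)."""
    return sum(math.log(abs(lam)) for lam in eigenvalues if abs(lam) > 1.0)


def sum_of_moduli(eigenvalues):
    """Sum of |lambda| over eigenvalues with |lambda| >= 1."""
    return sum(abs(lam) for lam in eigenvalues if abs(lam) >= 1.0)


spec_A = [-1.5, 1.0, 10.4, -21.9]  # spectrum of A stated in Example 6
print(sum_of_moduli(spec_A))  # close to 34.8, the value reported in Example 8
print(bowen_entropy(spec_A))  # equals log(1.5 * 10.4 * 21.9)
```

Both functions also accept complex eigenvalues, since only the moduli enter.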
4 Identification of linear control systems
Consider the linear control system
\[x(k+1)=Ax(k)+Bu(k), \tag{50}\]
with \(A\in {\mathbb {R}}^{n\times n}\) and \(B\in {\mathbb {R}}^{n\times 1}\). We want to estimate the matrices A and B from the time series \(x(1)+\eta _{1},\) \(\ldots ,x(N)+\eta _{N}\) where \(\eta\) satisfies the Assumption in Sect. 2. The initial condition x(0) and the control sequence u(0), \(\ldots ,u(N)\) are assumed to be known.
In order to estimate A and B, we will extend algorithm \({\mathcal {A}}\). The ith component of system (50) is given by
\[x_{i}(k+1)=\sum _{j=1}^{n}a_{ij}x_{j}(k)+b_{i}u(k). \tag{51}\]
For every i we want to estimate the coefficients \(b_{i}\) and \(a_{ij},j=1,\) \(\ldots ,n\). Thus the linear map \(f_{i}:{\mathbb {R}}^{n}\rightarrow {\mathbb {R}}\) given by
is unknown. To extend algorithm \({\mathcal {A}}\), we will view system (51) as a system of the form (2) where the state x is the extended state \(\underline{x}=(x,u)\in {\mathbb {R}}^{n}\times {\mathbb {R}}\) for (50). Hence, the kernel expansion (6) becomes
where \(\underline{x}_{n+1}=u\) and the \(c_{ij}\) satisfy the following set of equations:
with
Let us emphasize that \(u=x_{n+1}\) does not appear on the left hand side of (53)–(54).
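The extended-state construction can be sketched in Python as follows. The samples are the extended states \((x(k),u(k))\), the targets are the states \(x(k+1)\) only, and the estimate of \([A\;B]\) is read off from the kernel coefficients. The regularized least-squares system with the scaling \(N\gamma\), the initial condition \(x(0)=0.1\) and all names are assumptions made for this illustration. The system and input are those of Example 9 below.

```python
import numpy as np


def estimate_AB(X, U, Y, gamma):
    """Estimate A and B of x(k+1) = A x(k) + B u(k) with the linear kernel.

    X: (N, n) states x(k), U: (N,) scalar inputs u(k), Y: (N, n) states x(k+1).
    """
    Z = np.column_stack([X, U])  # extended states (x(k), u(k))
    N = Z.shape[0]
    K = Z @ Z.T                  # Gram matrix of the extended samples
    C = np.linalg.solve(K + N * gamma * np.eye(N), Y)
    AB = C.T @ Z                 # (n, n+1) estimate of [A  B]
    return AB[:, :-1], AB[:, -1:]


# System of Example 9; x(0) = 0.1 is a hypothetical initial condition.
A_true = np.array([[-0.9]])
B_true = np.array([[3.5]])
N = 100
U = np.array([np.sin(k) + np.cos(k) for k in range(N)])
X = np.zeros((N + 1, 1))
X[0] = 0.1
for k in range(N):
    X[k + 1] = A_true @ X[k] + B_true[:, 0] * U[k]

A_hat, B_hat = estimate_AB(X[:-1], U, X[1:], gamma=1e-10)
print(A_hat, B_hat)
```

Only the state components appear as regression targets, in line with the remark that \(u\) does not appear on the left-hand side.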
In reference to the case when A has eigenvalues outside the unit circle, we adopt the same method as in Sect. 3 and define
Example 9
(One dimensional case) Consider \(x(k+1)=-\,0.9x(k)+3.5u\). For an input \(u(k)=\sin (k)+\cos (k)\) and for 100 points we obtain the estimate \(\hat{A}=-\,0.9\) and \(\hat{B}=3.5\) when there is no noise \(\eta _{i}\). Here cross validation gives \(\gamma _{1}=1.5259\,\times \,10^{-5}\) and \(\gamma _{2}=1\). For a certain realization of the noise \(\eta _{i}\) with amplitude 0.1, we get \(\hat{A}=-\,0.9008\) and \(\hat{B}=3.4983\). Here cross validation gives \(\gamma _{1}=0.0078\) and \(\gamma _{2}=1\).
Example 10
(Three dimensional stable case) Consider control system (50) with
With the input \(u(k)=\sin (k)+\cos (k)\) and 100 points, one computes the estimates
Here cross validation gives the regularization parameters \(\gamma _{i}=0.1526\,\times \,10^{-4}\) for \(i=1,\ldots ,4\). For some realization of perturbations \(\eta _{i}\) with amplitude 0.1, one computes the estimates
Here cross validation gives \(\gamma _{1}=9.7656\,\times \,10^{-4}\), \(\gamma _{2}=9.7656\,\times \,10^{-4}\), \(\gamma _{3}=1.5259\,\times \,10^{-5}\), \(\gamma _{4}=4\).
Example 11
(Three dimensional unstable case) Consider control system (50) with
The input \(u(k)=\sin (k)+\cos (k)\) and 100 points give the estimates
Here cross validation yields the regularization parameters \(\gamma _{i}=0.8882\,\times \,10^{-15}\) for \(i=1,\ldots ,4\). For some realization of perturbations \(\eta _{i}\) with amplitude \(10^{-4}\), one computes the estimates
Here cross validation gives \(\gamma _{1}=\gamma _{2}=0.2384\,\times \,10^{-6}\), \(\gamma _{3}=\gamma _{4}=0.0596\,\times \,10^{-6}\).
These results show that algorithm \({\mathcal {A}}\) works quite well in these cases.
5 Stabilization via linear-quadratic optimal control
A basic problem for linear control systems is stabilization by state feedback. A standard method is to use linear quadratic optimal control, where the feedback is computed using the solution of an algebraic Riccati equation. In this section, we propose to replace in the algebraic Riccati equation the system matrix A by the estimate \(\hat{A}\) obtained by learning theory.
The linear quadratic optimal control problem has the following form:
Minimize over all (continuous) inputs u
with \(x(\cdot )\) given by
Here \(Q\in {\mathbb {R}}^{n\times n}\) is positive semidefinite, \(R\in {\mathbb {R}}^{m\times m}\) is positive definite, and \(A\in {\mathbb {R}}^{n\times n},B\in {\mathbb {R}}^{n\times m}\).
Consider the discrete algebraic Riccati equation (DARE)
\[P=Q+A^{\top }PA-A^{\top }PB(R+B^{\top }PB)^{-1}B^{\top }PA.\]
Obviously, every solution P is positive semi-definite. We cite the following theorem from [1].
Theorem Suppose that for every \(x_{0}\in {\mathbb {R}}^{n}\) there is an input u, such that \(J(x_{0},u)<\infty\). Then the following holds:
(i) There is a unique solution P of the DARE.
(ii) For every \(x_{0}\in {\mathbb {R}}^{n}\) one has \(J^{*}(x_{0}):=\inf \{J(x_{0},u)\mid u \text{ an input}\}=x_{0}^{\top }Px_{0}\) and there is a unique optimal input \(u^{*}\) with \(J^{*}(x_{0})=J(x_{0},u^{*})\). This optimal input is generated by the feedback \(F=(R+B^{\top }PB)^{-1}B^{\top }PA\) and
In particular, the feedback F stabilizes the system, i.e., \({x} (k+1)=(A-BF)x(k)\) is stable.
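The computation of the feedback can be sketched in Python with a fixed-point (Riccati) iteration for the standard discrete algebraic Riccati equation \(P=Q+A^{\top }PA-A^{\top }PB(R+B^{\top }PB)^{-1}B^{\top }PA\). The iteration scheme, the function names and the weights \(Q=R=1\) are assumptions made for illustration. With these weights and the system \(x(k+1)=-\,0.9x(k)+3.5u\), the closed-loop value comes out at approximately \(-\,0.064\). The last lines apply a feedback designed from the perturbed values \(\hat{A}=-\,0.9002\), \(\hat{B}=3.4929\) to the true system.

```python
import numpy as np


def solve_dare(A, B, Q, R, max_iter=10000, tol=1e-13):
    """Riccati iteration for P = Q + A'PA - A'PB (R + B'PB)^{-1} B'PA."""
    P = Q.copy()
    for _ in range(max_iter):
        gain = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ A - A.T @ P @ B @ gain
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P  # last iterate if the tolerance was not reached


def lq_feedback(A, B, Q, R):
    """Return (P, F) with F = (R + B'PB)^{-1} B'PA, used as u = -F x."""
    P = solve_dare(A, B, Q, R)
    F = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, F


# Scalar system x(k+1) = -0.9 x(k) + 3.5 u; Q = R = 1 are assumed weights.
A = np.array([[-0.9]])
B = np.array([[3.5]])
Q = np.eye(1)
R = np.eye(1)
P, F = lq_feedback(A, B, Q, R)
closed_loop = A - B @ F

# Feedback designed from perturbed estimates, applied to the true system.
A_hat = np.array([[-0.9002]])
B_hat = np.array([[3.4929]])
_, F_hat = lq_feedback(A_hat, B_hat, Q, R)
true_with_estimated_feedback = A - B @ F_hat
print(closed_loop, true_with_estimated_feedback)
```

The iteration converges quickly here because the closed-loop system is strongly contracting. For larger problems a dedicated Riccati solver would be the more robust choice.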
Now we use an estimate \(\hat{A}\) and \(\hat{B}\) (obtained by kernel methods) instead of A and B in the algebraic Riccati equation and obtain the solution \(\hat{P}\). Will the corresponding feedback \(u=\hat{F}x:=-B^{\top } \hat{P}x\) also stabilize the system, i.e., is the following system stable:
Example 12
Consider the one-dimensional system \(x(k+1)=-\,0.9x(k)+3.5u\) in Example 9. In the absence of noise, we get \(\hat{A}=-\,0.9\) and \(\hat{B}=3.5\). We have that \(A-B\hat{F}=\hat{A}-\hat{B}\hat{F}=-\,0.0643\). When there is noise of amplitude 0.1, we get that \(\hat{A}=-\,0.9002\) and \(\hat{B}=3.4929\) and \(A-B\hat{F}=-\,0.0643\) while \(\hat{A}-\hat{B}\hat{F}=-\,0.0610\). Hence, the controller improves stability.
Example 13
Consider control system (50) with
As illustrated in Example 10, without noise we get excellent approximations of A and B. For both cases, the set of eigenvalues of the closed-loop system is \(\{-\,0.6172,0.4049,-\,0.0018\}\). With a noise of maximal amplitude 0.1, the estimates \(\hat{A}\) and \(\hat{B}\) are given in Example 10. For the feedback system one finds
In this example the feedback based on the estimate also stabilizes the original system.
Example 14
Consider control system (50) with
As Example 11 illustrates, without noise we get excellent approximations of A and B. For the feedback system one finds
When there is noise of amplitude \(10^{-4}\), one computes the estimates
These are bad approximations of A and B. Furthermore, for the feedback system one finds
Thus the stabilizing controller for the approximate system does not stabilize the true system.
6 Conclusions
This paper has introduced the algorithm \({\mathcal {A}}\) based on kernel methods to identify a stable linear dynamical system from a time series. The numerical experiments give excellent results in the absence of noise and structural perturbations. In the presence of noise and structural perturbations the algorithm works well in the stable case. In the unstable case, a modified algorithm works quite well in the presence of noise but cannot handle structural perturbations.
Then we have extended algorithm \({\mathcal {A}}\) to identify linear control systems. In particular, we have used estimates obtained by kernel methods to stabilize linear systems using linear-quadratic control and the algebraic Riccati equation. Here the numerical experiments seem to indicate that the same conclusions on applicability of the algorithm apply.
Extensions of the considered algorithms to nonlinear systems appear feasible and are left to future work.
Notes
A suggestion in [18] is to consider the \(\rho _{X}\)-volume of the Voronoi cell associated with \(\bar{x}\). Another example is \(w=1\) or if \(|\bar{x}|=m<\infty\), \(w=\frac{1}{m}\).
This assumption is true if X is compact and the inclusion map of \({\mathcal {H}}_{K,\bar{t}}\) into the space of Lipschitz functions on X is bounded which is the case when K is a \(C^{2}\) Mercer kernel [22]. In fact, if \(\Vert f\Vert _{\text{ Lip }(X)}\le C_{0}\Vert f\Vert _{K}\) for each \(f\in {\mathcal {H}}_{K,\bar{t}}\), then \(C_{\bar{x}}\le C_{0}^{2} \rho _{X}(X)\).
References
Antsaklis PJ, Michel AN (2006) Linear systems. Birkhäuser, Boston
Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Bouvrie J, Hamzi B (2017) Kernel methods for the approximation of nonlinear systems. SIAM J Control Optim 55(4):2460–2492
Bouvrie J, Hamzi B (2017) Kernel methods for the approximation of some key quantities of nonlinear systems. J Comput Dyn 4(1–2):1–19
Cheney W, Light W (2009) A course in approximation theory, graduate studies in mathematics, vol 101. American Mathematical Society, Providence
Colonius F, Kliemann W (2014) Dynamical systems and linear algebra, graduate studies in mathematics, vol 158. American Mathematical Society, Providence
Cucker F, Smale S (2001) On the mathematical foundations of learning. Bull Am Math Soc 39:1–49
Evgeniou T, Pontil M, Poggio T (2000) Regularization networks and support vector machines. Adv Comput Math 13(1):1–50
Hamzi B, Colonius F (2019) Kernel methods for discrete-time linear equations. Springer Lecture Notes in Computer Science (LNCS), vol 11539
Li L, Zhou D-X (2015) Learning theory approach to a system identification problem involving atomic norm. J Fourier Anal Appl 21:734
Pillonetto G, Dinuzzo F, Chen T, De Nicolao G, Ljung L (2014) Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50(3):657–682
Rifkin RM, Lippert A (2007) Notes on regularized least squares. Computer science and artificial intelligence laboratory technical report, MIT, MIT-CSAIL-TR-2007-025, CBCL-268
Schoenberg IJ (1935) Remarks to Maurice Fréchet’s article “Sur la définition axiomatique d’une classe d’espace distanciés vectoriellement applicable sur l’espace de Hilbert”. Ann Math 36:724–732
Schoenberg IJ (1937) On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space. Ann Math 38:787–793
Schoenberg IJ (1938) Metric spaces and positive definite functions. Trans Am Math Soc 44:522–536
Schölkopf B, Smola AJ (2002) Learning with kernels. MIT Press, Cambridge
Smale S, Zhou D-X (2003) Estimating the approximation error in learning theory. Anal Appl 1(1):17
Smale S, Zhou D-X (2004) Shannon sampling and function reconstruction from point values. Bull Am Math Soc 41:279–305
Smale S, Zhou D-X (2005) Shannon sampling II: connections to learning theory. Appl Comput Harmon Anal 19(3):285–302
Wahba G (1990) Spline models for observational data. In: SIAM CBMS-NSF regional conference series in applied mathematics, vol 59
Walters P (1982) An introduction to ergodic theory. Springer, Berlin
Zhou D-X (2003) Capacity of reproducing kernel spaces in learning theory. IEEE Trans Inf Theory 49(7):1743–1752
Acknowledgements
BH thanks the European Commission for financial support received through Marie Curie Fellowships.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: Elements of learning theory
In this section, we give a brief overview of Reproducing Kernel Hilbert Spaces (RKHS) as used in statistical learning theory. The discussion here borrows heavily from Cucker and Smale [7], Wahba [20], and Schölkopf and Smola [16]. Early work developing the theory of RKHS was undertaken by Schoenberg [13,14,15] and then Aronszajn [2]. Historically, RKHS arose from the question of when a metric space can be embedded into a Hilbert space.
Definition 1
Let \({\mathcal {H}}\) be a Hilbert space of functions on a set \({\mathcal {X}}\) which is a closed subset of \({\mathbb {R}}^{n}\). Denote by \(\langle f, g \rangle\) the inner product on \({\mathcal {H}}\) and let \(\Vert f\Vert = \langle f, f \rangle ^{1/2}\) be the norm in \({\mathcal {H}}\), for f and \(g \in {\mathcal {H}}\). We say that \({\mathcal {H}}\) is a reproducing kernel Hilbert space (RKHS) if there exists \(K:{\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}\) such that
(i) K has the reproducing property, i.e., \(f(x)=\langle f(\cdot ),K(\cdot ,x)\rangle\) for all \(f\in {\mathcal {H}}\) and all \(x\in {\mathcal {X}}\).
(ii) K spans \({\mathcal {H}}\), i.e., \({\mathcal {H}}=\overline{\text{ span }\{K(x,\cdot )|x\in {\mathcal {X}}\}}\).
K will be called a reproducing kernel of \({\mathcal {H}}\) and \({\mathcal {H}}_{K}\) will denote the RKHS \({\mathcal {H}}\) with reproducing kernel K.
Definition 2
Given a kernel \(K:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}\) and inputs \(x_{1},\ldots ,x_{n}\in {\mathcal {X}}\), the \(n\times n\) matrix
\[ {\mathbf {K}}:=\left( K(x_{i},x_{j})\right) _{i,j=1}^{n} \]
is called the Gram matrix of K with respect to \(x_{1},\ldots ,x_{n}\). If for all \(n\in {\mathbb {N}}\) and all distinct \(x_{i}\in {\mathcal {X}}\) the kernel K gives rise to a strictly positive definite Gram matrix, then K is called strictly positive definite.
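As an illustration of Definition 2, the following minimal sketch builds a Gram matrix for a Gaussian kernel and inspects its eigenvalues. The helper names `gram_matrix` and `gaussian_kernel`, the sample points, and the choice \(\sigma =1\) are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / sigma^2) (illustrative choice)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma**2))

def gram_matrix(kernel, points):
    """Return the n x n matrix with entries kernel(x_i, x_j)."""
    n = len(points)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(points[i], points[j])
    return G

# Four distinct points in R^2 (hypothetical data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5]])
G = gram_matrix(gaussian_kernel, points)

# A Gram matrix is symmetric; strict positive definiteness means that
# every eigenvalue is strictly positive.
is_symmetric = np.allclose(G, G.T)
min_eigenvalue = np.linalg.eigvalsh(G).min()
print(is_symmetric, min_eigenvalue > 0.0)
```

For these well-separated points the matrix is diagonally dominant, so the smallest eigenvalue is safely positive; for nearly coincident points it can become very small, which is one reason a regularization term is used in practice.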
Definition 3
(Mercer kernel map) A function \(K:{\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}\) is called a Mercer kernel if it is continuous, symmetric and positive definite.
The important properties of reproducing kernels are summarized in the following proposition.
Proposition 1
If K is a reproducing kernel of a Hilbert space \({\mathcal {H}}\), then
(i) K(x, y) is unique.
(ii) For all \(x,y\in {\mathcal {X}}\), \(K(x,y)=K(y,x)\) (symmetry).
(iii) \(\sum _{i,j=1}^{m}\alpha _{i}\alpha _{j}K(x_{i},x_{j}) \ge 0\) for \(\alpha _{i} \in {\mathbb {R}}\) and \(x_{i} \in {\mathcal {X}}\) (positive definiteness).
(iv) \(\langle K(x,\cdot ),K(y,\cdot ) \rangle _{\mathcal {H}}=K(x,y)\).
(v) The following kernels, defined on a compact domain \({\mathcal {X}} \subset {\mathbb {R}}^{n}\), are Mercer kernels: \(K(x,y)=x\cdot y^{\top }\) (Linear), \(K(x,y)=(1+x\cdot y^{\top })^{d},\quad d\in {\mathbb {N}}\) (Polynomial), \(K(x,y)=e^{-\frac{\Vert x-y\Vert ^{2}}{\sigma ^{2}}},\quad \sigma >0\) (Gaussian).
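Properties (ii) and (iii) can be checked numerically for the three kernels listed in (v). The sketch below is only a sanity check on randomly drawn points, not a proof; the function names, the degree \(d=3\), the width \(\sigma =1\), and the random seed are assumptions chosen for illustration.

```python
import numpy as np

def linear_kernel(X, Y):
    """Matrix of K(x, y) = x . y^T for all rows x of X and y of Y."""
    return X @ Y.T

def polynomial_kernel(X, Y, d=3):
    """Matrix of K(x, y) = (1 + x . y^T)^d."""
    return (1.0 + X @ Y.T) ** d

def gaussian_kernel(X, Y, sigma=1.0):
    """Matrix of K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma**2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(6, 2))   # points in the compact set [-1, 1]^2
alpha = rng.normal(size=6)                # arbitrary real coefficients

results = {}
for name, kernel in [
    ("linear", linear_kernel),
    ("polynomial", polynomial_kernel),
    ("gaussian", gaussian_kernel),
]:
    G = kernel(X, X)
    results[name] = {
        "symmetric": bool(np.allclose(G, G.T)),              # property (ii)
        "quadratic_form": float(alpha @ G @ alpha),          # property (iii)
        "min_eigenvalue": float(np.linalg.eigvalsh(G).min()),
    }

for name, r in results.items():
    print(name, r["symmetric"], r["quadratic_form"] >= -1e-9)
```

The small negative tolerance accounts for rounding: the Gram matrix of the linear kernel on six points in \({\mathbb {R}}^{2}\) has rank at most two, so several of its eigenvalues are zero up to floating-point error.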
Theorem 1
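Let \(K:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}\) be a symmetric and positive definite function. Then there exists a Hilbert space of functions \({\mathcal {H}}\) defined on \({\mathcal {X}}\) admitting K as a reproducing kernel. Moreover, there exists a function \(\varPhi :{\mathcal {X}}\rightarrow {\mathcal {H}}\) such that
\[ K(x,y)=\langle \varPhi (x),\varPhi (y)\rangle _{\mathcal {H}}\quad \text{for all } x,y\in {\mathcal {X}}. \]
\(\varPhi\) is called a feature map.
Conversely, let \({\mathcal {H}}\) be a Hilbert space of functions \(f:{\mathcal {X}} \rightarrow {\mathbb {R}}\), with \({\mathcal {X}}\) compact, satisfying
\[ \forall x\in {\mathcal {X}},\ \exists \kappa _{x}>0\ \text{such that}\ |f(x)|\le \kappa _{x}\Vert f\Vert _{\mathcal {H}}\quad \text{for all } f\in {\mathcal {H}}. \]
Then \({\mathcal {H}}\) has a reproducing kernel K.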
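To make the notion of a feature map concrete, the sketch below uses the polynomial kernel of degree 2 on \({\mathbb {R}}^{2}\) and an explicit finite-dimensional map into \({\mathbb {R}}^{6}\) whose Euclidean inner product reproduces the kernel. This is an assumption-laden illustration: the map `poly2_feature_map` takes values in \({\mathbb {R}}^{6}\) rather than in the RKHS itself, and all names and test points are hypothetical.

```python
import numpy as np

def poly2_kernel(x, y):
    """Polynomial kernel of degree 2: K(x, y) = (1 + x . y)^2."""
    return (1.0 + np.dot(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit feature map Phi: R^2 -> R^6 with <Phi(x), Phi(y)> = (1 + x . y)^2.

    Expanding (1 + x1*y1 + x2*y2)^2 gives the six monomial terms below.
    """
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1**2, x2**2, r2 * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.25])

kernel_value = poly2_kernel(x, y)
feature_value = float(np.dot(poly2_feature_map(x), poly2_feature_map(y)))

# The two numbers agree up to rounding error.
print(abs(kernel_value - feature_value) < 1e-12)
```

The kernel evaluation costs one inner product in \({\mathbb {R}}^{2}\), whereas the explicit map works in \({\mathbb {R}}^{6}\); for higher degrees or the Gaussian kernel the feature space grows quickly or becomes infinite-dimensional, which is why one works with K directly.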
Remark 1
(i) The dimension of the RKHS can be infinite and corresponds to the dimension of the eigenspace of the integral operator \(L_{K}:{\mathcal {L}}_{\nu }^{2}({\mathcal {X}})\rightarrow {\mathcal {C}}({\mathcal {X}})\) defined as \((L_{K}f)(x)=\int K(x,t)f(t)d\nu (t)\) if K is a Mercer kernel, for \(f\in {\mathcal {L}}_{\nu }^{2}({\mathcal {X}})\) and \(\nu\) a Borel measure on \({\mathcal {X}}\).
(ii) In Theorem 1, using property (iv) in Proposition 1, we can take \(\varPhi (x):=K_{x}:=K(x,\cdot )\), in which case \({\mathcal {F}}={\mathcal {H}}\), i.e., the “feature space” is the RKHS. This is called the canonical feature map.
(iii) The fact that Mercer kernels are positive definite and symmetric shows that kernels can be viewed as generalized Gramians and covariance matrices.
(iv) In practice, we choose a Mercer kernel, such as the ones in (v) of Proposition 1; Theorem 1 then guarantees the existence of a Hilbert space admitting this function as a reproducing kernel.
RKHS play an important role in learning theory, whose objective is to find an unknown function
\[ f^{*}:X\rightarrow {\mathbb {R}} \]
from random samples
\[ {\mathbf {s}}=(x_{i},y_{i})_{i=1}^{m}. \tag{75} \]
In the following we review results from [18] (for a more general setting, cf. [7]) in the special case when the data samples \({\mathbf {s}}\) are such that the following assumption holds.
Assumption 1
The samples in (75) have the special form
\[ {\mathbf {s}}=\{(x,y_{x})\}_{x\in \bar{x}}, \tag{76} \]
where \(\bar{x}=\{x_{i}\}_{i=1}^{d+1}\) and \(y_{x}\) is drawn at random from \(f^{*}(x)+\eta _{x}\), where \(\eta _{x}\) is drawn from a probability measure \(\rho _{x}\).
Here, for each \(x \in X\), \(\rho _{x}\) is a probability measure with zero mean, and its variance \(\sigma _{x}^{2}\) satisfies \(\sigma ^{2} :=\sum _{x \in \bar{x}} \sigma _{x}^{2} < \infty\). Let X be a closed subset of \({\mathbb {R}}^{n}\) and let \(\bar{t} \subset X\) be a discrete subset. Now consider a kernel \(K: X \times X \rightarrow {\mathbb {R}}\) and define a (possibly infinite) matrix \(K_{\bar{t},\bar{t}} : \ell ^{2}(\bar{t}) \rightarrow \ell ^{2}(\bar{t})\) as
\[ \left( K_{\bar{t},\bar{t}}\,a\right) _{s}=\sum _{t\in \bar{t}}K(s,t)\,a_{t},\quad s\in \bar{t},\ a\in \ell ^{2}(\bar{t}), \]
where \(\ell ^{2}(\bar{t})\) is the set of sequences \(a=(a_{t})_{t \in \bar{t}}: \bar{t} \rightarrow {\mathbb {R}}\) with \(\langle a,b \rangle =\sum _{t \in \bar{t} }a_{t} b_{t}\) defining an inner product. For example, we can take \(X={\mathbb {R}}\) and \(\bar{t}=\{0,1,\ldots ,d\}\).
In the case of a linear dynamical system (1), we are interested in learning the map \(x(k)\mapsto x(k+1)\). Here we can apply the following results.
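The problem of approximating a function \(f^{*}\in {\mathcal {H}}_{K}\) from samples \({\mathbf {s}}\) of the form (75) has been studied in [18, 19]. It is reformulated as the minimization problem
\[ f_{{\mathbf {s}},\gamma }:=\arg \min _{f\in {\mathcal {H}}_{K,\bar{t}}}\left\{ \sum _{x\in \bar{x}}\left( f(x)-y_{x}\right) ^{2}+\gamma \Vert f\Vert _{K}^{2}\right\} , \tag{78} \]
where \(\gamma \ge 0\) is a regularization parameter. Moreover, when \(\bar{x}\) is not defined by a uniform grid on X, the authors of [18] introduced a weighting \(w:=\{w_{x}\}_{x\in \bar{x}}\) on \(\bar{x}\) with \(w_{x}>0\) (see Footnote 1). Let \(D_{w}\) be the diagonal matrix with diagonal entries \(\{w_{x}\}_{x\in \bar{x}}\). Then, \(\Vert D_{w}\Vert \le \Vert w\Vert _{\infty }\).
In this case, the regularization scheme (78) becomes
\[ f_{{\mathbf {s}},\gamma }:=\arg \min _{f\in {\mathcal {H}}_{K,\bar{t}}}\left\{ \sum _{x\in \bar{x}}w_{x}\left( f(x)-y_{x}\right) ^{2}+\gamma \Vert f\Vert _{K}^{2}\right\} . \tag{79} \]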
Theorem 2
Assume \(f^{*}\in {\mathcal {H}}_{K,\bar{t}}\) and the standing hypotheses with X, K, \(\bar{t}\), \(\rho\) as above, y as in (76). Suppose \(K_{\bar{t},\bar{x}}D_{w}K_{\bar{x},\bar{t}}+\gamma K_{\bar{t} ,\bar{t}}\) is invertible. Define \({\mathcal {L}}\) to be the linear operator \({\mathcal {L}}=(K_{\bar{t},\bar{x}}D_{w}K_{\bar{x},\bar{t}}+\gamma K_{\bar{t},\bar{t}})^{-1}K_{\bar{t},\bar{x}}D_{w}\). Then problem (79) has the unique solution
\[ f_{{\mathbf {s}},\gamma }=\sum _{t\in \bar{t}}({\mathcal {L}}y)_{t}K_{t}. \]
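The operator \({\mathcal {L}}\) of Theorem 2 can be evaluated numerically when \(\bar{t}\) and \(\bar{x}\) are finite. The sketch below is a minimal illustration under several assumptions that are not taken from the paper: the matrices \(K_{\bar{t},\bar{x}}\) and \(K_{\bar{x},\bar{t}}\) are read as \((K(t,x))\) and its transpose, the data are noise-free, and the helper names `fit_coefficients` and `evaluate` as well as all numerical values are hypothetical. It computes \(c={\mathcal {L}}y\) by solving a linear system and evaluates \(f=\sum _{t\in \bar{t}}c_{t}K_{t}\), first for a Gaussian kernel and then for a scalar map \(x(k)\mapsto x(k+1)=a\,x(k)\) with a linear kernel.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Matrix (K(a_i, b_j)) for 1-D point arrays a and b."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma**2)

def linear_kernel(a, b):
    """Matrix (a_i * b_j) for 1-D point arrays a and b."""
    return np.outer(a, b)

def fit_coefficients(kernel, t_nodes, x_samples, y, weights, gamma):
    """Solve (K_tx D_w K_xt + gamma K_tt) c = K_tx D_w y, i.e. c = L y."""
    K_tx = kernel(t_nodes, x_samples)
    K_tt = kernel(t_nodes, t_nodes)
    D_w = np.diag(weights)
    lhs = K_tx @ D_w @ K_tx.T + gamma * K_tt   # K_xt = K_tx^T for a symmetric kernel
    rhs = K_tx @ D_w @ y
    return np.linalg.solve(lhs, rhs)           # avoids forming the inverse explicitly

def evaluate(kernel, c, t_nodes, x):
    """Evaluate f(x) = sum_t c_t K(t, x) at the points x."""
    return kernel(x, t_nodes) @ c

# Example 1: recover f* = sum_t c*_t K_t from noise-free samples.
t_nodes = np.arange(5.0)                        # t-bar = {0, 1, 2, 3, 4}
x_samples = np.linspace(0.0, 4.0, 21)           # uniform grid, so w_x = 1
c_true = np.array([1.0, -0.5, 0.25, 2.0, -1.0])
y = evaluate(gaussian_kernel, c_true, t_nodes, x_samples)
c_hat = fit_coefficients(gaussian_kernel, t_nodes, x_samples, y,
                         weights=np.ones(x_samples.size), gamma=1e-10)
coefficient_error = np.max(np.abs(c_hat - c_true))

# Example 2: scalar linear system x(k+1) = a x(k) with a linear kernel.
# With t-bar = {1}, K_1(x) = x, so the single coefficient estimates a.
a_true = 0.8
trajectory = a_true ** np.arange(11)            # x(0) = 1, ..., x(10)
x_k, x_next = trajectory[:-1], trajectory[1:]
a_hat = fit_coefficients(linear_kernel, np.array([1.0]), x_k, x_next,
                         weights=np.ones(x_k.size), gamma=1e-10)[0]

print(coefficient_error < 1e-6, abs(a_hat - a_true) < 1e-8)
```

With the linear kernel and a single node the formula reduces to ridge-regularized least squares, \(c=\sum _{k}x(k)x(k+1)/(\sum _{k}x(k)^{2}+\gamma )\). Larger values of \(\gamma\) shrink the estimate towards zero, so the choice of \(\gamma\) trades bias against sensitivity to noise.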
Assumption 2
For each \(x \in X\), \(\rho _{x}\) is a probability measure with zero mean supported on \([-\,M_{x},M_{x}]\) with \({\mathcal {B}}_{w} :=(\sum _{x \in \bar{x}}w_{x} M_{x}^{2})^{\frac{1}{2}} < \infty\).
The next theorems give estimates for the different sources of errors.
Theorem 3
(Sample error) [18, Theorem 4, Propositions 2 and 3] Let Assumptions 1 and 2 be satisfied, suppose that \(K_{\bar{t},\bar{x}}D_{w}K_{\bar{x},\bar{t}}+\gamma K_{\bar{t},\bar{t}}\) is invertible and let \(f_{{\mathbf {s}},\gamma }=\sum _{t\in \bar{t}}c_{t}K_{t}\) be the solution of (79) given in Theorem 2 by \(c={\mathcal {L}}y\). Define
Then for every \(0<\delta <1\), with probability at least \(1-\delta\) we have the sample error estimate
where \(\alpha (u):=(u-1)\log u\) for \(u>1\). In particular, \({\mathcal {E}} _{\text{ samp }}\rightarrow 0\) when \(\gamma \rightarrow \infty\) or \(\sigma _{w} ^{2}\rightarrow 0\).
Theorem 4
(Regularization error and integration error) [18, Proposition 4 and Theorem 5] Let Assumptions 1 and 2 be satisfied, let \(\bar{X}=(X_{x})_{x\in \bar{x}}\) be the Voronoi cells of X associated with \(\bar{x}\), and let \(w_{x}=\rho _{X}(X_{x})\). Define the Lipschitz norm on a subset \(X^{\prime }\subset X\) as \(\Vert f\Vert _{\text{ Lip }(X^{\prime })}:=\Vert f\Vert _{L^{\infty }(X^{\prime })}+\sup _{s,u\in X^{\prime },\,s\ne u}\frac{|f(s)-f(u)|}{\Vert s-u\Vert _{\ell ^{\infty }({\mathbb {R}}^{n})}}\) and assume that the inclusion map of \({\mathcal {H}}_{K,\bar{t}}\) into the Lipschitz space satisfies (see Footnote 2)
Suppose that \(\bar{x}\) is \(\varDelta\)-dense in X, i.e., for each \(y\in X\) there is some \(x\in \bar{x}\) satisfying \(\Vert x-y\Vert _{\ell ^{\infty }({\mathbb {R}}^{n})} \le \varDelta\).
Then for \(f^{*}\in {\mathcal {H}}_{K,\bar{t}}\)
Theorem 5
(Sample, regularization and integration errors) [18, Corollary 5] Under the assumptions of Theorems 3 and 4, let \(\bar{X}=(X_{x})_{x\in \bar{x}}\) be the Voronoi cells of X associated with \(\bar{x}\) and \(w_{x}=\rho _{X}(X_{x})\). Suppose that \(\bar{x}\) is \(\varDelta\)-dense, \(C_{\bar{x}}<\infty\), and \(f^{*}\in {\mathcal {H}}_{K,\bar{t}}\). Then, for every \(0<\delta <1\), with probability at least \(1-\delta\) there holds
where \({\mathcal {E}}_{\text{ samp }}\) is given in (81).
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Hamzi, B., Colonius, F. Kernel methods for the approximation of discrete-time linear autonomous and control systems. SN Appl. Sci. 1, 674 (2019). https://doi.org/10.1007/s42452-019-0701-3
DOI: https://doi.org/10.1007/s42452-019-0701-3