Stability analysis of chaotic systems from data

The prediction of the temporal dynamics of chaotic systems is challenging because infinitesimal perturbations grow exponentially. The analysis of the dynamics of infinitesimal perturbations is the subject of stability analysis. In stability analysis, we linearize the equations of the dynamical system around a reference point and compute the properties of the tangent space (i.e. the Jacobian). The main goal of this paper is to propose a method that infers the Jacobian, and thus the stability properties, from observables (data). First, we propose the echo state network (ESN) with the recycle validation as a tool to accurately infer the chaotic dynamics from data. Second, we mathematically derive the Jacobian of the echo state network, which provides the evolution of infinitesimal perturbations. Third, we analyse the stability properties of the Jacobian inferred from the ESN and compare them with the benchmark results obtained by linearizing the equations. The ESN correctly infers the nonlinear solution and its tangent space with negligible numerical errors. In detail, we compute from data only (i) the long-term statistics of the chaotic state; (ii) the covariant Lyapunov vectors; (iii) the Lyapunov spectrum; (iv) the finite-time Lyapunov exponents; and (v) the angles between the stable, neutral, and unstable splittings of the tangent space (the degree of hyperbolicity of the attractor). This work opens up new opportunities for the computation of stability properties of nonlinear systems from data, instead of equations. Supplementary Information: The online version contains supplementary material available at 10.1007/s11071-023-08285-1.

directions and compute the properties of its linear tangent space.
Stability analysis relies on the linearization of the dynamical equations, which requires the Jacobian of the system. The key quantities that characterize chaotic dynamics, and many other related physical properties such as dynamical entropies and fractal dimensions, are the Lyapunov exponents (LEs) [5,6], which are the eigenvalues of the Oseledets matrix [7]. There are several numerical methods to extract the LEs based on the Gram-Schmidt orthogonalization procedure [6,8,9]. The relevant eigenvectors are the corresponding Lyapunov vectors, which constitute a coordinate-dependent orthogonal basis of the linear tangent space. Instead, an intrinsic and norm-independent basis, which is also time-invariant and covariant with the dynamics, is given by the covariant Lyapunov vectors (CLVs). Crucially, CLVs are able to provide information on the local structure of chaotic attractors [10]. This viewpoint allows the study of an attractor's topology together with the occurrence of critical transitions [11][12][13][14], paving the way for CLVs to be considered as precursors to such phenomena.
The previous exposition is traditionally related to model-based approaches, as it relies on the knowledge of a system's dynamical equations. However, studying the stability properties of observed data, where the equations are not necessarily known, is hard; only a few approaches exist, e.g. [15,16], relying on the delay-coordinate attractor reconstruction of Takens [17]. The recent breakthrough of data-driven (model-free) approaches poses a natural question: can we use the rich knowledge of dynamical systems theory in model-free approaches? Indeed, although at an early stage, the use of advanced machine learning (ML) techniques for complex systems has shown promising potential in applications ranging from weather and climate prediction and classification [18][19][20] to fluid flow prediction and optimization [21][22][23], among others. The overarching goal of this work is to propose a machine learning approach to accurately learn and infer the ergodic properties of prototypical chaotic attractors, and in particular to extract LEs and CLVs from data.
The recurrent neural networks (RNNs) constitute a promising type of ML to address chaotic behaviour. Thanks to their architecture, the RNNs are suitable for processing sequential data, typically encountered in speech and language recognition, or time-series prediction [24]. In particular, they are proven to be universal approximators [25,26] and are able to capture long-term temporal patterns, i.e. they possess memory. A key piece of their architecture is that they maintain a hidden state that evolves dynamically, effectively allowing the RNNs to be treated as dynamical systems, and in particular as discrete neural differential equations [27]. Thus, RNNs lend themselves to being analysed with dynamical systems theory, allowing the study of stability properties from the dynamics they have learned. By exploiting this here, we derive the RNN's Jacobian and infer the linear dynamics from data.
Recently, there have been significant advancements in employing RNNs to learn chaotic dynamics [28][29][30][31][32][33][34][35][36], where two core objectives are studied: (1) the time-accurate prediction of chaotic fluctuations and maximization of the prediction horizon and (2) accurately learning the ergodic properties of chaotic attractors. The first objective has been addressed by one of the co-authors in [34][35][36] for several prototypical chaotic dynamical systems using the same RNN architecture as the present work. Here we address the second objective by extending the recent works [29,30,32], where the LEs of the Lorenz 63 [1] and the one-dimensional Kuramoto-Sivashinsky equation [37] were retrieved from trained RNNs.
In this work, we employ a specific RNN architecture, a type of reservoir computer, the echo state network (ESN) [38], and train it with a diverse set of four prototypical chaotic attractors. The objective of this paper is twofold: first, the accurate learning and inference of the ergodic properties of the chaotic attractors by the ESN, which is assessed by thoroughly comparing the long-term statistics of (i) the degrees of freedom, (ii) LEs, (iii) finite-time LEs, and (iv) angles of the CLVs; second, the comparison of the distributions of (i) finite-time LEs and (ii) angles of CLVs on the topology of the attractor, which provides a strict test of the ESN's capability to accurately learn intrinsic chaotic properties.
The paper is organized as follows: Section 2 presents the necessary tools for our study. In particular, Sect. 2.1 provides a brief introduction to the relevant concepts and quantities from dynamical systems, such as LEs and CLVs. Then, Sect. 2.2 describes the architecture of the ESN, while Sect. 2.3 describes its validation strategies. Section 3 presents our main results, which are divided into two subsections: Sect. 3.1 is devoted to low-dimensional systems, namely the Lorenz 63 [1] and Rössler [39] attractors, and Sect. 3.2 shows results on the Charney-DeVore [40] and the Lorenz 96 [41] attractors. Finally, we summarize our results and provide future perspectives in the conclusions in Sect. 4. Appendix A presents the two algorithms to extract the LEs and CLVs from the ESN. Appendix B provides further tests on the robustness of the methodology.

Background
In the following two subsections, we summarize the key theory that underpins the stability of chaotic systems (Sect. 2.1) and reservoir computers (Sect. 2.2).

Stability of chaotic systems
We consider a state x(t) ∈ R^D with D degrees of freedom, which is governed by a set of nonlinear ordinary differential equations

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}), \qquad (1)$$

where f(x) : R^D → R^D is a smooth nonlinear function. Equation (1) defines an autonomous dynamical system. Hence, the dynamical system exists in a phase space of dimension D, equipped with a suitable metric, and is associated with a certain measure μ that we assume to be preserved (invariant). To investigate the stability of the dynamical system (1), we perturb the state by first-order perturbations as

$$\mathbf{x}(t) = \bar{\mathbf{x}}(t) + \epsilon\,\mathbf{u}(t), \qquad \epsilon \ll 1. \qquad (2)$$

By substituting decomposition (2) into (1) and collecting the first-order terms ∼ O(ε), we obtain the governing equation for the first-order perturbations (i.e. the linear dynamics)

$$\dot{\mathbf{u}} = \mathbf{J}(\mathbf{x}(t))\,\mathbf{u}, \qquad (3)$$

where J_{ij} = ∂f_i(x)/∂x_j ∈ R^{D×D} are the components of the Jacobian, J(x(t)), which is in general time-dependent. The perturbations u evolve on the linear tangent space at each point x(t). The goal of stability analysis is to compute the growth rate of infinitesimal perturbations, which is achieved by computing the Lyapunov exponents and a basis of the tangent space with the covariant Lyapunov vectors. To do so, we numerically time-march a set of K ≤ D tangent vectors, collected as the columns of U(t) = [u_1, ..., u_K] ∈ R^{D×K}, which evolve as

$$\dot{\mathbf{U}} = \mathbf{J}(\mathbf{x}(t))\,\mathbf{U}. \qquad (4)$$

Geometrically, Eq. (4) describes the tangent space around the state x(t). Starting from x(t = t_0) = x_0 and U(t_0) = I, Eqs. (1) and (4) are numerically solved with a time integrator. As explained in the subsequent paragraphs, in a chaotic system almost all nearby trajectories diverge exponentially fast with an average rate equal to the leading Lyapunov exponent. Hence, the tangent vectors align exponentially fast with the leading Lyapunov vector, u_1. ('Almost all' means that the set of perturbations that do not grow with the largest Lyapunov exponent has zero measure.) To circumvent this numerical issue, it is necessary to periodically orthonormalize the tangent space basis during time evolution, using a QR-decomposition of U, as U(t) = Q(t)R(t, Δt) (see [8,9]), and updating the columns of U with the columns of Q, i.e. U ← Q. The matrix R(t, Δt) ∈ R^{K×K} is upper triangular, and its diagonal elements [R]_{i,i} are the local growth rates, over a time span Δt, of the (now) orthonormal vectors U, which are also known as backward Gram-Schmidt vectors (GSVs) [10,42]. The Lyapunov spectrum is given by

$$\lambda_i = \lim_{N\to\infty} \frac{1}{N\,\Delta t} \sum_{n=1}^{N} \ln\,[\mathbf{R}(t_n, \Delta t)]_{i,i}. \qquad (5)$$

Algorithm 1 in Appendix A is a pseudocode for the calculation of the LEs for the ESN, following [29,32]. The signs of the Lyapunov exponents indicate the type of the attractor. If the leading exponent is negative, λ_1 < 0, the attractor is a fixed point. If λ_1 = 0 and the remaining exponents are negative, the attractor is a periodic orbit. If at least one Lyapunov exponent is positive, λ_1 > 0, the attractor is chaotic. In chaotic systems, the Lyapunov time τ_λ = 1/λ_1 defines a characteristic timescale for two nearby orbits to separate, which gives a scale of the system's predictability horizon [43].
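The procedure above, time-marching Eqs. (1) and (4) jointly with periodic QR re-orthonormalization, can be sketched in a few lines. The following is a minimal, self-contained illustration for the Lorenz 63 system used later in Sect. 3.1; the integrator, timestep, and integration lengths are our own illustrative choices, not necessarily those used in the paper:

```python
import numpy as np

# Parameters of the Lorenz 63 system, chosen in the chaotic regime.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def f(x):
    """Right-hand side of Lorenz 63, Eq. (1)."""
    return np.array([SIGMA * (x[1] - x[0]),
                     x[0] * (RHO - x[2]) - x[1],
                     x[0] * x[1] - BETA * x[2]])

def jac(x):
    """Jacobian J_ij = df_i/dx_j of Lorenz 63."""
    return np.array([[-SIGMA, SIGMA, 0.0],
                     [RHO - x[2], -1.0, -x[0]],
                     [x[1], x[0], -BETA]])

def lyapunov_spectrum(x0, dt=0.01, n_steps=100_000, n_spinup=10_000, m=10):
    """March Eqs. (1) and (4) jointly with RK4; re-orthonormalize the tangent
    basis U every m steps with a QR decomposition and accumulate ln diag(R)."""
    x, U = x0.astype(float), np.eye(x0.size)
    log_sum, t_tot = np.zeros(x0.size), 0.0
    for n in range(n_steps):
        ks = [(f(x), jac(x) @ U)]
        for c in (0.5, 0.5, 1.0):                    # RK4 stages
            xs, Us = x + c * dt * ks[-1][0], U + c * dt * ks[-1][1]
            ks.append((f(xs), jac(xs) @ Us))
        x = x + dt / 6 * (ks[0][0] + 2 * ks[1][0] + 2 * ks[2][0] + ks[3][0])
        U = U + dt / 6 * (ks[0][1] + 2 * ks[1][1] + 2 * ks[2][1] + ks[3][1])
        if (n + 1) % m == 0:
            Q, R = np.linalg.qr(U)
            U = Q                                    # U <- Q
            if n + 1 > n_spinup:                     # discard the transient
                log_sum += np.log(np.abs(np.diag(R)))
                t_tot += m * dt
    return log_sum / t_tot
```

With these settings the estimated spectrum approaches the known Lorenz 63 values (λ_1 ≈ 0.91, λ_2 = 0, λ_3 ≈ -14.57), and the sum of the exponents converges to the constant divergence -(σ + 1 + β) ≈ -13.67.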
The GSVs, U, constitute a norm-dependent orthonormal basis, which is not time-reversible, due to the frequent orthogonalizations via the QR decomposition. Instead, the covariant Lyapunov vectors (CLVs) form a norm-independent and time-invariant basis of the tangent space, which is covariant with the dynamics. The latter features of the CLVs, which are not possessed by the GSVs, allow us to examine individual expanding and contracting directions of a given dynamical system, thus providing an intrinsic geometrical interpretation of the attractor [6,10], as well as a hierarchical decomposition of spatiotemporal chaos, thanks to their generic localization in physical space [42]. Each bounded nonzero CLV, i.e. 0 < ||v_i|| < ∞, satisfies the following equation:

$$\dot{\mathbf{v}}_i = \mathbf{J}(\mathbf{x}(t))\,\mathbf{v}_i - \lambda_i\,\mathbf{v}_i, \qquad (6)$$

which shows that the CLV is evolved by the tangent dynamics J(x(t))v_i, while the extra term -λ_i v_i guarantees that its norm remains bounded [44]. The name "covariant" means that the ith CLV at time t_1, v_i(x(t_1)), maps to v_i(x(t_2)) at time t_2, and vice versa. Mathematically, if M(t, Δt) = exp(∫_t^{t+Δt} J(x(τ)) dτ) is the system's tangent evolution operator (where the exponential is path-ordered), covariance means M(t, Δt) v_i(t) = v_i(t + Δt); the time-invariance of the CLVs naturally arises from the previous expression, since inverting it gives v_i(t) = M^{-1}(t, Δt) v_i(t + Δt). If the Lyapunov spectrum is non-degenerate (such as for the cases considered here), each CLV v_i is associated with the Lyapunov exponent λ_i and is uniquely defined (up to a phase).
An important subclass of chaotic systems are uniformly hyperbolic systems, which have a uniform splitting between expanding and contracting directions, i.e. there are no tangencies between the unstable, neutral, and stable subspaces [45] that form the tangent space. Because of their simple geometrical structure, many theoretical tools have been developed for them in recent years. Hyperbolic systems have structurally stable dynamics and linear response, meaning that their statistics vary smoothly with parameter variations [46]. In practice, violations of hyperbolicity are commonly reported in the literature [44,47,48], whereas true hyperbolic systems are rare [49]. Thanks to the chaotic hypothesis [5,50,51], high-dimensional chaotic systems can be practically treated as hyperbolic systems, i.e. using techniques developed for hyperbolic systems, regardless of hyperbolicity violations. This is because many convenient statistical properties of uniformly hyperbolic systems, such as ergodicity, existence of physical invariant measures, exponential mixing and well-defined time averages with large deviation laws [52,53], can be found in the macroscopic scale dynamics of certain large non-uniformly hyperbolic systems [46].
An application of CLVs is to assess the degree of hyperbolicity of the underlying chaotic dynamics. The tangent space of hyperbolic systems, at each point x, can be directly decomposed into three invariant subspaces,

$$\mathcal{T}_{\mathbf{x}} = E^U_{\mathbf{x}} \oplus E^N_{\mathbf{x}} \oplus E^S_{\mathbf{x}}.$$

Here E^U_x is the unstable subspace spanned by the CLVs associated with positive LEs, E^N_x is the neutral subspace spanned by the CLVs associated with the zero LEs, and E^S_x is the stable subspace spanned by the CLVs associated with negative LEs. In hyperbolic systems, the distribution of angles between the subspaces is bounded away from zero. In Sect. 3, we will study in detail the angles θ_{U,N}, θ_{U,S}, and θ_{N,S} between pairs of the subspaces, and compare the ability of the ESN to accurately learn both the long-term statistics and the phase-space finite-time variability of the angles. Because the GSVs are mutually orthogonal, they cannot assess the degree of hyperbolicity of the attractor. Moreover, CLVs are key to the optimization of chaotic acoustic oscillations [44], as well as in reduced-order modelling [54]; they can reveal two uncoupled subspaces of the tangent space, one that comprises the physical modes carrying the relevant information of the trajectory, and another composed of strongly decaying spurious modes [10]. Two recent attempts to extract CLVs with data-driven approaches that do not employ a neural network can be found in [55,56].
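The angles between such subspaces are principal angles, which can be computed directly with scipy's `subspace_angles` routine. A small sanity check with hypothetical one-dimensional subspaces (the vectors below are stand-ins, not CLVs of any particular system):

```python
import numpy as np
from scipy.linalg import subspace_angles

# Columns of each array span a (hypothetical) subspace of the tangent space.
e1, e2, e3 = np.eye(3)                             # rows of the identity
E_U = np.stack([e1], axis=1)                       # "unstable" subspace
E_N = np.stack([(e1 + e2) / np.sqrt(2)], axis=1)   # "neutral" subspace
E_S = np.stack([e3], axis=1)                       # "stable" subspace

# subspace_angles returns the principal angles in radians.
deg_UN = np.degrees(subspace_angles(E_U, E_N))     # -> [45.]
deg_US = np.degrees(subspace_angles(E_U, E_S))     # -> [90.]
```

A near-zero principal angle between two of the subspaces signals a near-tangency, i.e. a violation of hyperbolicity at that point of the attractor.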
We explain the algorithm we employ to compute the CLVs; for further details, we refer the interested reader to [10,42,44]. The GSVs are generated by numerically solving Eqs. (1) and (4) simultaneously and performing a QR-decomposition every m timesteps. In this way, after a time lapse Δt, the GSVs at time t + Δt are given by

$$\mathbf{M}(t, \Delta t)\,\mathbf{Q}(t) = \mathbf{Q}(t + \Delta t)\,\mathbf{R}(t + \Delta t, \Delta t). \qquad (7)$$

We can define the CLVs V(t) in terms of the GSVs as

$$\mathbf{V}(t) = \mathbf{Q}(t)\,\mathbf{C}(t), \qquad (8)$$

where C is an upper triangular matrix that contains the CLV expansion coefficients, [C(t)]_{ji} = c_{j,i}(t), for j ≤ i. Hence, the objective is to calculate C(t). Because the CLVs have by choice a unit norm, each column of the matrix C has to be normalized independently, i.e. Σ_{j=1}^{i} (c_{j,i}(t))^2 = 1, ∀i. We start by writing the evolution equation of the CLVs as

$$\mathbf{M}(t, \Delta t)\,\mathbf{V}(t) = \mathbf{V}(t + \Delta t)\,\mathbf{D}(t, \Delta t), \qquad (9)$$

where D(t, Δt) is a diagonal matrix containing the local growth factors of the CLVs. We can re-write Eq. (9) via Eqs. (7) and (8) and solve with respect to C(t),

$$\mathbf{C}(t) = \mathbf{R}^{-1}(t + \Delta t, \Delta t)\,\mathbf{C}(t + \Delta t)\,\mathbf{D}(t, \Delta t). \qquad (10)$$

This equation is evolved backwards in time, starting from the end of the forward-in-time simulation. We employ the solve_triangular routine of scipy [57] to invert R(t, Δt) and solve with respect to C(t).
The C and D matrices are initialized to the identity matrix I. We leave a sufficient spin-up and spin-down transient time at the beginning and end of our total time window, before we compute the CLVs via Eq. (8), to ensure that they are converged. Algorithm 2 in Appendix A is a pseudocode for the calculation of the CLVs.
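A self-contained sketch of this forward-backward procedure (in the spirit of the algorithm described above) for the Lorenz 63 system; all names and numerical settings are illustrative choices of ours, not necessarily those of the paper:

```python
import numpy as np
from scipy.linalg import solve_triangular

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0
dt, n_spinup, n_store = 0.01, 4_000, 20_000

def f(x):
    return np.array([SIGMA * (x[1] - x[0]),
                     x[0] * (RHO - x[2]) - x[1],
                     x[0] * x[1] - BETA * x[2]])

def jac(x):
    return np.array([[-SIGMA, SIGMA, 0.0],
                     [RHO - x[2], -1.0, -x[0]],
                     [x[1], x[0], -BETA]])

def rk4(x, U):
    """One RK4 step of the coupled nonlinear (1) and tangent (4) dynamics."""
    ks = [(f(x), jac(x) @ U)]
    for c in (0.5, 0.5, 1.0):
        xs, Us = x + c * dt * ks[-1][0], U + c * dt * ks[-1][1]
        ks.append((f(xs), jac(xs) @ Us))
    x = x + dt / 6 * (ks[0][0] + 2 * ks[1][0] + 2 * ks[2][0] + ks[3][0])
    U = U + dt / 6 * (ks[0][1] + 2 * ks[1][1] + 2 * ks[2][1] + ks[3][1])
    return x, U

# Forward pass: QR every step (m = 1); store x, Q, R after the spin-up.
x, U = np.array([1.0, 1.0, 1.0]), np.eye(3)
xs, Qs, Rs = [], [], []
for n in range(n_spinup + n_store):
    x, U = rk4(x, U)
    Q, R = np.linalg.qr(U)
    s = np.sign(np.diag(R))
    U = Q * s                                 # keep diag(R) positive
    if n >= n_spinup:
        xs.append(x); Qs.append(U); Rs.append(s[:, None] * R)

# Backward pass: C stays upper triangular with unit-norm columns; the column
# norms before normalization give the D growth factors, hence the FTCLEs.
C = np.eye(3)
Cs = [None] * n_store
ftcle = np.zeros((n_store, 3))
for n in range(n_store - 1, -1, -1):
    Cs[n] = C
    X = solve_triangular(Rs[n], C)            # R^{-1} C, backwards in time
    nrm = np.linalg.norm(X, axis=0)
    ftcle[n] = -np.log(nrm) / dt              # finite-time covariant exponents
    C = X / nrm
```

The columns of `Qs[n] @ Cs[n]` are then the CLVs at step n; a convenient convergence check is that the neutral CLV aligns with the local flow direction f(x).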
To estimate the expansions and contractions of the tangent space over finite-time intervals of length Δt = t_2 - t_1, we compute the finite-time Lyapunov exponents (FTLEs) as Λ_i = (1/Δt) ln [R]_{i,i}. Hence, λ_i is the long-time average of Λ_i. The FTLE Λ_1 physically quantifies the exponential growth rate of a vector u_1 during the time interval Δt; therefore, Λ_2 quantifies the exponential growth rate of the vector u_2 that is orthogonal to u_1 by construction. Hence, as the GSVs form an orthogonal basis, looking at individual FTLEs Λ_i for i ≥ 2 lacks a physical meaning. Instead, the sum of the first n FTLEs is the growth rate, over Δt, of a typical n-dimensional volume Vol_n in the tangent space [9,58]:

$$\sum_{i=1}^{n} \Lambda_i = \frac{1}{\Delta t}\,\ln \frac{\mathrm{Vol}_n(t + \Delta t)}{\mathrm{Vol}_n(t)}. \qquad (11)$$

Accordingly, the diagonal matrix D(t, Δt) contains the local growth factors of the CLVs over Δt. We can extract the finite-time covariant Lyapunov exponents (FTCLEs) from the logarithm of these growth factors for a time interval Δt,

$$\Lambda^c_i = \frac{1}{\Delta t}\,\ln\,[\mathbf{D}(t, \Delta t)]_{i,i}. \qquad (12)$$

Each FTCLE quantifies a finite-time exponential expansion or contraction rate along the covariant direction given by v_i. Hence, each individual FTCLE has a physical interpretation, in contrast to the FTLEs, as explained before. On the other hand, the sums of FTCLEs now lack a physical meaning [58]. The long-time average of the FTCLEs is equal to the Lyapunov exponents,

$$\lambda_i = \lim_{T \to \infty} \frac{1}{T} \int_0^T \Lambda^c_i\;dt. \qquad (13)$$
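The volume interpretation of the summed FTLEs can be verified directly on a toy example: for a single linear propagation step, the sum of the first n FTLEs obtained from the diagonal of the QR factor equals the logarithmic growth rate of the n-dimensional volume spanned by the propagated vectors. Here a random matrix is a hypothetical stand-in for the tangent propagator:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))                    # stand-in tangent propagator
U = np.linalg.qr(rng.standard_normal((4, 4)))[0]   # orthonormal starting basis
dt = 0.1                                           # illustrative time span

Q, R = np.linalg.qr(M @ U)
ftle = np.log(np.abs(np.diag(R))) / dt             # finite-time Lyapunov exponents

def vol(A):
    """n-volume of the parallelepiped spanned by the columns of A."""
    return np.sqrt(np.linalg.det(A.T @ A))         # Gram determinant

# Sum of the first n FTLEs equals the n-volume growth rate, for every n.
for n in range(1, 5):
    growth = np.log(vol((M @ U)[:, :n]) / vol(U[:, :n])) / dt
    assert np.isclose(ftle[:n].sum(), growth)
```

The individual FTLEs beyond the first depend on the orthogonalized basis, but their partial sums are basis-independent volume growth rates, which is exactly what the identity above checks.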

Echo state network
The solution of a dynamical system is a time series. From a data analysis point of view, a time series is a sequentially ordered set of values, in which the order is provided by time. In a discrete setting, time can be thought of as an ordering index. For sequential data, and hence time series, recurrent neural networks (RNNs) are designed to infer the temporal dynamics through their internal hidden state. However, training RNNs, such as long short-term memory (LSTM) networks [59] and gated recurrent units (GRUs) [60], requires backpropagation through time, which can be a demanding computational task due to the long-lasting time dependencies of the hidden states [61]. This issue is overcome by echo state networks (ESNs) [38,62], RNNs of the reservoir-computing type, in which the recurrent weights of the hidden state (commonly named the "reservoir") are randomly assigned and have low connectivity. Therefore, only the hidden-to-output weights are trained, leading to a simple quadratic optimization problem, which does not require backpropagation (see Fig. 1a for a graphical representation). The reservoir acts as a memory of the observed state history. ESNs have demonstrated accurate inference of chaotic dynamics, such as in [28][29][30][31][32][33][34][35][36][63]. An echo state network maps the state from time index t_i to index t_{i+1} as follows (with a slight abuse of notation, the discrete time is denoted t_i). The evolution equations of the reservoir state and of the output are governed, respectively, by [36,38]

$$\mathbf{r}(t_{i+1}) = \tanh\!\left( \mathbf{W}_{in}^T\,[\hat{\mathbf{y}}_{in}(t_i);\, b_{in}] + \mathbf{W}^T\,\mathbf{r}(t_i) \right), \qquad (14)$$

$$\mathbf{y}_p(t_{i+1}) = [\mathbf{r}(t_{i+1});\, 1]^T\,\mathbf{W}_{out}, \qquad (15)$$

where at any discrete time t_i the input vector, y_in(t_i) ∈ R^{N_y}, is mapped into the reservoir state r ∈ R^{N_r} by the input matrix, W_in ∈ R^{(N_y+1)×N_r}, where N_r ≫ N_y. The updated reservoir state r(t_{i+1}) is calculated at each time iteration as a function of the current input y_in(t_i) and of its previous value r(t_i) via Eq. (14), and is then used in the calculation of the predicted output, y_p(t_{i+1}) ∈ R^{N_y}, via Eq. (15).
Here, (ˆ·) indicates normalization by the maximum-minus-minimum range of y_in in the training set, component-wise; (·^T) indicates matrix transposition; (;) indicates array concatenation; W ∈ R^{N_r×N_r} is the state matrix; b_in is the input bias; and W_out ∈ R^{(N_r+1)×N_y} is the output matrix. In our applications, the dimension of the input and output vectors is equal to the dimension of the physical system of Eq. (1), i.e. N_y ≡ D.
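A minimal construction of the matrices and of the update maps of Eqs. (14)-(15); the numerical values of the hyperparameters are illustrative only, and the input normalization is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)
N_y, N_r = 3, 100              # physical and reservoir dimensions (N_r >> N_y)
sigma_in, b_in, d, rho = 0.1, 1.0, 3, 0.9

# W_in: one nonzero element per reservoir neuron (one per column here),
# sampled uniformly in [-sigma_in, sigma_in].
W_in = np.zeros((N_y + 1, N_r))
W_in[rng.integers(0, N_y + 1, size=N_r), np.arange(N_r)] = \
    rng.uniform(-sigma_in, sigma_in, size=N_r)

# W: sparse Erdos-Renyi state matrix with average connectivity d, rescaled so
# that its spectral radius equals rho < 1 (echo state property).
W = rng.uniform(-1, 1, size=(N_r, N_r)) * (rng.random((N_r, N_r)) < d / N_r)
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))

def step(r, y):
    """Reservoir update, Eq. (14): the input is concatenated with the bias b_in."""
    return np.tanh(W_in.T @ np.concatenate([y, [b_in]]) + W.T @ r)

def readout(r, W_out):
    """Output map, Eq. (15): the appended 1 multiplies the output-bias row."""
    return np.concatenate([r, [1.0]]) @ W_out
```

Only `W_out` is trained; `W_in` and `W` stay fixed after this random construction, which is what makes the training a quadratic problem.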
The matrices W_in and W are (pseudo)randomly generated and fixed, whilst the weights of the output matrix, W_out, are the only trainable parameters of the network. The input matrix, W_in, has only one element different from zero per row, which is sampled from a uniform distribution in [-σ_in, σ_in], where σ_in is the input scaling. The state matrix, W, is an Erdős-Rényi matrix with average connectivity d, in which each neuron (each row of W) has on average only d connections (i.e. nonzero elements), which are obtained by sampling from a uniform distribution in [-1, 1]. The echo state property enforces the independence of the reservoir state from the initial conditions, and it is satisfied by rescaling W by a multiplicative factor such that the absolute value of its largest eigenvalue, i.e. the spectral radius [38], is smaller than unity. Following [29,36,63,64], we add a bias in the input and output layers to break the inherent symmetry of the basic ESN architecture. Specifically, the input bias, b_in, is a hyperparameter, selected in order to have the same order of magnitude as the normalized inputs, ŷ_in. Differently, the output bias is determined by training the weights of the output matrix, W_out.
In Fig. 1b and c, we present the two configurations in which the ESN can run, i.e. open loop and closed loop, respectively. Running in open loop is necessary for the training stage, as the input data are fed in at each step, allowing for the calculation of the reservoir time series r(t_i), t_i ∈ [0, T_train], which need to be stored. There is an initial transient time window, the "washout interval", in which the output y_p(t_i) is not computed. This allows the reservoir state to satisfy the echo state property, i.e. to become independent of the arbitrarily chosen initial condition, r(t_0) = 0, while also synchronizing it with the current state of the system. The training of the output matrix, W_out, is performed after the washout interval and involves the minimization of the mean square error between the outputs and the data over the training set,

$$\mathrm{MSE} = \frac{1}{N_{tr}+1} \sum_{i=0}^{N_{tr}} \left|\left| \mathbf{y}_p(t_i) - \mathbf{y}_{in}(t_i) \right|\right|^2, \qquad (16)$$

where ||·|| is the L_2 norm, N_tr + 1 is the total number of data points in the training set, and y_in is the input data on which the ESN is trained. Training the ESN is performed by solving for W_out the ridge regression problem

$$\left( \mathbf{R}\mathbf{R}^T + \beta\,\mathbf{I} \right) \mathbf{W}_{out} = \mathbf{R}\,\mathbf{Y}_d^T, \qquad (17)$$

where R ∈ R^{(N_r+1)×N_tr} and Y_d ∈ R^{N_y×N_tr} are the horizontal concatenations of the reservoir states (with bias) and of the output data, respectively; I is the identity matrix and β is the Tikhonov regularization parameter [65].
On the other hand, in the closed-loop configuration (Fig. 1c), the output y_p at time step t_i is used as the input at time step t_{i+1}, in a recurrent manner, allowing for the autonomous temporal evolution of the network. The closed-loop configuration is used for validation (i.e. hyperparameter tuning, see Sect. 2.3) and testing, but not for training. For our purposes, we independently train N_ESN ∈ [5, 10] networks, over which we take the ensemble average to increase the statistical accuracy of the prediction and to evaluate its uncertainty. We start with N_ESN = 10 trained networks, but during postprocessing we may discard any network that shows a spurious temporal evolution. The N_ESN networks are statistically independent thanks to (1) initializing the random matrices W_in and W with different seeds, and (2) training each network with chaotic time series starting from different initial points on the attractor.

Jacobian of the ESN
In this subsection, we mathematically derive the Jacobian of the echo state network. Equations (14)-(15) are a discrete map [2,32], and the continuous-time formulae derived for the Lyapunov exponents and CLVs in Sect. 2.1 can be adapted to a discrete-time system. The Jacobian of the ESN reservoir is the total derivative of the hidden-state dynamics over a single timestep [29],

$$\mathbf{J}(t_{i+1}) = \frac{d\mathbf{r}(t_{i+1})}{d\mathbf{r}(t_i)} = \mathrm{diag}\!\left( 1 - \tilde{\mathbf{r}}^2(t_{i+1}) \right)\left( \mathbf{W}_{in}^T \mathbf{W}_{out}^T + \mathbf{W}^T \right),$$

where, from Eq. (14), r̃²(t_{i+1}) is the updated squared hidden state at timestep t_{i+1} (the square is taken element-wise). The Jacobian of the ESN is cheap to calculate because the expression W_in^T W_out^T + W^T is a constant matrix, which is fixed after the training of W_out. The only time-varying component is the hidden state. The Jacobian J ∈ R^{N_r×N_r} is used for the extraction of the Lyapunov spectrum and the CLVs of a trained ESN. We time-march D Lyapunov vectors u_i ∈ R^{N_r} and periodically perform QR decompositions, where Q ∈ R^{N_r×D} and R ∈ R^{D×D}. The same CLV algorithm described in Sect. 2.1 is employed to extract D covariant Lyapunov vectors v_i ∈ R^{N_r} from a trained ESN. The pseudocode is given in Algorithm 2.
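Because W_in^T W_out^T + W^T is fixed after training, evaluating the Jacobian along a trajectory costs only one element-wise product per timestep. The following self-contained sketch uses random stand-in matrices for a trained network; the biases and the input normalization are dropped, since constants do not enter the Jacobian, and a column-vector convention is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
N_y, N_r = 3, 100
W_in = 0.1 * rng.standard_normal((N_y, N_r))   # input matrix (bias row dropped)
W = 0.05 * rng.standard_normal((N_r, N_r))     # reservoir state matrix
W_out = rng.standard_normal((N_r, N_y))        # "trained" output matrix (no bias)

A = W_in.T @ W_out.T + W.T                     # constant once W_out is trained

def F(r):
    """One closed-loop step: the predicted output W_out^T r is fed back."""
    return np.tanh(A @ r)

def esn_jacobian(r_next):
    """J = diag(1 - r~^2) (W_in^T W_out^T + W^T), with r~ the updated state."""
    return (1.0 - r_next**2)[:, None] * A

r = 0.5 * rng.standard_normal(N_r)
J = esn_jacobian(F(r))                         # Jacobian at the next state
```

The broadcasted product `(1 - r_next**2)[:, None] * A` is the diagonal scaling written without forming the diagonal matrix, which is the cheap per-timestep operation mentioned in the text.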

Validation
The dataset is split into three subsets, namely the training, validation, and testing subsets, in a time-ordered fashion. During training, the ESN runs in open loop, while during validation and testing the ESN runs in closed loop, and the prediction at each step becomes the input for the next step. After training the ESN, validation is necessary for the determination of the hyperparameters. The objective is to compute the hyperparameters that minimize the logarithm of the MSE (16). The logarithm of the MSE is preferred because the error varies by orders of magnitude across different hyperparameters, as explained in [36]. In general, instead of Eq. (16), other types of error functions can be used for the hyperparameter tuning, such as the maximization of the prediction horizon [29,34,36] or the minimization of kinetic energy differences [64].
Here the input scaling, σ_in, the spectral radius, ρ, and the Tikhonov parameter, β, are the ESN hyperparameters that are being tuned [38,64]. In order to select them, we follow [66]. We set b_in = 1, d = 3, and add Gaussian noise with zero mean and standard deviation σ_n = 0.0005 σ_y, where σ_y is the component-wise standard deviation of the data, to the training and validation data. Adding noise to the data improves the performance of ESNs in chaotic dynamics by alleviating overfitting [32]. A summary of the hyperparameters is shown in Table 1.
One of the most commonly used validation strategies for RNNs is the single shot validation (SSV) [67], in which the data are split into a training set followed by a single small validation set; see Fig. 2a. As the ESN now runs in closed loop, the size of the validation set is limited by the chaotic nature of the signal. In particular, at the beginning of the validation set, the input y(t_0) of the ESN is initialized to the target value. However, chaos causes the predicted trajectory to diverge quickly from the target trajectory, within a few Lyapunov times τ_λ. The validation interval is therefore small and not representative of the full training set, which causes poor performance in the test set [64]. An improvement in performance at a cheap computational cost is achieved by the recycle validation (RV), which was recently proposed in [64]. In the RV, the network is trained only once on the entire training dataset (in open loop), and validation is performed on multiple intervals already used for training (but now in closed loop); see Fig. 2b. (Fig. 2: schematic representation of a the single shot and b the recycle validation strategies; y represents the degrees of freedom of the data, and three sequential validation intervals are shown for the recycle validation [64].) In this work, we use the chaotic recycle validation (RVC), where the validation interval shifts along the training data and its length is a small multiple of the Lyapunov time, N_val = 3τ_λ.

Results
In this section, we present the numerical results, which include a thorough comparison between the statistics produced by the autonomous temporal evolution of the ESN and those of the target dynamical system. The selected observables are the statistics of the degrees of freedom, the Lyapunov exponents, the angles between the CLVs or between subspaces composed of CLVs, and the finite-time covariant Lyapunov exponents. We separate our analysis into two subsections, the first containing two low-dimensional systems and the second two higher-dimensional systems.

Low-dimensional chaotic systems
As a first case, we consider two low-dimensional dynamical systems that exhibit chaotic behaviour: the Lorenz 63 (L63) [1] and Rössler [39] attractors. The Lorenz 63 system is a reduced-order model of atmospheric convection for a single thin layer of fluid that is heated uniformly from below and cooled from above, and is defined by

$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z.$$

We choose the parameters [σ, β, ρ] = [10, 8/3, 28] to ensure chaotic behaviour. The Rössler attractor, which models the dynamics of chemical reactions, is governed by

$$\dot{x} = -y - z, \qquad \dot{y} = x + ay, \qquad \dot{z} = b + z(x - c).$$

We choose the parameters [a, b, c] = [0.1, 0.1, 14] to ensure chaotic behaviour.
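Both systems can be written down and integrated in a few lines. The sketch below (our own illustrative code, with the parameter values above) integrates the Rössler system with an RK4 step and discards an initial transient so that the stored trajectory lies on the attractor:

```python
import numpy as np

def lorenz63(x, sigma=10.0, beta=8.0 / 3.0, rho=28.0):
    """Lorenz 63 right-hand side with the parameters used in the text."""
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

def rossler(x, a=0.1, b=0.1, c=14.0):
    """Rössler right-hand side with the parameters used in the text."""
    return np.array([-x[1] - x[2], x[0] + a * x[1], b + x[2] * (x[0] - c)])

def rk4_step(f, x, dt):
    k1 = f(x); k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2); k4 = f(x + dt * k3)
    return x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, x = 0.005, np.array([1.0, 1.0, 1.0])
traj = []
for n in range(40_000):            # 200 time units in total
    x = rk4_step(rossler, x, dt)
    if n >= 10_000:                # discard the initial transient
        traj.append(x)
traj = np.array(traj)
```

Swapping `rossler` for `lorenz63` in the loop generates the L63 target data in the same way.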
To generate the target set, we evolve the dynamical systems forward in time with a fourth-order Runge-Kutta (RK4) integrator and a timestep dt = 0.005 for both L63 and Rössler, which is sufficiently small for a good temporal resolution. (We tested slightly larger and smaller timesteps with no significant differences; results not shown.) We perform a QR decomposition every timestep (m = 1) for L63 and every m = 5 timesteps for Rössler. For all systems, we generate a training set of size 1000τ_λ and a test set of size 4000τ_λ, long enough for the CLV statistics to converge, where τ_λ = 1/λ_1 is the Lyapunov time, i.e. the inverse of the maximal Lyapunov exponent λ_1.
First, we test whether the ESN correctly learns the chaotic attractor from a statistical point of view, i.e. whether the ESN correctly learns the long-term statistics of the degrees of freedom when it evolves in the closed-loop (autonomous) mode. By estimating the probability density function (PDF) of the degrees of freedom of the ESNs, as a normalized histogram, and comparing it with the corresponding PDF of the target set, we extract information on the invariant measure of the considered chaotic system. This is shown in Fig. 3. Second, we test whether the ESN correctly learns the Lyapunov spectrum. Table 2 shows the ESN predictions of the Lyapunov exponents for the L63 and Rössler attractors, which are compared with the target exponents. The leading exponent is accurately predicted, with a 0.2% error for L63 and a 1.5% error for the Rössler system. In continuous-time chaotic systems, there exists a neutral Lyapunov exponent, which is associated with the direction of dx/dt. In these cases, the neutral Lyapunov exponents are λ_2 = 0 for both systems, which are correctly inferred by the ESN within an error of O(10^{-5}) or less. For the smallest, negative exponent, which is generally harder to extract because it is highly damped, the relative error is about 0.6% for L63 and 2.1% for Rössler. Therefore, the ESNs can accurately capture the tangent dynamics of a low-dimensional chaotic attractor.
Third, we investigate the angles between the CLVs. We assess whether the ESNs learn the long-term statistics of these quantities, and also whether they correctly infer the distribution and fluctuations of these observables in phase space; in other words, whether the ESNs learn the geometrical structure of the attractor and its tangent space. In Fig. 4, we present an analysis of the distribution of the principal angles between the CLVs, θ_{a,b} ∈ [0°, 90°], on the topology of the L63 attractor. The attractor is well reproduced by a selected ESN (middle column), compared with the target (left column). The size of both trajectories is 300τ_λ. In this case, there are three principal angles between the CLVs: θ_{U,N} is the angle between the unstable and neutral CLVs; θ_{U,S} is the angle between the unstable and stable CLVs; θ_{N,S} is the angle between the neutral and stable CLVs. The colouring of the attractor is associated with the measured θ_{a,b}. The black and dark red colours identify small angles, i.e. regions of the attractor where near-tangencies between the CLVs occur. Possible tangencies between CLVs, or between invariant manifolds composed of CLVs (as will be discussed later for higher-dimensional chaotic systems), are of significant importance, as they signify that the attractor is non-hyperbolic [10] (see Sect. 2.1). The right column shows the mean absolute difference between the target and the ESN. The x, y, z domain is discretized with 50 bins in each direction; the mean θ_{a,b} is then calculated in each three-dimensional bin over the 300τ_λ long trajectory. Finally, the absolute difference between ESN and target is calculated for each bin. The plots follow the same colour scheme as the colourbar, with black and dark red colours indicating differences < 2° and a maximum of ∼ 10°. Figure 4 shows that the ESN is able to accurately learn the dynamics of the tangent linear space of the attractor. In Fig. 5, we show the PDFs of the principal angles between the three CLVs, for which there is agreement between target and ESN results in all cases for both L63 and Rössler, even for small angles. The nonzero count of events for θ → 0 indicates that the two considered systems are non-hyperbolic, which is consistent with the literature [58].
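The per-bin averaging of the principal angles over the attractor, described above, can be sketched with `scipy.stats.binned_statistic_dd`; the trajectories and angles below are random placeholders, not actual CLV angles, and the helper name is our own.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

def binned_mean_angle(traj, theta, bins, box):
    """Mean principal angle per 3-D bin of the (x, y, z) trajectory."""
    stat, _, _ = binned_statistic_dd(traj, theta, statistic="mean",
                                     bins=bins, range=box)
    return stat  # NaN in bins the trajectory never visits

# Random placeholders for the target and ESN trajectories and angles.
rng = np.random.default_rng(1)
traj_target = rng.uniform(-1, 1, size=(20_000, 3))
traj_esn = rng.uniform(-1, 1, size=(20_000, 3))
theta_target = rng.uniform(0, 90, size=20_000)
theta_esn = rng.uniform(0, 90, size=20_000)

box = [(-1, 1)] * 3
m_t = binned_mean_angle(traj_target, theta_target, bins=10, box=box)
m_e = binned_mean_angle(traj_esn, theta_esn, bins=10, box=box)

# Mean absolute difference over bins visited by both trajectories.
mask = ~np.isnan(m_t) & ~np.isnan(m_e)
mad = np.mean(np.abs(m_t[mask] - m_e[mask]))
```

Here 10 bins per direction are used to keep the toy example light; the analysis in the text uses 50 bins per direction over a 300τ_λ trajectory.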
Fourth, we analyse the distribution on the attractor, as well as the statistics, of the finite-time covariant Lyapunov exponents (FTCLEs), over a time lapse of Δt = m dt timesteps, and assess the accuracy of the trained ESNs. For the considered low-dimensional systems, there are three FTCLEs, each giving the finite-time growth rate of the corresponding covariant Lyapunov vector.
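A minimal sketch of how one FTCLE can be computed, assuming access to the one-step tangent maps (Jacobians) over the window and to the CLV at its start; the `ftcle` helper and its interface are illustrative choices, not the paper's implementation.

```python
import numpy as np

def ftcle(jacobians, v, dt):
    """Finite-time covariant Lyapunov exponent over a window of m timesteps.

    `jacobians` is a list of one-step tangent maps J_k (hypothetical inputs);
    `v` is the covariant Lyapunov vector at the start of the window. The FTCLE
    is the average exponential growth rate of v over the window Dt = m*dt.
    """
    m = len(jacobians)
    w = v.copy()
    for J in jacobians:
        w = J @ w
    return np.log(np.linalg.norm(w) / np.linalg.norm(v)) / (m * dt)

# Toy example: a constant diagonal tangent map with known growth rate 0.5.
dt = 0.01
J = np.diag(np.exp(np.array([0.5, -1.0]) * dt))
v = np.array([1.0, 0.0])
lam = ftcle([J] * 10, v, dt)   # recovers the growth rate 0.5
```

Averaging such window-wise growth rates over a long trajectory recovers the corresponding Lyapunov exponent, consistent with the first-moment property discussed later.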
In Fig. 6, we visualize the distribution of the single-timestep FTCLEs on the Rössler attractor, which is well reproduced by a selected ESN (middle column), compared with the target (left column). The size of both trajectories is 300τ_λ. FTCLE 1 is the finite-time exponent of the unstable CLV, FTCLE 2 of the neutral CLV, and FTCLE 3 of the stable CLV. The colouring is associated with the values of the FTCLEs. Large positive FTCLEs correspond to high finite-time growth rates and, thus, reduced predictability. The distribution of the leading FTCLE on the attractor is similar between the target and the ESN. The second and third FTCLEs, which correspond to the neutral and stable CLVs, respectively, also show good agreement between the two. The mean difference between the target and the ESN on the attractor is plotted in the right column, in which black identifies differences in Λ^c_i close to zero. The right column shows that most of the small differences between the ESN and the target are located in the region of large variation of z.
Finally, Fig. 7 shows the PDFs of the three FTCLEs. There is agreement between the ESN-inferred quantities and the target in all cases, in particular for the most probable statistics of the Rössler attractor. The small deviation in Fig. 7a for L63 concerns the statistics around the peak of the first FTCLE, Λ^c_1, but the tails of the distributions are well reproduced. The mean of the Λ^c_i distributions coincides with the LEs λ_i, which holds true for all our results. A behaviour as in Fig. 7a implies that, in this case, the finite-time values Λ^c_1 are less peaked around the mean value, even though their long-time average coincides with the Lyapunov exponent λ_1. Nevertheless, in Fig. 7d-f for Rössler, the statistics around the peak (and beyond) are well captured. We refer the interested reader to the supplementary material, where the corresponding results of Figs. 4 and 6 for both attractors are shown. The statistics of the FTLEs, as well as their distribution on the chaotic attractors, are also presented in the supplementary material.

Higher-dimensional chaotic systems
We follow the same analysis and approach as in Sect. 3.1 for two higher-dimensional chaotic systems, both of which are related to atmospheric physics and meteorology. The first is a reduced-order model of atmospheric blocking events by Charney and DeVore [40] (CdV), which is a six-dimensional truncation of the equations for barotropic flow with orography. We employ the formulation of [68,69], which is forced by a zonal flow profile that can be barotropically unstable.
The governing equations and the model coefficients are given by Eqs. (22) and (23), following the formulation of [68,69]. In particular, the CdV model allows for two metastable states: the so-called "zonal" state, which represents the approximately zonally symmetric jet stream in the midlatitude atmosphere, and the "blocked" state, which refers to a diverse class of weather patterns that are a persistent deviation from the zonal state. Blocking events are known to be associated with regional extreme weather, from heatwaves in summer to cold spells in winter [70]. The dynamical properties of CLVs in connection with blocking events were recently investigated for a series of atmospheric models more complex than CdV [11,12,71], which demonstrated that CLVs are good candidates for blocking precursors, as well as a good basis for model reduction. In previous work [34], the CdV system was used as a training model for the ESN, with the purpose of studying accurate short-term prediction of chaos and quantifying the benefit of physics-informed echo state networks.
The second higher-dimensional system that we consider is the Lorenz 96 (L96) model [41], which is a system of coupled ordinary differential equations that describes the large-scale behaviour of the mid-latitude atmosphere and the transfer of a scalar atmospheric quantity. Three characteristic processes of atmospheric systems (advection, dissipation, and external forcing) are included in the model, whose equations are dx_i/dt = (x_{i+1} − x_{i−2}) x_{i−1} − x_i + F, for i = 1, …, D. We set periodic boundary conditions, i.e. x_1 = x_{D+1}. In our analysis, we choose D = 20 degrees of freedom. The external forcing is set to F = 8, which ensures a chaotic evolution [32]. We integrate the system with RK4 and dt = 0.01. We perform a QR decomposition every m = 5 timesteps for CdV and every m = 10 timesteps for L96. As in the previous section, we generate a training set of size 1000τ_λ and a test set of size 4000τ_λ.
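The L96 setup described above can be reproduced in a few lines; this is a sketch with our own helper names, using the standard L96 tendency, an RK4 step, and the stated D = 20, F = 8 and dt = 0.01.

```python
import numpy as np

def l96_rhs(x, F=8.0):
    """Lorenz 96 tendency: advection, linear dissipation and constant forcing,
    with periodic boundary conditions x_1 = x_{D+1}."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(f, x, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

D, dt = 20, 0.01
rng = np.random.default_rng(0)
x = 8.0 + 1e-3 * rng.standard_normal(D)  # small perturbation of the fixed point x_i = F
for _ in range(5000):                    # discard the transient towards the attractor
    x = rk4_step(l96_rhs, x, dt)
```

The `np.roll` calls implement the cyclic indexing, so no explicit ghost cells are needed for the periodic boundary conditions.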
First, Fig. 8 shows the PDFs of the six degrees of freedom of CdV, and of the first six of L96 (the PDFs of the remaining 14 degrees of freedom have a similar shape and agreement between ESN and target). We use a semilogarithmic scale to emphasize that the agreement between target (black line) and ESN (red dashed line) is accurate in the tails of the distributions, which effectively correspond to the edges of each attractor. As in Sect. 3.1, in order to evaluate uncertainty and robustness, we start with N_ESN = 10 trained networks, but during postprocessing we discard any network that shows spurious temporal evolution, and then average the PDFs of each network's observables. Therefore, the PDFs of Fig. 8 are the outcome of averaging N_ESN = 5 and N_ESN = 9 PDFs with the same binning, for CdV and L96, respectively.
Second, Fig. 9 shows the Lyapunov spectra of (a) CdV and (b) L96 with D = 20, comparing the target (black squares) with the ESN prediction (red circles). The CdV model has a single positive Lyapunov exponent; the average over 5 ESNs is λ_1 = 0.0214, while for the 5 independent target sets λ_1^targ = 0.0232, an 8% absolute error. The second Lyapunov exponent is zero (to numerical error) and corresponds to the neutral direction, with λ_2 = −3×10^−5 for the ESN and a value similarly close to zero for the target. Overall, excluding λ_2, the mean absolute error of the CdV Lyapunov spectrum is 3.7%, which is small.
With respect to the L96 Lyapunov spectrum in Fig. 9b, the agreement between target and ESN across all 20 exponents is good. In particular, there are 6 positive, 1 zero and 13 negative exponents. The maximal exponent predicted from the ensemble of N_ESN = 9 ESNs is λ_1 = 1.551, in close agreement with the value obtained from the 9 independent target sets. The most negative exponents correspond to directions in tangent space that decay exponentially fast, and the accuracy that the ESN achieves for them is consistent with this strong damping. For L96, the mean absolute error of the Lyapunov spectrum is approximately 0.5%.
To further elaborate, L96 is known to be an extensive system [72,73], which means that quantities such as the surface width, the entropy and the attractor dimension scale linearly with its dimensionality D; for the Lyapunov spectrum, this means that the spectrum converges to a limiting shape when the exponent index is rescaled by D. Third, we investigate the statistics of the principal angles, θ ∈ [0°, 90°], between the three subspaces that partition the tangent space, namely the unstable E^U_x, neutral E^N_x and stable E^S_x subspaces, spanned by the corresponding CLVs. The extraction of the principal angles between two linear subspaces requires a singular value decomposition of the matrix product Γ_{a,b} = (E^a_x)^T E^b_x (assuming the CLVs are stacked as columns, ordered according to their Lyapunov exponents), because the pairwise products between the CLVs spanning the subspaces do not provide all the angles [10,74]. The angles are given by Eq. (25) in terms of the singular values of Γ_{a,b}, and we analyse the minimum angle. Here, we use the routine scipy.linalg.subspace_angles of the scipy package [57] in python and analyse the minimum angle in order to track homoclinic tangencies between the subspaces. This implementation is based on the algorithm presented in [75], which has improved accuracy with respect to Eq. (25) in the estimation of small angles. In Fig. 10, we study the PDFs of the three principal angles between the linear subspaces for CdV and L96. In CdV, the unstable and neutral subspaces are each spanned by the corresponding single CLV, while the stable subspace is spanned by the remaining four CLVs, for which λ_i < 0. In L96 with D = 20, the unstable subspace is spanned by the first six CLVs, the neutral subspace is spanned only by the 7th CLV, and the stable subspace is spanned by the remaining 13 CLVs. Focusing on Fig. 10a-c for CdV, we notice that this system is non-hyperbolic because the PDFs are populated close to θ → 0. Specifically, in Fig. 10a the binning is geometrically spaced and denser close to θ → 0.
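The use of scipy.linalg.subspace_angles mentioned above can be illustrated as follows; the CLV matrices here are random placeholders with the CdV subspace dimensions (one unstable CLV, four stable CLVs), not actual CLVs.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(2)

# Hypothetical CLVs stacked as columns, ordered by their Lyapunov exponents:
# for CdV, the unstable subspace is spanned by one CLV and the stable
# subspace by the remaining four.
E_U = rng.standard_normal((6, 1))
E_S = rng.standard_normal((6, 4))

# All principal angles between the two subspaces, in radians,
# returned in descending order.
angles = subspace_angles(E_U, E_S)

# We track the minimum angle to detect (near-)tangencies between subspaces.
theta_min_deg = np.degrees(angles.min())
```

The number of principal angles equals the smaller of the two subspace dimensions, so a single angle is returned here; for L96 the unstable-stable pair would yield min(6, 13) = 6 angles.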
Interestingly, the PDF of θ_{U,N} for CdV follows a power law at small angles, PDF(θ) ∼ θ^−α for θ → 0 and up to ≈ 10°, before it saturates. A different shape, which is still highly non-hyperbolic, is shown by the PDFs of θ_{U,S} and θ_{N,S} in Fig. 10b and c, in which the binning is linear and both axes are in logarithmic scale. Figure 10d-f shows the same statistics for L96, which is also non-hyperbolic, as there is a high frequency of tangencies, θ → 0. In all plots of Fig. 10, the agreement of the subspace-angle statistics between target and ESN is good, which demonstrates that the ESN has achieved a robust and accurate learning of the ergodic properties from higher-dimensional data.
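The power-law behaviour of the PDF near θ → 0 can be quantified with a log-log least-squares fit on geometrically spaced bins, as sketched below on synthetic angles with a known exponent; the `powerlaw_exponent` helper is an illustrative assumption, not the procedure used for Fig. 10.

```python
import numpy as np

def powerlaw_exponent(theta, theta_min=0.01, theta_max=10.0, nbins=20):
    """Estimate alpha in PDF(theta) ~ theta**(-alpha) for theta -> 0 by a
    log-log least-squares fit on geometrically spaced bins."""
    edges = np.geomspace(theta_min, theta_max, nbins + 1)
    pdf, _ = np.histogram(theta, bins=edges, density=True)
    centres = np.sqrt(edges[:-1] * edges[1:])      # geometric bin centres
    keep = pdf > 0                                 # skip empty bins
    slope, _ = np.polyfit(np.log(centres[keep]), np.log(pdf[keep]), 1)
    return -slope

# Synthetic check: sample angles with PDF ~ theta**(-0.5) on (0, 10] by
# inverse-transform sampling, then recover alpha = 0.5.
rng = np.random.default_rng(3)
u = rng.uniform(size=200_000)
theta = 10.0 * u ** 2.0        # CDF ~ theta**0.5  =>  PDF ~ theta**(-0.5)
alpha = powerlaw_exponent(theta)
```

Geometric binning is essential here: linear bins would leave the θ → 0 region with too few bins to resolve the power-law slope.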
Fourth, the statistics of the FTCLEs, Λ^c_i, over a time lapse of Δt = m dt timesteps, are shown in Fig. 11 for CdV and L96. All six Λ^c_i are shown for CdV, while a representative set of six Λ^c_i is shown for L96, such that λ_i > 0 for i = 1, 2, 3, λ_i = 0 for i = 7, and λ_i < 0 for i = 11, 17. For CdV, the most probable statistics are well captured by the ESN, in agreement with the target data. There are slight deviations at the tails of the distributions, which are still in agreement within error bars (shaded region). In the case of L96, the agreement is good for both the most probable statistics and the tails, for all FTCLEs (including those not shown). The first moment of the distributions, i.e. the mean of the FTCLE time series, must be equal to the Lyapunov exponents, λ_i = (1/T) ∫_0^T Λ^c_i dt, which indeed holds for all the cases considered here. The agreement between ESN and target sets in Fig. 11 shows that the ESN is able to accurately learn the finite-time variability of the CLV growth rates also for higher-dimensional systems, which are characterized by many Lyapunov exponents.
Finally, in Table 3 we show the estimated Kaplan-Yorke dimension [76] for all the considered systems and compare the outcomes of the ESN and the target. This dimension is defined as D_KY = k + (Σ_{i=1}^k λ_i)/|λ_{k+1}|, where k is such that the sum of the first k LEs is positive and the sum of the first k+1 LEs is negative. We observe good agreement in all cases, with ≤ 1% error. This observation further confirms the ability of the ESN to accurately learn the properties of the chaotic attractor.
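The Kaplan-Yorke dimension can be computed directly from an ordered Lyapunov spectrum; a minimal sketch, using approximate literature values of the L63 exponents as an example.

```python
import numpy as np

def kaplan_yorke(lyap):
    """Kaplan-Yorke dimension from a Lyapunov spectrum:
    D_KY = k + sum(lambda_1..k) / |lambda_{k+1}|, with k the largest index
    whose partial sum of exponents is still non-negative."""
    lam = np.sort(np.asarray(lyap, dtype=float))[::-1]  # descending order
    csum = np.cumsum(lam)
    if csum[-1] >= 0:
        return float(len(lam))   # the partial sums never turn negative
    k = int(np.argmax(csum < 0))  # first index where the partial sum is negative
    if k == 0:
        return 0.0               # even the leading exponent is negative
    return k + csum[k - 1] / abs(lam[k])

# Example with approximate literature values of the L63 exponents.
d_l63 = kaplan_yorke([0.906, 0.0, -14.57])   # ≈ 2.06
```

The value ≈ 2.06 reflects the well-known fractal (slightly above two-dimensional) geometry of the L63 attractor.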

Conclusion
Stability analysis is a principled mathematical tool to quantitatively answer key questions about the behaviour of nonlinear systems: Will infinitesimal perturbations grow in time (i.e. is the system linearly unstable)? If so, what are the perturbations' growth rates (i.e. how linearly unstable is the system)? What are the directions of growth? To answer these questions, traditionally, we linearize the equations of the dynamical system around a reference point and compute the properties of the tangent space, the dynamics of which is governed by the Jacobian. The overarching goal of this paper is to propose a method that infers the stability properties directly from data, without relying on knowledge of the governing differential equations. We tackle chaotic systems, whose linearized behaviour is more general and intricate than that of periodic or quasi-periodic oscillations. First, we propose the echo state network with the recycle validation as a tool to accurately learn the chaotic dynamics from data. The data are provided by the integration of low- and higher-dimensional prototypical chaotic dynamical systems. These systems are qualitatively different from each other and are toy models that describe diverse physical settings, ranging from climatology and meteorology to chemistry. Second, we mathematically derive the Jacobian of the echo state network (Eq. (18)). In contrast to other recurrent neural networks, such as long short-term memory networks or gated recurrent units, the Jacobian of the ESN is mathematically simple and computationally straightforward. Third, we analyse the stability properties inferred from the ESN and compare them with the target properties (ground truth) obtained by linearizing the equations.
The ESN correctly infers the quantities that characterize the chaotic dynamics and its tangent space: (i) the long-term statistics of the solution, for which we compute the probability density function of each state variable; (ii) the covariant Lyapunov vectors, which are a physical basis of the tangent space that is covariant with the dynamics; (iii) the Lyapunov spectrum, which is the set of eigenvalues of the Oseledets matrix, i.e. the perturbations' average exponential growth rates; (iv) the finite-time Lyapunov exponents, which are the finite-time growth rates along the covariant Lyapunov vectors; and (v) the angles between the stable, neutral, and unstable splittings of the tangent space, which inform about the degree of hyperbolicity of the attractor. We show that these quantities can be accurately learned from data by the ESN, with negligible numerical errors.
As mathematically and numerically shown in [44], the stability properties of fixed points (with eigenvalue analysis) and periodic solutions (with Floquet analysis) can be inferred from covariant Lyapunov analysis. Therefore, this work opens up new opportunities for the inference of stability properties from data in nonlinear systems, from simple fixed points, through periodic oscillations, to chaos.
Acknowledgements This research has received financial support from the ERC Starting Grant No. PhyCo 949388. LM gratefully acknowledges financial support from the TUM Institute for Advanced Study (German Excellence Initiative and the EU 7th Framework Programme No. 291763). We are grateful to Alberto Racca for insightful discussions regarding the ESN. GM is also grateful to Valerio Lucarini for insightful discussions regarding dynamical systems theory.

Data availability
The implementation of the ESN follows [36] and the code can be found in the github repository https://github.com/gmargazo/ESN-CLVs.git.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/ by/4.0/.

A Algorithms to compute LEs and CLVs
In this section, we present two algorithms for the computation of LEs and CLVs. Algorithm 1 is used to calculate the first D LEs of an ESN, where N r is the dimensionality of the hidden state and D is the dimensionality of the input state. This algorithm follows the methods described in [29,32]. Algorithm 2 computes the first D CLVs for both the ESN and target chaotic systems, using the approach outlined in [42]. These algorithms are crucial for understanding the dynamics and predictability of the systems being studied.
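A compact sketch of the QR-based computation of the LEs, in the spirit of Algorithm 1 but not the paper's exact implementation: an orthonormal set of tangent vectors is propagated with the one-step Jacobian and re-orthonormalized by QR every m steps, and the LEs are the time-averaged logarithms of the diagonal of R. The `jacobian(x)` interface, returning the one-step tangent map and the next state, is an assumption of this sketch.

```python
import numpy as np

def lyapunov_spectrum(jacobian, x0, n_steps, dt, n_exp, m=1):
    """First n_exp Lyapunov exponents by repeated QR re-orthonormalization.

    `jacobian(x)` must return (J, x_next): the one-step tangent map at x and
    the next state (a hypothetical interface chosen for this sketch)."""
    x = x0
    rng = np.random.default_rng(0)
    Q = np.linalg.qr(rng.standard_normal((len(x0), n_exp)))[0]
    le_sum = np.zeros(n_exp)
    for n in range(n_steps):
        J, x = jacobian(x)
        Q = J @ Q                              # propagate the tangent vectors
        if (n + 1) % m == 0:
            Q, R = np.linalg.qr(Q)             # re-orthonormalize
            le_sum += np.log(np.abs(np.diag(R)))
    return le_sum / (n_steps * dt)

# Toy linear system with known exponents (0.3, -0.7).
dt = 0.01
A = np.diag(np.exp(np.array([0.3, -0.7]) * dt))
les = lyapunov_spectrum(lambda x: (A, A @ x), np.ones(2), 10_000, dt, 2, m=5)
```

For the ESN, the same loop applies with the derived Jacobian of Eq. (18) in place of the toy matrix A, and N_r-dimensional hidden states in place of x.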

B Robustness
An important aspect of data-driven approaches is their ability to perform accurately under a variety of conditions. In this section, we evaluate the robustness of our approach by using smaller training sets (less data) subject to noise levels higher than those of Sect. 3. We also test the effect on the accuracy of the learning of using a loss function other than the mean square error (MSE) defined in Eq. (16). The ESN architecture follows [36], where it was trained with chaotic data from the Lorenz 63 and Lorenz 96 systems and robustly optimized to maximize the prediction horizon under different validation strategies.

B.1 Training with less data and higher noise intensity
It has been demonstrated that adding a small amount of centered Gaussian noise, proportional to the standard deviation of the chaotic signal, during training improves the learned dynamics [32,36]. Noise helps the ESN generalize to unseen data. In Sect. 3 we add Gaussian noise with zero mean and standard deviation σ_n = δσ_y, where δ = 0.05% and σ_y is the component-wise standard deviation of the data. We consider the Lorenz 96 system with D = 10 degrees of freedom and F = 8, such that the system is chaotic. We increase the noise intensity to δ = {0.5%, 5%, 10%}. We also quantify the effect of less training data by using 100τ_λ and 500τ_λ long time series, i.e. 1/10 and half of the 1000τ_λ long time series that we used in Sect. 3. Figure 12 shows the effects on the Lyapunov spectrum. In Fig. 12a, where the training set is 100τ_λ long, there is good agreement between the target (black squares) and the ESN (coloured points) for the positive exponents. As expected, a gradual deterioration appears as the noise increases. In Fig. 12b, for a 500τ_λ long training set, the agreement is good for all exponents, with a smaller difference for the negative exponents compared to (a). After training N_ESN = 10 statistically independent networks with chaotic time series, some might eventually evolve towards a fixed point or a periodic orbit instead (i.e. they show spurious behaviour). Here, for 100τ_λ long training time series, no ESN evolves spuriously at 0.05% and 0.5% noise; however, at 10% noise, half of the networks show spurious evolution and are discarded at post-processing. Instead, for 500τ_λ long training time series, one and two out of ten evolve spuriously at 0.05% and 0.5% noise, respectively, but none at 5% and 10% noise, which ensures the robustness of the network.
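The noise injection used during training can be sketched as follows; `add_training_noise` is an illustrative helper implementing σ_n = δσ_y component-wise, and the sinusoidal series stands in for an actual chaotic training trajectory.

```python
import numpy as np

def add_training_noise(u, delta, seed=0):
    """Add zero-mean Gaussian noise with standard deviation
    sigma_n = delta * sigma_y, component-wise, to the training series u
    of shape (n_timesteps, D)."""
    rng = np.random.default_rng(seed)
    sigma_y = np.std(u, axis=0)            # component-wise standard deviation
    return u + rng.normal(0.0, delta * sigma_y, size=u.shape)

# Example: 0.5% noise (delta = 0.005), one of the levels tested here,
# applied to a placeholder multivariate time series.
t = np.linspace(0.0, 100.0, 10_000)
u = np.sin(t)[:, None] * np.ones((1, 10))  # stand-in for a D = 10 trajectory
u_noisy = add_training_noise(u, delta=0.005)
```

Scaling the noise by the component-wise σ_y keeps the perturbation comparable across degrees of freedom with different amplitudes.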
As a further test, in Fig. 13 we consider the minimum angles between the subspaces spanned by the CLVs. In Fig. 13a-c the ESNs are trained with 100τ_λ long time series, and in Fig. 13d-f with 500τ_λ. Overall, the results are in good agreement with the target, ensuring the robustness of the ESN. A slight and gradual disagreement is observed as the noise intensity increases, in particular for θ_{U,N}.

B.2 Training with a different loss function

The MSE, defined in Eq. (16), is a commonly used loss function in the ESN architecture [38]. We also train the ESNs with the mean absolute error (MAE), defined in Eq. (27). By comparing the stability properties obtained using the MSE and MAE loss functions, we can gain a better understanding of the potential impact of the choice of loss function on the performance of the ESN. In Fig. 14, the results correspond to a 100τ_λ long training set, where Eq. (27) was used as the loss function. The Lyapunov spectrum of Fig. 14a is qualitatively similar to Fig. 12a.
In practice, training with the MAE resulted in less stable ESNs, with more failures during the test set. For a 100τ_λ long training set at 10% σ_y noise, 80% of the ESNs failed with the MAE, in contrast to 50% with the MSE at the same noise level. Figure 14b-d is similar to Fig. 13a-c, showing minor differences. We also trained the ESNs with 500τ_λ long training sets, as in Sect. B.1. Interestingly, we obtain results similar to Figs. 12b and 13d-f, with no significant differences (results not shown). Based on our analyses, we conclude that the process of extracting the stability properties of an ESN is robust against higher levels of noise, smaller training sets, and the use of an MAE loss function. Our results suggest that good practice is to use small to moderate levels of centered Gaussian noise in the training set, a sufficiently large reservoir size, and a training trajectory of at least 100τ_λ.