1 Introduction

Stochastic dynamical systems are widely used to model and study systems that evolve under the influence of both deterministic and random effects. They offer a framework for understanding, predicting, and controlling systems exhibiting randomness. This makes them invaluable across various scientific, engineering, and economic applications.

Given a state-space \(\Omega \subset \mathbb {R}^d\) and a sample space \(\Omega _s\), we consider a discrete-time stochastic dynamical system

$$\begin{aligned} \pmb {x}_{n} = F(\pmb {x}_{n-1},\tau _n), \qquad n\ge 1, \quad \pmb {x}_n \in \Omega , \end{aligned}$$
(1)

where \(\{\tau _n\}_{n\in \mathbb {N}}\in \Omega _s\) are independent and identically distributed (i.i.d.) random variables with distribution \(\rho \) supported on \(\Omega _s\), \(\pmb {x}_0\in \Omega \) is an initial condition, and \(F: \Omega \times \Omega _s\rightarrow \Omega \) is a function. In many applications, the function F is unknown or cannot be studied directly, which is the premise of this paper. We adopt the notation \(F_{\tau }(\pmb {x})=F(\pmb {x},\tau )\) for convenience and express \(\pmb {x}_{n} = (F_{\tau _n}\circ \cdots \circ F_{\tau _1})(\pmb {x}_0)\), where ‘\(\circ \)’ denotes the composition of functions.

With the assumptions above, equation (1) describes a discrete-time Markov process. For such systems, the Kolmogorov backward equation governs the evolution of an observable [34, 40], with the right-hand side defined as the stochastic Koopman operator [51]. The works [51, 57] have spurred increased interest in the data-driven approximation of both deterministic and stochastic Koopman operators and in analyzing their spectral properties [11, 43, 54]. Prominent applications span a variety of fields including fluid dynamics [31, 52, 66, 68], epidemiology [64], neuroscience [9, 14, 47], finance [46], robotics [6, 8], power systems [75, 76], and molecular dynamics [39, 59, 69, 70].

Although the function F is usually nonlinear, the stochastic Koopman operator is always linear; however, it operates on an infinite-dimensional space of observables. Of particular interest is the spectral content of the Koopman operator near the unit circle, which corresponds to slow subspaces encapsulating the long-term dynamics. If finite-dimensional eigenspaces can capture this spectral content effectively, they can serve as a finite-dimensional approximation. Numerous algorithms have been developed to approximate the spectral properties of Koopman operators [1, 2, 10, 12, 26, 30, 42, 48, 52, 55]. Among these, dynamic mode decomposition (DMD) is particularly popular [44]. Initially introduced in the fluids community [67, 68], DMD’s connection to the Koopman operator was established in [66]. Since then, several extensions and variants of DMD have been developed [4, 15, 19, 63, 84, 85], including methods tailored for stochastic systems [24, 72, 82, 87].

At its core, DMD is a projection method. It is widely recognized that achieving convergence and meaningful applications of DMD can be challenging due to the infinite-dimensional nature of Koopman operators [12, 23, 37, 84]. Challenges include the presence of spurious (unphysical) modes resulting from projection, essential spectra, the absence of non-trivial finite-dimensional invariant subspaces, and the verification of Koopman mode decompositions (KMDs). Residual Dynamic Mode Decomposition (ResDMD) has been introduced to address these issues for deterministic systems [20, 23]. ResDMD facilitates a data-driven approach to compute residuals associated with the full infinite-dimensional Koopman operator, thus enabling the computation of spectral properties with controlled errors and the verification of learned dictionaries and KMDs. Despite the evident importance of analyzing stochastic systems through the Koopman perspective, similar verified DMD methods in this setting are absent.

This paper presents several infinite-dimensional techniques for the data-driven analysis of stochastic systems. The central concept we explore is going beyond expectations to include higher moments within the Koopman framework. Figure 1 illustrates this point by depicting the evolution of two eigenfunctions associated with the stochastic Van der Pol oscillator (detailed in Sect. 5.2), alongside the expectation determined by the stochastic Koopman operator. Both eigenvalues and eigenfunctions are computed with a negligible projection error. Notably, although both trajectories oscillate at the same frequency, since the corresponding eigenvalues have identical arguments, the variances of the trajectories exhibit significant differences. This divergence is quantified by what we define as a variance residual (see Sect. 3.2).

Fig. 1 The evolution of two eigenfunctions on the attractor of the stochastic Van der Pol oscillator from Sect. 5.2. The plots show the arguments. In blue, we see a sample of the true trajectories, while the expected values predicted from the stochastic Koopman operator are shown in red. Top: Eigenfunction associated with \(m=0\) and \(k=1\) in Table 1. The variance residual is small, and trajectories hug the expectation closely. Bottom: Eigenfunction associated with \(m=1\) and \(k=1\) in Table 1. The variance residual is large, and trajectories deviate from the expectation.

1.1 Contributions

The contributions of our paper are as follows:

  • Variance Incorporation: We integrate the concept of variance into the Koopman framework and establish its relationship with batched Koopman operators. Proposition 2 decomposes a mean squared Koopman error into an infinite-dimensional residual and a variance term. Additionally, we present methodologies (see Algorithms 1 and 2) for independently calculating these components, thereby enhancing the understanding of the spectral properties of the Koopman operator and the deviation from mean dynamics.

  • Variance-Pseudospectra: We introduce a novel concept of pseudospectra, termed variance-pseudospectra (see Definition 2), which serves as a measure of statistical coherency. We also offer algorithms for computing these pseudospectra (see Algorithms 3 and 4) and prove their convergence.

  • Convergence Theory: Sect. 4 of our paper is dedicated to proving a suite of convergence theorems. These pertain to the spectral properties of stochastic Koopman operators, the accuracy of KMD forecasts, and the derivation of concentration bounds for estimating Koopman matrices from a finite set of snapshot data.

Various examples are given in Sect. 5 and code is available at: https://github.com/MColbrook/Residual-Dynamic-Mode-Decomposition.

1.2 Previous work

Existing literature on stochastic Koopman operators primarily addresses the challenge of noisy observables in extended dynamic mode decomposition (EDMD) methodologies [82], and in techniques for debiasing DMD [27, 35, 77]. A related concern is the estimation error in Koopman operator approximations due to the finite nature of data sets. This issue is present in both deterministic and stochastic scenarios. As [84] describes, EDMD converges with large data sets to a Galerkin approximation of the Koopman operator. The work in [58] thoroughly analyzes kernel autocovariance operators, including nonasymptotic error bounds under classical ergodic and mixing assumptions. In [60], the authors offer the first comprehensive probabilistic bounds on the finite-data approximation error for truncated Koopman generators in stochastic differential equations (SDEs) and nonlinear control systems. They examine two scenarios: (1) i.i.d. sampling and (2) ergodic sampling, with the latter assuming exponential stability of the Koopman semigroup. Additionally, the variational approach to conformational dynamics (VAC), which bears similarities to DMD, is known for providing spectral estimates of time-reversible processes that result in a self-adjoint transition operator. The connection of VAC with Koopman operators is detailed in [83], and the approximation of spectral information with error bounds is discussed in [39].

1.3 Data-driven setup

We present data-driven methods that utilize a dataset of “snapshot” pairs alongside a dictionary of observables. While numerous approaches for selecting a dictionary exist in the literature [17, 32, 80–82, 84, 85], this topic is not the primary focus of our current study. Following the methodology outlined in [79], we consider our given data to consist of pairs of snapshots, which are

$$\begin{aligned} \texttt {S}=\left\{ (\pmb {x}^{(m)},\pmb {y}^{(m)})\right\} _{m=1}^M,\quad \pmb {y}^{(m)}=F(\pmb {x}^{(m)},\tau _m). \end{aligned}$$
(2)

Unlike in deterministic systems, for stochastic systems, it can be beneficial for \(\texttt {S}\) to include the same initial condition \(\pmb {x}^{(m)}\) multiple times, as each execution of the dynamics yields an independent realization of a trajectory. We say that \(\texttt {S}\) is \(M_1\)-batched if it can be split into \(M_1\) subsets such that

$$\begin{aligned} \texttt {S}&=\cup _{j=1}^{M_1}{} \texttt {S}_j,\\ \texttt {S}_j&=\{(\pmb {x}^{(j)},\pmb {y}^{(j,k)}):k=1,\ldots ,M_2,\pmb {y}^{(j,k)}=F(\pmb {x}^{(j)},\tau _{j,k})\}. \end{aligned}$$

In other words, for each \(\pmb {x}^{(j)}\), we have multiple realizations of \(F_\tau (\pmb {x}^{(j)})\). Using batched data, we can approximate higher-order stochastic Koopman operators representing the moments of the trajectories. An unbatched dataset can be adapted to approximate a batched dataset by categorizing or “binning” the \(\pmb {x}\) points in the snapshot data. In practical scenarios, one may encounter a combination of both batched and unbatched data. Depending on the type of snapshot data used, Galerkin approximations of stochastic Koopman operators can be achieved in the limit of large datasets (as discussed in Sect. 2.2).
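
For concreteness, the following minimal Python sketch generates unbatched and \(M_1\)-batched snapshot data for a hypothetical noisy circle map; the map, noise distribution, and sample sizes are illustrative placeholders rather than choices made in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x, tau):
    """A toy stochastic circle map; it stands in for the unknown dynamics F."""
    return (x + 0.1 * np.sin(2 * np.pi * x) + tau) % 1.0

# Unbatched data: M independent pairs (x^(m), y^(m)), cf. (2).
M = 500
x = rng.random(M)                        # initial conditions x^(m)
y = F(x, 0.05 * rng.standard_normal(M))  # one realization of the dynamics per x^(m)

# M1-batched data: for each x^(j), M2 independent realizations y^(j,k).
M1, M2 = 100, 20
xb = rng.random(M1)
yb = F(xb[:, None], 0.05 * rng.standard_normal((M1, M2)))  # array of shape (M1, M2)
```

Binning an unbatched dataset, as described above, amounts to grouping pairs whose \(\pmb {x}\) values fall in the same cell and treating each group as one batch.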

2 Mathematical preliminaries

This section discusses several foundational concepts upon which our paper builds.

2.1 The stochastic Koopman operator

Let \(g:\Omega \rightarrow \mathbb {C}\) be a function, commonly called an observable. Given an initial condition \(\pmb {x}_0\in \Omega \), measuring the initial state of the dynamical system through g yields the value \(g(\pmb {x}_0)\). One time-step later, the measurement \(g(\pmb {x}_1) = g(F_\tau (\pmb {x}_0)) = (g\circ F_\tau )(\pmb {x}_0)\) is obtained, where \(\tau \) is a realization from a probability distribution supported on \(\Omega _s\), i.e., \(\tau \sim \rho \). The “pull-back” operator, given g, outputs the “look ahead” measurement function \(g\circ F_\tau \). This function is a random variable, and the stochastic Koopman operator is its expectation [56]:

$$\begin{aligned} \mathscr {K}_{(1)}[g] = \mathbb {E}_{\tau }\left[ g\circ F_\tau \right] =\int _{\Omega _s} g\circ F_\tau \,\textrm{d}\rho (\tau ). \end{aligned}$$
(3)

Here, \(\mathbb {E}_{\tau }\) represents the expectation with respect to the distribution \(\rho \). The subscript (1) indicates this is the first moment. Throughout the paper, we assume that the domain of the operator \(\mathscr {K}_{(1)}\) is \(L^2(\Omega ,\omega )\), where \(\omega \) is a positive measure on \(\Omega \). This space is equipped with an inner product and norm, denoted by \(\langle \cdot ,\cdot \rangle \) and \(\Vert \cdot \Vert \), respectively. We do not assume that \(\mathscr {K}_{(1)}\) is compact or self-adjoint.

We now introduce the batched Koopman operator, designed to capture the variance and other higher-order moments in the trajectories of dynamical systems. For \(r\in \mathbb {N}\) and \(g:\Omega ^{r}\rightarrow \mathbb {C}\), we define

$$\begin{aligned} \mathscr {K}_{(r)}[g] = \mathbb {E}_{\tau }\left[ g(F_\tau ,\ldots ,F_\tau )\right] , \end{aligned}$$
(4)

where the same realization \(\tau \sim \rho \) is used for the r arguments of g. Notably, both the classical and the batched versions of the Koopman operators adhere to the semigroup property, as we will demonstrate.

Proposition 1

For any \(r,n\in \mathbb {N}\),

$$\begin{aligned} \mathscr {K}_{(r)}^n[g]=\mathbb {E}_{\tau _1,\ldots ,\tau _n}\left[ g(F_{\tau _n}\circ \cdots \circ F_{\tau _1},\ldots ,F_{\tau _n}\circ \cdots \circ F_{\tau _1})\right] . \end{aligned}$$

Proof

For \(r=1\), see [24]. For \(r>1\), note that \(\mathscr {K}_{(r)}\) is a first-order Koopman operator of a dynamical system on \(\Omega ^r\). \(\square \)

This proposition indicates that n applications of the stochastic Koopman operator yield the expected value of an observable after n time steps. It is crucial to understand that \(\mathscr {K}_{(1)}\) only calculates the expected value. To gain insights into the variability around this mean and to understand the projection error inherent in DMD methods, we need to consider higher-order statistics, such as the variance. These aspects are further explored in Sect. 3.

2.2 Extended dynamic mode decomposition

EDMD is a widely-used method for constructing a finite-dimensional approximation of the Koopman operator \(\mathscr {K}_{(1)}\), utilizing the snapshot data \(\texttt {S}\) in (2). This approach involves projecting the infinite-dimensional Koopman operator onto a finite-dimensional matrix and approximating its entries. For notational simplicity, we will omit the subscript (1) when referring to the Koopman operator in this section. Originally, EDMD assumes that the initial conditions are independently drawn from a distribution \(\omega \) [84]. However, in our adaptation, we apply EDMD to any given \(\texttt {S}\), treating the \(\pmb {x}^{(m)}\) as quadrature nodes for integration with respect to \(\omega \). This flexibility allows us to use different quadrature weights depending on the specific scenario.

One first chooses a dictionary \(\{\psi _1,\ldots ,\psi _{N}\}\) in the space \(L^2(\Omega ,\omega )\). This dictionary consists of a list of observables that form a finite-dimensional subspace \(V_N=\textrm{span}\{\psi _1,\ldots ,\psi _{N}\}\). EDMD computes a matrix \(K\in \mathbb {C}^{N\times N}\) that approximates the action of \(\mathscr {K}\) within this subspace. Specifically, the goal is for K to represent the operator \(\mathscr {P}_{V_{N}}\mathscr {K}\mathscr {P}_{V_{N}}^*\) in the dictionary basis, where \(\mathscr {P}_{V_{N}}:L^2(\Omega ,\omega )\rightarrow V_N\) is the orthogonal projection onto \(V_N\). In the Galerkin framework, this equates to:

$$\begin{aligned} \langle \mathscr {K}[\psi _j],\psi _i\rangle = \sum _{s=1}^N K_{s,j}\langle \psi _s,\psi _i\rangle , \qquad 1\le i,j\le N. \end{aligned}$$

A matrix K satisfying this relationship is given by

$$\begin{aligned} K = G^{\dagger }A, \qquad G_{i,j} = \langle \psi _j,\psi _i\rangle ,\quad A_{i,j} = \langle \mathscr {K}[\psi _j],\psi _i\rangle . \end{aligned}$$

Commonly, we stack the dictionary functions into a row vector and define the feature map

$$\begin{aligned} \Psi (\pmb {x})=\begin{bmatrix}\psi _1(\pmb {x})&\cdots&\psi _N(\pmb {x}) \end{bmatrix}\in \mathbb {C}^{1\times N}. \end{aligned}$$

Then, for any \(g\in V_N\), we use the shorthand \(g=\Psi \pmb {g}\) for \(g(\pmb {x}) = \sum _{j=1}^N g_j\psi _j(\pmb {x})\). With the previously defined K, the approximation becomes

$$\begin{aligned} \mathscr {K}[g](\pmb {x}) \approx \sum _{i=1}^N \left( \sum _{j=1}^N K_{i,j}g_j\right) \psi _i(\pmb {x})=\Psi (\pmb {x})K\pmb {g}. \end{aligned}$$

The accuracy of this approximation depends on how well \(V_N\) can approximate \(\mathscr {K}g\).

The entries of the matrices G and A are inner products and must be approximated using the trajectory data \(\texttt {S}\). For quadrature weights \(\{w_m\}\), we define \(\tilde{G}\) as the numerical approximation of G:

$$\begin{aligned} \tilde{G}_{i,j} = \sum _{m=1}^{M} w_{m} \psi _j(\pmb {x}^{(m)})\overline{\psi _i(\pmb {x}^{(m)})}\approx \langle \psi _j\,,\psi _i\rangle \,= {G}_{i,j}\,. \end{aligned}$$
(5)

The weights \(\{w_m\}\) reflect the significance assigned to each snapshot in the dataset, influenced by factors such as data distribution or reliability, which we will explore further. Similarly, for A, we define

$$\begin{aligned} \tilde{A}_{i,j} = \sum _{m=1}^{M} w_{m} \psi _j(\pmb {y}^{(m)})\overline{\psi _i(\pmb {x}^{(m)})} \approx \langle \mathscr {K}[\psi _j]\,,\psi _i\rangle \,=A_{i,j}\,. \end{aligned}$$
(6)

Let \(\Psi _X,\Psi _Y\in \mathbb {C}^{M\times N}\) collect the dictionary's evaluations at these samples:

$$\begin{aligned} \Psi _X=\begin{pmatrix} \Psi (\pmb {x}^{(1)})\\ \vdots \\ \Psi (\pmb {x}^{(M)}) \end{pmatrix}\,,\quad \Psi _Y=\begin{pmatrix} \Psi (\pmb {y}^{(1)})\\ \vdots \\ \Psi (\pmb {y}^{(M)}) \end{pmatrix}\,, \end{aligned}$$
(7)

and let \(W=\textrm{diag}(w_1,\ldots ,w_{M})\). Then we can succinctly write

$$\begin{aligned} \tilde{G}=\Psi _X^*W\Psi _X,\quad \tilde{A}=\Psi _X^*W\Psi _Y. \end{aligned}$$
(8)

Throughout this paper, the symbol \(\tilde{X}\) denotes an estimation of the quantity X.
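
Concretely, the assembly in (5)–(8) amounts to a few matrix products. The sketch below uses a Fourier dictionary, uniform quadrature weights \(w_m=1/M\), and a toy stochastic circle map; all of these are illustrative choices rather than prescriptions from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x, tau: (x + 0.1 * np.sin(2 * np.pi * x) + tau) % 1.0  # toy stochastic map
M = 500
x = rng.random(M)                              # snapshots x^(m)
y = F(x, 0.05 * rng.standard_normal(M))        # snapshots y^(m) = F(x^(m), tau_m)

def feature_map(z, N):
    """Rows are Psi(z) = [psi_1(z), ..., psi_N(z)] with psi_j(z) = exp(2*pi*i*j*z)."""
    return np.exp(2j * np.pi * np.outer(z, np.arange(1, N + 1)))

N = 10
Psi_X, Psi_Y = feature_map(x, N), feature_map(y, N)   # cf. (7)
W = np.diag(np.full(M, 1.0 / M))                      # quadrature weights w_m = 1/M

G_tilde = Psi_X.conj().T @ W @ Psi_X                  # cf. (8)
A_tilde = Psi_X.conj().T @ W @ Psi_Y
K_tilde = np.linalg.pinv(G_tilde) @ A_tilde           # EDMD matrix, K = G^dagger A
eigvals, eigvecs = np.linalg.eig(K_tilde)             # candidate eigenvalue-eigenvector pairs
```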

Various sampling methods converge in the large data limit, meaning that

$$\begin{aligned} \lim _{M\rightarrow \infty } \tilde{G}=G,\quad \lim _{M\rightarrow \infty } \tilde{A}=A. \end{aligned}$$
(9)

We detail three convergent sampling methods:

  (i) Random sampling: In the initial definition of EDMD, \(\omega \) is a probability measure and \(\{\pmb {x}^{(m)}\}_{m=1}^M\) are independently drawn according to \(\omega \) with each quadrature weight set to \(w_m=1/M\). The strong law of large numbers guarantees that (9) holds with probability one [38, Section 3.4] [41, Section 4]. Typically, convergence occurs at a Monte Carlo rate of \(\mathscr {O}(M^{-1/2})\) [13].

  (ii) Ergodic sampling: If the stochastic dynamical system is ergodic, the Birkhoff–Khinchin theorem [33, Theorem II.8.1, Corollary 3] supports convergence using data from a single trajectory for almost every initial point. Specifically, we use:

    $$\begin{aligned} \pmb {x}^{(m+1)}=F(\pmb {x}^{(m)},\tau _{m}),\quad w_m=1/M. \end{aligned}$$

    This sampling method’s analysis for stochastic Koopman operators is detailed in [82]. An advantage is that knowledge of \(\omega \) is not required. However, the convergence rate depends on the specific problem [36]. Note that in an ergodic system, the stochastic Koopman operator is an isometry on \(L^1(\Omega ,\omega )\) but typically not on \(L^2(\Omega ,\omega )\).

  (iii) High-order quadrature: When the dictionary and F are sufficiently regular, the dimension d is not too large, and we are free to choose the \(\{\pmb {x}^{(m)}\}_{m=1}^{M}\), employing a high-order quadrature rule is advantageous. For deterministic systems, this approach can significantly increase convergence rates in (9) [23]. In stochastic systems, high-order quadrature applies primarily to batched snapshot data. We may select \(\{\pmb {x}^{(j)}\}_{j=1}^{M_1}\) based on an \(M_1\)-point quadrature rule with associated weights \(\{w_j\}_{j=1}^{M_1}\). Convergence is achieved as \(M_2\rightarrow \infty \), effectively applying Monte Carlo integration of the random variable \(\tau \) over \(\Omega _s\) for each fixed \(\pmb {x}^{(j)}\).

The convergence described in (9) implies that the eigenvalues obtained through EDMD converge to the spectrum of \(\mathscr {P}_{V_{N}}\mathscr {K}\mathscr {P}_{V_{N}}^*\) as \(M\rightarrow \infty \). Therefore, approximating the spectrum of \(\mathscr {K}\), denoted \(\textrm{Sp}(\mathscr {K})\), by the eigenvalues of \(\tilde{K}\) is closely related to the so-called finite section method [7]. However, just as the finite section method is prone to spectral pollution, that is, spurious eigenvalues that persist even as the size of the dictionary increases, so too is EDMD [84]. Consequently, having a method to validate the accuracy of the proposed eigenvalue-eigenvector pairs becomes crucial, which is one of the key functions of ResDMD.

2.3 Residual dynamic mode decomposition (ResDMD)

Accurately estimating the spectrum of \(\mathscr {K}\) is critical for analyzing dynamical systems. For deterministic systems, ResDMD achieves this goal, providing robust spectral estimates [20, 23]. Unlike classical DMD methods, ResDMD introduces an additional matrix specifically designed to approximate \(\mathscr {K}^*\mathscr {K}\). This enhancement not only offers rigorous error guarantees for the spectral approximation but also enables a posteriori assessment of the reliability of the computed spectra and Koopman modes. This capability is particularly valuable in addressing issues such as spectral pollution, which are common challenges in DMD-type methods.

ResDMD is built around the approximation of residuals associated with \(\mathscr {K}\), providing an error bound. For any given candidate eigenvalue-eigenvector pair \((\lambda ,g)\), with \(\lambda \in \mathbb {C}\) and \(g=\Psi \,\pmb {g}\in V_{N}\), one can consider the relative squared residual as follows:

$$\begin{aligned}&\frac{\int _{\Omega }\left| \mathscr {K}[g](\pmb {x})-\lambda g(\pmb {x})\right| ^2\,\textrm{d}\omega (\pmb {x})}{\int _{\Omega }\left| g(\pmb {x})\right| ^2\,\textrm{d}\omega (\pmb {x})}\nonumber \\&\quad =\frac{\langle \mathscr {K}[g],\mathscr {K}[g]\rangle -\lambda \langle g,\mathscr {K}[g]\rangle -\overline{\lambda }\langle \mathscr {K}[g],g\rangle +|\lambda |^2\langle g,g\rangle }{\langle g,g\rangle }. \end{aligned}$$
(10)

This pair \((\lambda ,g)\) can be computed from K or by other methods. A small residual means that \(\lambda \) can be approximately considered as an eigenvalue of \(\mathscr {K}\), with g as the corresponding eigenfunction. The relative residual in (10) serves as a measure of the coherency of observables, indicating that observables with smaller residuals play a significant role in the dynamics of the system. If the relative (non-squared) residual is bounded by \(\epsilon \), then \(\mathscr {K}^ng=\lambda ^n g+\mathscr {O}(n\epsilon )\). In other words, \(\lambda \) characterizes the coherent oscillation and the decay/growth in the observable g with time.
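
To see where the \(\mathscr {O}(n\epsilon )\) estimate comes from, one can use the telescoping identity below; the final bound additionally assumes \(\Vert \mathscr {K}\Vert \le 1\) and \(|\lambda |\le 1\) (as holds, for instance, when \(\omega \) is an invariant measure):

$$\begin{aligned} \mathscr {K}^n[g]-\lambda ^n g=\sum _{j=0}^{n-1}\lambda ^{j}\mathscr {K}^{n-1-j}\left[ \mathscr {K}[g]-\lambda g\right] ,\qquad \Vert \mathscr {K}^n[g]-\lambda ^n g\Vert \le \sum _{j=0}^{n-1}|\lambda |^{j}\Vert \mathscr {K}\Vert ^{n-1-j}\Vert \mathscr {K}[g]-\lambda g\Vert \le n\epsilon \Vert g\Vert . \end{aligned}$$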

The residual is closely related to the notion of pseudospectra [78].

Definition 1

For any \(\lambda \in \mathbb {C}\), define:

$$\begin{aligned} \sigma _{\textrm{inf}}(\lambda )=\inf \left\{ \Vert \mathscr {K}[g]-\lambda g\Vert :g{\in }L^2(\Omega ,\omega ),\Vert g\Vert =1\right\} . \end{aligned}$$

For \(\epsilon >0\), the approximate point \(\epsilon \)-pseudospectrum is

$$\begin{aligned} \textrm{Sp}_{\epsilon }(\mathscr {K})=\textrm{Cl}\left( \left\{ \lambda \in \mathbb {C}:\sigma _{\textrm{inf}}(\lambda )<\epsilon \right\} \right) , \end{aligned}$$

where \(\textrm{Cl}\) denotes the closure of a set. Furthermore, we say that g is an \(\epsilon \)-pseudoeigenfunction if there exists \(\lambda \in \mathbb {C}\) such that the relative squared residual in (10) is bounded by \(\epsilon ^2\).

To compute (10), notice that three of the four inner products appearing in the numerator are:

$$\begin{aligned} \langle \mathscr {K}[g],g\rangle =\pmb {g}^*A\pmb {g},\;\langle g,\mathscr {K}[g]\rangle =\pmb {g}^*A^*\pmb {g},\; \langle g,g\rangle =\pmb {g}^*G\pmb {g}, \end{aligned}$$
(11)

with A and G numerically approximated by EDMD in (8). Hence, the success of the computation relies on finding a numerical approximation to \(\langle \mathscr {K}[g],\mathscr {K}[g]\rangle \). To that end, we deploy the same quadrature rule discussed in (5)-(6) and set

$$\begin{aligned} L=[L_{i,j}]\,,\quad L_{i,j} = \langle \mathscr {K}[\psi _j],\mathscr {K}[\psi _i]\rangle ,\quad \tilde{L}=\Psi _Y^*W\Psi _Y\,, \end{aligned}$$
(12)

then \(\langle \mathscr {K}[g],\mathscr {K}[g]\rangle \approx \pmb {g}^*\Psi _Y^*W\Psi _Y\pmb {g}=\pmb {g}^*\tilde{L}\pmb {g}\). We obtain a numerical approximation of (10) as

$$\begin{aligned} \left[ \textrm{res}(\lambda ,g)\right] ^2=\frac{\pmb {g}^*\left[ \tilde{L}- \lambda \tilde{A}^* - \overline{\lambda }\tilde{A} + |\lambda |^2\tilde{G}\right] \pmb {g}}{\pmb {g}^*\tilde{G}\pmb {g}}. \end{aligned}$$
(13)

The matrix L introduced by ResDMD formally corresponds to an approximation of \(\mathscr {K}^*\mathscr {K}\). The computation utilizes the same dataset as that employed for \(\tilde{G}\) and \(\tilde{A}\) and is computationally efficient to construct. The work presented in [23] demonstrates that the approximation outlined in (13) can be effectively used in various algorithms for rigorously computing the spectra and pseudospectra of \(\mathscr {K}\) for deterministic systems. However, these results from [23] are not directly applicable to stochastic systems.
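
In code, the residual (13) requires nothing beyond the matrices already assembled for EDMD plus \(\tilde{L}\) from (12). The helper functions below are a minimal sketch of this computation (not the reference implementation from the repository linked above); they take the feature matrices and quadrature weights as inputs.

```python
import numpy as np

def resdmd_matrices(Psi_X, Psi_Y, w):
    """Assemble G, A, L from feature matrices and quadrature weights, cf. (8) and (12)."""
    W = np.diag(w)
    G = Psi_X.conj().T @ W @ Psi_X
    A = Psi_X.conj().T @ W @ Psi_Y
    L = Psi_Y.conj().T @ W @ Psi_Y
    return G, A, L

def residual(lam, g, G, A, L):
    """Relative residual (13) for a candidate eigenpair (lam, g = Psi @ g_coeffs)."""
    num = g.conj() @ (L - lam * A.conj().T - np.conj(lam) * A + abs(lam) ** 2 * G) @ g
    den = g.conj() @ G @ g
    return float(np.sqrt(max(num.real / den.real, 0.0)))
```

Candidate eigenpairs produced by EDMD can then be screened by discarding those whose residual exceeds a chosen tolerance, which is the essence of ResDMD for deterministic systems; as explained in Sect. 3.2, for stochastic systems this same quantity instead approximates a variance-residual.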

3 Variance from the Koopman perspective

When analyzing a system with inherent stochasticity, basing conclusions only on the mean trajectory can lead to misleading interpretations, as illustrated in Fig. 1. To achieve a more accurate statistical understanding of such systems, it is crucial to quantify how much and in what ways the trajectory deviates from this mean. This need for a more comprehensive analysis underpins our exploration into quantifying the variance.

3.1 Variance via Koopman operators

For any observable \(g\in L^2(\Omega ,\omega )\) and \(\pmb {x}\in \Omega \), \(g(F_\tau (\pmb {x}))\) is a random variable. One can define its moments:

$$\begin{aligned} \mathbb {E}_{\tau }[(g(F_\tau (\pmb {x})))^r]=\int _{\Omega _s} [g(F_\tau (\pmb {x}))]^r\,\textrm{d}\rho (\tau ),\quad r\in \mathbb {N}. \end{aligned}$$

Recalling the definitions in (4), this becomes:

$$\begin{aligned} \mathbb {E}_{\tau }[(g(F_\tau (\pmb {x})))^r]=\mathscr {K}_{(r)}[g\otimes \cdots \otimes g](\pmb {x},\ldots ,\pmb {x}). \end{aligned}$$

This means that the r-th order Koopman operator directly computes the moments of the trajectory. In particular, the combination of the first and the second moment provides the following variance term:

$$\begin{aligned} \text {Var}_{\tau }[g(F_\tau (\pmb {x}))]&= \mathbb {E}_\tau \left[ |g(F_\tau (\pmb {x}))|^2\right] -|\mathbb {E}_\tau [g(F_\tau (\pmb {x}))]|^2\\&= \mathscr {K}_{(2)}[g\otimes \overline{g}](\pmb {x},\pmb {x})-|\mathscr {K}_{(1)}[g](\pmb {x})|^2\,. \end{aligned}$$

We integrate the local definition of variance over the entire domain to define:

$$\begin{aligned} \text {Var}_{\tau }[g(F_\tau )]&= \int _\Omega \text {Var}_{\tau }[g(F_\tau (\pmb {x}))]\,\textrm{d}\omega (\pmb {x}). \end{aligned}$$
(14)

The following proposition provides a Koopman analog of decomposing an integrated mean squared error (IMSE).

Proposition 2

Let \(g,h\in L^2(\Omega ,\omega )\), then

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau +h\Vert ^2\right] \\&\quad =\Vert \mathscr {K}_{(1)}[g]+h\Vert ^2+\int _{\Omega }\textrm{Var}_{\tau }\left[ \left( g\circ F_\tau \right) (\pmb {x})\right] \,\textrm{d}\omega (\pmb {x}). \end{aligned} \end{aligned}$$
(15)

Proof

We expand \(|g(F_\tau (\pmb {x}))+h(\pmb {x})|^2\) for a fixed \(\pmb {x}\in \Omega \) and take expectations to find that

$$\begin{aligned}&\mathbb {E}_{\tau }\left[ |g(F_\tau (\pmb {x}))+h(\pmb {x})|^2\right] \\&\quad =\mathbb {E}_{\tau }\left[ |g(F_\tau (\pmb {x}))|^2\right] {+}\mathscr {K}_{(1)}[g](\pmb {x})\overline{h(\pmb {x})}\\&\qquad {+}h(\pmb {x})\overline{\mathscr {K}_{(1)}[g](\pmb {x})}{+}|h(\pmb {x})|^2\\&\quad =|\mathscr {K}_{(1)}[g](\pmb {x})+h(\pmb {x})|^2+\mathbb {E}_{\tau }\left[ |g(F_\tau (\pmb {x}))|^2\right] \\&\qquad -\left| \mathbb {E}_{\tau }\left[ g(F_\tau (\pmb {x}))\right] \right| ^2. \end{aligned}$$

The result now follows by integrating over \(\pmb {x}\) with respect to the measure \(\omega \). \(\square \)

Similarly, for any two functions \(g,h\in L^2(\Omega ,\omega )\), we define the covariance:

$$\begin{aligned} \mathscr {C}(g,h)=\int _{\Omega }\mathbb {E}_{\tau }[(g\circ F_\tau -\mathscr {K}_{(1)}[g])\overline{(h\circ F_\tau -\mathscr {K}_{(1)}[h])}]\,\text {d}\omega (\pmb {x}) \end{aligned}$$
(16)

and obtain the following similar result using covariance:

$$\begin{aligned}&\int _{\Omega } \mathbb {E}_{\tau }[g(F_\tau (\pmb {x}))\overline{h(F_\tau (\pmb {x}))}] \,\textrm{d} \omega (\pmb {x}) \nonumber \\&\qquad =\langle \mathscr {K}[g],\mathscr {K}[h] \rangle + \mathscr {C}(g,h)\,. \end{aligned}$$

Proposition 2 is analogous to the decomposition of an IMSE and is practically useful. Suppose we use an observable h to approximate \(-g\circ F_\tau \), in an attempt to minimize \(\Vert g\circ F_\tau +h\Vert ^2\). The optimal choice is \(h=-\mathscr {K}_{(1)}[g]\); however, this approximation will not be perfect due to the variance term in (15). Therefore, there is a variance-residual tradeoff for stochastic Koopman operators. Depending on the type of trajectory data collected, one can approximate the quantities \(\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau +h\Vert ^2\right] \) and \(\Vert \mathscr {K}_{(1)}[g]+h\Vert ^2\) in (15) and hence estimate the remaining variance term.

Example 1

[Circle map] Let \(\Omega =[0,1]_{\textrm{per}}\) be the periodic interval and consider

$$\begin{aligned} F(\pmb {x},\tau )=\pmb {x}+c+f(\pmb {x})+\tau \,\,\,\,\,\,\textrm{mod}(1), \end{aligned}$$

where \(\Omega _s=[0,1]_{\textrm{per}}\), \(\rho \) is absolutely continuous, and c is a constant. Let \(\psi _j(\pmb {x})=e^{2\pi i j\pmb {x}}\) for \(j\in \mathbb {Z}\). Then

$$\begin{aligned} \mathscr {K}_{(1)}[\psi _j](\pmb {x})=\psi _j(\pmb {x})e^{2\pi i jf(\pmb {x})}e^{2\pi i jc}\int _{\Omega _s} e^{2\pi i j\tau }\,\textrm{d}\rho (\tau ). \end{aligned}$$
(17)

Define the constants

$$\begin{aligned} \alpha _j=e^{2\pi i jc}\int _{\Omega _s} e^{2\pi i j\tau }\,\textrm{d}\rho (\tau ). \end{aligned}$$

Let D be the operator that multiplies each \(\psi _j\) by \(\alpha _j\). Then \(\mathscr {K}_{(1)}=T D\), where T is the Koopman operator corresponding to \(\pmb {x}\mapsto \pmb {x}+f(\pmb {x})\). Since \(\rho \) is absolutely continuous, the Riemann–Lebesgue lemma implies that \(\lim _{|j|\rightarrow \infty }\alpha _j=0\) and hence D is a compact operator. It follows that if T is bounded, then \(\mathscr {K}_{(1)}\) is a compact operator. A straightforward computation using (14) shows that

$$\begin{aligned} \int _{\Omega }\textrm{Var}_{\tau }[\psi _j(F_\tau (\pmb {x}))]\,\textrm{d}\omega (\pmb {x}) = 1-|\alpha _j|^2. \end{aligned}$$
(18)

For example, if \(f=0\), \(\mathscr {K}_{(1)}\) has pure point spectrum with eigenfunctions \(\psi _j\). However, as \(|j|\rightarrow \infty \), the variance converges to one and \(\psi _j\) become less statistically coherent. This example is explored further in Sect. 5.1. \(\square \)
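
The quantities in (17)–(18) are easy to check numerically. The sketch below takes \(f=0\), \(c=0.1\), and \(\rho \) uniform on \([0,0.2]\) (illustrative choices), compares a Monte Carlo estimate of \(\langle \mathscr {K}_{(1)}[\psi _k],\psi _k\rangle \) with the exact eigenvalue \(\alpha _k\), and reports the integrated variance \(1-|\alpha _k|^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
c, sigma, M = 0.1, 0.2, 200_000          # drift and noise level are illustrative

def alpha(k):
    """alpha_k = exp(2*pi*i*k*c) * E[exp(2*pi*i*k*tau)] for tau ~ Uniform[0, sigma]."""
    return np.exp(2j * np.pi * k * c) * (np.exp(2j * np.pi * k * sigma) - 1) / (2j * np.pi * k * sigma)

for k in (1, 2, 5, 10):
    x = rng.random(M)                     # x ~ omega, the Lebesgue measure on [0, 1)
    tau = sigma * rng.random(M)           # tau ~ rho
    psi_y = np.exp(2j * np.pi * k * (x + c + tau))     # psi_k(F_tau(x)) with f = 0
    psi_x = np.exp(2j * np.pi * k * x)
    eig_mc = np.mean(psi_y * np.conj(psi_x))           # Monte Carlo estimate of alpha_k
    print(k, alpha(k), eig_mc, 1.0 - abs(alpha(k)) ** 2)   # last entry: variance in (18)
```

As \(|k|\) grows, \(|\alpha _k|\) decays and the reported variance approaches one, matching the loss of statistical coherency noted above.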

Another immediate application of the variance term is in providing an estimated bound for the Koopman operator prediction of trajectories.

Proposition 3

We have

$$\begin{aligned} \begin{aligned}&\mathbb {P}\left( \left| g\circ F_{\tau _n}\circ \cdots \circ F_{\tau _1}(\pmb {x})-\mathscr {K}^n[g](\pmb {x})\right| \ge a\right) \\&\quad \le \frac{1}{a^2}\text {Var}_{\tau _1,\ldots ,\tau _n} \left[ g\circ F_{\tau _n}\circ \cdots \circ F_{\tau _1}(\pmb {x})\right] \\&\quad =\frac{1}{a^2}\left( \mathscr {K}_{(2)}^n[g\otimes \overline{g}](\pmb {x},\pmb {x})-|\mathscr {K}_{(1)}^n[g](\pmb {x})|^2\right) \end{aligned} \end{aligned}$$
(19)

for any \(a>0\).

Proof

The result follows from combining Proposition 1 and (14) with Chebyshev's inequality. \(\square \)

The bound can be combined with concentration bounds for \(\Psi \tilde{K}^n-\mathscr {K}^n\) (see Sect. 4.2).

3.2 ResDMD in stochastic systems

In the deterministic setting, ResDMD provides an efficient way to evaluate the accuracy of candidate eigenpairs through the computation of an additional matrix L in (12). However, what happens in the stochastic setting?

Suppose that \((\lambda , g)\) is a candidate eigenpair of \(\mathscr {K}_{(1)}\) with \(g\in V_N\). Resembling (10), we consider

$$\begin{aligned} \frac{\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right] }{\Vert g\Vert ^2}. \end{aligned}$$
(20)

We can write the numerator in terms of A, G, and L, i.e.,

$$\begin{aligned} \mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right]&=\pmb {g}^*(L-\lambda A^*-\overline{\lambda }A+|\lambda |^2G)\pmb {g}\\&=\lim _{M\rightarrow \infty }\pmb {g}^*(\tilde{L}-\lambda \tilde{A}^*-\overline{\lambda }\tilde{A}+|\lambda |^2\tilde{G})\pmb {g}. \end{aligned}$$

Hence, we define

$$\begin{aligned} \left[ \textrm{res}^{\textrm{var}}(\lambda ,g)\right] ^2=\frac{\pmb {g}^*\left[ \tilde{L}-\lambda \tilde{A}^*-\overline{\lambda }\tilde{A}+|\lambda |^2\tilde{G}\right] \pmb {g}}{\pmb {g}^*\tilde{G}\pmb {g}}, \end{aligned}$$
(21)

which furnishes an approximation of (20). Setting \(h=-\lambda g\) in Proposition 2, we see that

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right] \\&\quad =\mathbb {E}_{\tau }\left[ \int _{\Omega }|g(F_\tau (\pmb {x}))-\lambda g(\pmb {x})|^2\,\textrm{d}\omega (\pmb {x})\right] \\&\quad =\underbrace{\Vert \mathscr {K}_{(1)}[g]-\lambda g\Vert ^2}_{\text {squared residual}} +\underbrace{\int _{\Omega }\textrm{Var}_{\tau }\left[ g(F_\tau (\pmb {x}))\right] \,\textrm{d}\omega (\pmb {x})}_{\text {integrated variance of}\, g\circ F_\tau }. \end{aligned} \end{aligned}$$
(22)

Thus, \(\Vert g\Vert ^2[\textrm{res}^{\textrm{var}}(\lambda ,g)]^2\) approximates the sum of the squared residual \(\Vert \mathscr {K}[g]-\lambda g\Vert ^2\) and the integrated variance of \(g\circ F_{\tau }\). For stochastic systems, the integrated variance of \(g\circ F_\tau \) is usually nonzero so that

$$\begin{aligned} \lim _{M\rightarrow \infty }\textrm{res}^{\textrm{var}}(\lambda ,g)> \Vert \mathscr {K}_{(1)}[g]-\lambda g\Vert /\Vert g\Vert . \end{aligned}$$
(23)

Based on this notion and drawing an analogy with Definition 1, we make the following definition.

Definition 2

For any \(\lambda \in \mathbb {C}\), define:

$$\begin{aligned} \sigma _{\text {inf}}^{\text {var}} (\lambda )=\inf \left\{ \sqrt{\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right] }: g\in L^2(\Omega ,\omega ),\Vert g\Vert =1\right\} . \end{aligned}$$

For \(\epsilon >0\), we define the variance-\(\epsilon \)-pseudospectrum as

$$\begin{aligned} \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})=\textrm{Cl}\left( \left\{ \lambda \in \mathbb {C}:\sigma _{\textrm{inf}}^{\textrm{var}}(\lambda )<\epsilon \right\} \right) , \end{aligned}$$

where \(\textrm{Cl}\) denotes the closure of a set. Furthermore, we say that g is a variance-\(\epsilon \)-pseudoeigenfunction if there exists \(\lambda \in \mathbb {C}\) such that \(\sqrt{\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau {-}\lambda g\Vert ^2\right] }\le \epsilon \).

Superficially, this definition is a straightforward extension of Definition 1. However, there are some essential differences. Both the conceptual understanding and the computation methods need to be modified.

First, the relation (22) shows that \(\textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\) takes into account uncertainty through the variance term. Hence, the variance-pseudospectrum provides a notion of statistical coherency. Furthermore, comparing Definitions 1 and 2, we have

$$\begin{aligned} \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\subset \textrm{Sp}_{\epsilon }(\mathscr {K}_{(1)}). \end{aligned}$$

If the dynamical system is deterministic, then \(\textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\) is equal to the approximate point \(\epsilon \)-pseudospectrum. However, in the presence of variance, they are no longer equal.

Second, the relation (22) gives a computational surprise. Following the same derivation between (10)–(13), with L, A, and G accordingly adjusted through replacing \(\mathscr {K}\) by \(\mathscr {K}_{(1)}\) in (11)–(12), we can still compute the variance-residual term. However, the original residual itself, \(\textrm{res}(\lambda ,g)\), needs a modification. Recalling (10), in the same spirit of EDMD, if \(g\in V_N\), we write

$$\begin{aligned}&\Vert \mathscr {K}_{(1)}[g]-\lambda g\Vert ^2\\&\quad = \langle \mathscr {K}_{(1)}[g]\,,\mathscr {K}_{(1)}[g]\rangle -\lambda \langle g,\mathscr {K}_{(1)}[g]\rangle \\&\qquad -\bar{\lambda } \langle \mathscr {K}_{(1)}[g],g\rangle + |\lambda |^2\langle g,g\rangle \\&\quad = \pmb {g}^*({H}-\lambda {A}^*-\overline{\lambda }{A}+|\lambda |^2{G})\pmb {g}, \end{aligned}$$

where H is a newly introduced matrix with

$$\begin{aligned} H_{i,j}=\langle \mathscr {K}_{(1)}[\psi _j],\mathscr {K}_{(1)}[\psi _i] \rangle . \end{aligned}$$
(24)

We employ the quadrature rule for the \(\pmb {x}\)-domain to approximate this new term. If \(\texttt {S}\) is batched with \(M_2=2\), then we can form the matrix

$$\begin{aligned} \tilde{H}_{i,j}=\sum _{l=1}^{M_1} w_{l} \psi _j(\pmb {y}^{(l,1)})\overline{\psi _i(\pmb {y}^{(l,2)})}. \end{aligned}$$

Since \(\tau _{l,1}\) and \(\tau _{l,2}\) are independent, we have

$$\begin{aligned} \lim _{M_1\rightarrow \infty } \tilde{H}_{i,j}=H_{i,j}=\langle \mathscr {K}[\psi _j],\mathscr {K}[\psi _i] \rangle . \end{aligned}$$
(25)

We stress that \(\mathscr {K}_{(1)}\) is applied separately to \(\psi _i\) and \(\psi _j\) and thus \(\tau _{l,1}\) and \(\tau _{l,2}\) need to be independent realizations.

The convergence in (25) allows us to compute the spectral properties of \(\mathscr {K}_{(1)}\) directly (see Sect. 3.3). In particular, instead of (13), we now have

$$\begin{aligned} \left[ \textrm{res}(\lambda ,g)\right] ^2=\frac{\pmb {g}^*\left[ \tilde{H}-\lambda \tilde{A}^*-\overline{\lambda }\tilde{A}+|\lambda |^2\tilde{G}\right] \pmb {g}}{\pmb {g}^*\tilde{G}\pmb {g}} \end{aligned}$$
(26)

and the approximate decomposition

$$\begin{aligned} \begin{aligned}&\int _{\Omega }\textrm{Var}_{\tau }\left[ g(F_\tau (\pmb {x}))\right] \,\textrm{d}\omega (\pmb {x})=\pmb {g}^*\left( L-H\right) \pmb {g}\\&\quad \approx \pmb {g}^*\left( \tilde{L}{-}\tilde{H}\right) \pmb {g}=\Vert g\Vert ^2\left( [\textrm{res}^{\textrm{var}}(\lambda ,g)]^2{-}[\textrm{res}(\lambda ,g)]^2\right) , \end{aligned} \end{aligned}$$
(27)

which becomes exact in the large data limit.
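
Given two-batched data (\(M_2=2\)), the extra matrix \(\tilde{H}\) in (25) and the split (27) can be formed directly. The sketch below again uses the toy circle map of Example 1 with \(f=0\) and a Fourier dictionary (hypothetical choices); for \(g=\psi _1\), an exact eigenfunction, the residual is close to zero while the variance-residual approaches the integrated variance.

```python
import numpy as np

rng = np.random.default_rng(2)
c, sigma = 0.1, 0.2
F = lambda x, tau: (x + c + tau) % 1.0               # toy circle map with f = 0
M1, N = 20_000, 8
x = rng.random(M1)                                    # nodes x^(j) with uniform weights 1/M1
y1 = F(x, sigma * rng.random(M1))                     # first independent realization  y^(j,1)
y2 = F(x, sigma * rng.random(M1))                     # second independent realization y^(j,2)

feat = lambda z: np.exp(2j * np.pi * np.outer(z, np.arange(1, N + 1)))   # Fourier dictionary
PX, PY1, PY2 = feat(x), feat(y1), feat(y2)
w = 1.0 / M1                                          # uniform quadrature weights

G = PX.conj().T @ PX * w
A = 0.5 * (PX.conj().T @ PY1 + PX.conj().T @ PY2) * w      # averaged, as in Algorithm 2
L = 0.5 * (PY1.conj().T @ PY1 + PY2.conj().T @ PY2) * w
H = 0.5 * (PY1.conj().T @ PY2 + PY2.conj().T @ PY1) * w    # uses independent realizations, cf. (25)

g = np.zeros(N, dtype=complex); g[0] = 1.0                 # g = psi_1
lam = (g.conj() @ A @ g) / (g.conj() @ G @ g)              # Rayleigh-quotient estimate of its eigenvalue
den = (g.conj() @ G @ g).real
res_var2 = (g.conj() @ (L - lam * A.conj().T - np.conj(lam) * A + abs(lam) ** 2 * G) @ g).real / den
res2     = (g.conj() @ (H - lam * A.conj().T - np.conj(lam) * A + abs(lam) ** 2 * G) @ g).real / den
print(res2, res_var2, res_var2 - res2)   # difference estimates the integrated variance (||g|| = 1), cf. (27)
```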

3.3 Algorithms

In the derivations above, we saw that unbatched (or one-batched) data permits the computation only of \(\textrm{res}^\textrm{var}(\lambda ,g)\), while two-batched data also permits the computation of \(\textrm{res}(\lambda ,g)\). Algorithms 1 and 2 approximate the relative residuals of EDMD eigenpairs in the scenarios of unbatched and batched data, respectively. In Algorithm 2, we have taken an average when computing \(\tilde{A}\) and \(\tilde{L}\) to reduce quadrature error, and an average when computing \(\tilde{H}\) to ensure that it is self-adjoint (and positive semi-definite). Algorithm 3 approximates the pseudospectrum and corresponding pseudoeigenfunctions, given batched snapshot data. Algorithm 4 approximates the variance-pseudospectrum and corresponding variance-pseudoeigenfunctions, and does not need batched data. Note that the computational complexity of all of these algorithms scales in the same way as that of ResDMD, which is discussed in [20, 23]. In particular, Algorithms 1 and 2 scale in the same way as EDMD.

Algorithm 1: Eigenpairs and residuals.

Algorithm 2: Eigenpairs and residuals (batched data).

Algorithm 3: Pseudospectra (batched data).

Algorithm 4: Variance-pseudospectra.
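
As an illustration of the computation underlying Algorithm 4 (in the large data limit it evaluates \(r_j=f_{M,N}(z_j)\) over a grid; see Sect. 4.1), the following sketch computes the minimal variance-residual at each grid point via a Hermitian generalized eigenvalue problem. It assumes \(\tilde{G}\) is positive definite and uses a placeholder grid; it is not the reference implementation from the repository linked in Sect. 1.1.

```python
import numpy as np
from scipy.linalg import eigh

def variance_pseudospectrum(G, A, L, grid, eps):
    """For each z in grid, compute r(z) = min over g of res_var(z, Psi g) as the square root
    of the smallest generalized eigenvalue of (Q(z), G); return grid points with r(z) < eps."""
    Gh = 0.5 * (G + G.conj().T)                 # symmetrize against rounding errors
    accepted = []
    for z in grid:
        Q = L - z * A.conj().T - np.conj(z) * A + abs(z) ** 2 * G
        Qh = 0.5 * (Q + Q.conj().T)
        r = np.sqrt(max(eigh(Qh, Gh, eigvals_only=True)[0], 0.0))
        if r < eps:
            accepted.append((z, r))
    return accepted

# A simple grid over a square containing the unit disc, in the spirit of (28).
s = np.linspace(-1.2, 1.2, 49)
grid = (s[:, None] + 1j * s[None, :]).ravel()
```

Calling this routine with the matrices \(\tilde{G},\tilde{A},\tilde{L}\) returns an approximation of \(\textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\); replacing \(\tilde{L}\) with \(\tilde{H}\) gives the analogous quantity in the same spirit as Algorithm 3.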

4 Theoretical guarantees

We now prove the correctness of the algorithms mentioned above. Specifically, through a series of theorems, we demonstrate that the computations of \(\tilde{A},\tilde{G},\tilde{L}\), and \(\tilde{H}\) are accurate and that the spectral estimates can be trusted. To achieve this, we divide the section into three subsections, focusing in turn on the accuracy of the spectral estimates, the predictive power, and the estimated matrices. The standing assumptions made throughout this section are as follows:

  • \(\mathscr {K}_{(1)}\) is bounded.

  • \(\{\psi _j\}_{j=1}^N\) are linearly independent for any finite N.

  • \(V_N\subset V_{N+1}\) and the union, \(\cup _N V_N\), is dense in \(L^2(\Omega ,\omega )\).

The algorithms and proofs can be readily adapted for an unbounded \(\mathscr {K}_{(1)}\). The latter two assumptions can also be relaxed with minor modifications.

4.1 Accuracy in finding spectral quantities

In this subsection, we prove the convergence of our algorithms. We have already discussed the convergence of residuals in Algorithms 1 and 2, under the assumption of convergence of the finite matrices \(\tilde{G},\tilde{A},\tilde{L}\), and \(\tilde{H}\) in the large data limit. Hence, we focus on Algorithm 4. We first define the functions

$$\begin{aligned} f_{M,N}(\lambda )=\min _{\pmb {g}\in \mathbb {C}^{N}} \textrm{res}^{\textrm{var}}(\lambda ,\Psi \pmb {g}), \end{aligned}$$

and note that \(r_j=f_{M,N}(z_j)\) in Algorithm 4. Our first lemma describes the limit of these functions as \(M\rightarrow \infty \) and \(N\rightarrow \infty \).

Lemma 1

Suppose that

$$\begin{aligned} \lim _{M\rightarrow \infty } \tilde{G}=G,\quad \lim _{M\rightarrow \infty } \tilde{A}=A,\quad \lim _{M\rightarrow \infty } \tilde{L}=L, \end{aligned}$$

then \(f_N(\lambda )=\lim _{M\rightarrow \infty }f_{M,N}(\lambda )\) exists. Moreover, \(f_N\) is a nonincreasing function of N and converges to \(\sigma _{\textrm{inf}}^{\textrm{var}}\) from above and uniformly on compact subsets of \(\mathbb {C}\) as a function of the spectral parameter \(\lambda \).

Proof

The limit \(f_N(\lambda )=\lim _{M\rightarrow \infty }f_{M,N}(\lambda )\) follows trivially from the convergence of matrices. Moreover, we have

$$\begin{aligned} f_N(\lambda )&=\min _{\pmb {g}\in \mathbb {C}^{N}}\sqrt{\frac{\pmb {g}^*(L-\lambda A^*-\overline{\lambda }A+|\lambda |^2G)\pmb {g}}{\pmb {g}^*G\pmb {g}}}\\&=\inf \left\{ \sqrt{\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right] }:g\in V_N,\Vert g\Vert =1\right\} . \end{aligned}$$

Since \({V}_{N}\subset {V}_{N+1}\), \(f_N(\lambda )\) is nonincreasing in N. By definition, we also have

$$\begin{aligned} f_N(\lambda )\ge \sigma _{\textrm{inf}}^{\textrm{var}}(\lambda ). \end{aligned}$$

Let \(\delta >0\) and choose \(g\in L^2(\Omega ,\omega )\) such that \(\Vert g\Vert =1\) and

$$\begin{aligned} \sqrt{\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right] }\le \sigma _{\textrm{inf}}^{\textrm{var}}(\lambda )+\delta . \end{aligned}$$

Since \(\cup _N V_N\) is dense in \(L^2(\Omega ,\omega )\), there exists some n and \(g_{n}\in {V}_{n}\) such that \(\Vert g_n\Vert =1\) and

$$\begin{aligned} \sqrt{\mathbb {E}_{\tau }\left[ \Vert g_n\circ F_\tau -\lambda g_n\Vert ^2\right] }\le \sqrt{\mathbb {E}_{\tau }\left[ \Vert g\circ F_\tau -\lambda g\Vert ^2\right] }+\delta . \end{aligned}$$

It follows that \(f_n(\lambda )\le \sigma _{\textrm{inf}}^{\textrm{var}}(\lambda )+2\delta \). Since this holds for any \(\delta >0\), \(\lim _{N\rightarrow \infty }f_N(\lambda )=\sigma _{\textrm{inf}}^{\textrm{var}}(\lambda )\). Since \(\sigma _{\textrm{inf}}^{\textrm{var}}(\lambda )\) is continuous in \(\lambda \), \(f_N\) converges uniformly down to \(\sigma _{\textrm{inf}}^{\textrm{var}}\) on compact subsets of \(\mathbb {C}\) by Dini’s theorem. \(\square \)

Let \(\{\textrm{Grid}(N)=\{z_{1,N},z_{2,N},\ldots ,z_{k(N),N}\}\}\) be a sequence of grids, each finite, such that for any \(\lambda \in \mathbb {C}\),

$$\begin{aligned} \lim _{N\rightarrow \infty }\textrm{dist}(\lambda ,\textrm{Grid}(N))=0. \end{aligned}$$

For example, we could take

$$\begin{aligned} \textrm{Grid}(N)=\frac{1}{N}\left[ \mathbb {Z}+i\mathbb {Z}\right] \cap \{z\in \mathbb {C}:|z|\le N\}. \end{aligned}$$
(28)

In practice, one considers a grid of points over the region of interest in the complex plane. Lemma 1 tells us that to study Algorithm 4 in the large data limit, we must analyze

$$\begin{aligned} \Gamma ^\epsilon _{N}(\mathscr {K}_{(1)})=\left\{ \lambda \in \textrm{Grid}(N):f_N(\lambda )<\epsilon \right\} . \end{aligned}$$

To make the convergence of Algorithm 4 precise, we use the Attouch–Wets metric defined by [5]:

$$\begin{aligned} d_{\text {AW}} (C_1,C_2)=\sum _{n=1}^{\infty } 2^{-n}\min \big \{1,\underset{\left| x\right| \le n}{\sup }\left| \text {dist} (x,C_1) - \text {dist}(x,C_2)\right| \big \}, \end{aligned}$$

where \(C_1,C_2\) are closed nonempty subsets of \(\mathbb {C}\). This metric corresponds to uniform convergence of the distance functions \(\textrm{dist}(\cdot ,C_n)\) on compact subsets of \(\mathbb {C}\). For any closed nonempty sets C and \(C_n\), \(d_{\textrm{AW}}(C_n,C)\rightarrow {0}\) if and only if for any \(\delta >0\) and \(B_m(0)\) (closed ball of radius \(m\in \mathbb {N}\) about 0), there exists N such that if \(n>N\) then \(C_n\cap B_m(0)\subset {C+B_{\delta }(0)}\) and \(C\cap B_m(0)\subset {C_n+B_{\delta }(0)}\). The following theorem contains our convergence result.

Theorem 1

(Convergence to variance-pseudospectrum) Let \(\epsilon >0\). Then, \(\Gamma ^\epsilon _{N}(\mathscr {K}_{(1)})\subset \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\) and

$$\begin{aligned} \lim _{N\rightarrow \infty }d_{\textrm{AW}}\left( \Gamma ^\epsilon _{N}(\mathscr {K}_{(1)}),\textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\right) =0. \end{aligned}$$

Proof

Lemma 1 shows that \(\Gamma ^\epsilon _{N}(\mathscr {K}_{(1)})\subset \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\). To prove convergence, we use the characterization of the Attouch–Wets topology. Suppose that m is large such that \(B_m(0)\cap \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\ne \emptyset \). Since \(\Gamma ^\epsilon _{N}(\mathscr {K}_{(1)})\subset \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\), we clearly have \(\Gamma _{N}^{\epsilon }(\mathscr {K}_{(1)})\cap B_m(0)\subset \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\). Hence, we must show that given \(\delta >0\), there exists \(n_0\) such that if \(N>n_0\) then \(\textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\cap B_m(0)\subset {\Gamma _{N}^{\epsilon }(\mathscr {K}_{(1)})+B_{\delta }(0)}\). Suppose for a contradiction that this statement is false. Then, there exists \(\delta >0\), \(\lambda _{n_j}\in \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\cap B_m(0)\), and \(n_j\rightarrow \infty \) such that

$$\begin{aligned} \textrm{dist}(\lambda _{n_j},\Gamma _{n_j}^{\epsilon }(\mathscr {K}_{(1)}))\ge \delta . \end{aligned}$$

Without loss of generality, we can assume that \(\lambda _{n_j}\rightarrow \lambda \in \textrm{Sp}_{\epsilon }^{\textrm{var}}(\mathscr {K}_{(1)})\cap B_m(0)\). There exists some z with \(\sigma _{\textrm{inf}}^{\textrm{var}}(z)<\epsilon \) and \(\left| \lambda -z\right| \le \delta /2\). Let \(z_{n_j}\in \textrm{Grid}(n_j)\) such that \(|z-z_{n_j}|\le \textrm{dist}(z,\textrm{Grid}(n_j))+{n_j}^{-1}.\) Since \(\sigma _{\textrm{inf}}^{\textrm{var}}\) is continuous and \(f_N\) converges locally uniformly to \(\sigma _{\textrm{inf}}^{\textrm{var}}\), we must have \(f_{n_j}(z_{n_j})<\epsilon \) for large \(n_j\) so that \(z_{n_j}\in \Gamma _{n_j}^{\epsilon }(\mathscr {K}_{(1)})\). But \( \left| z_{n_j}-\lambda \right| \le \left| z-\lambda \right| +\left| z_{n_j}-z\right| \le \delta /2 + |z-z_{n_j}|, \) which is smaller than \(\delta \) for large \(n_j\), and we reach the desired contradiction. \(\square \)

4.2 Error bounds for iterations

We now aim to bound the difference between \(\tilde{K}^n\) and \(\mathscr {K}^n\), a step crucial for measuring the accuracy of our approximation of the mean trajectories in \(L^2(\Omega ,\omega )\). This effort, in conjunction with the Chebyshev-type bound presented in (19), enables us to compute the statistical properties of the trajectories and their forecasts. Our approach to establishing these bounds is twofold. First, we consider the difference between \(\tilde{K}^n\) and \(\mathscr {K}^n\), taking into account both the estimation errors and the errors intrinsic to the subspace. Subsequently, we establish concentration bounds for the estimation errors of \(\tilde{G}\), \(\tilde{A}\), and \(\tilde{L}\).

Theorem 2

(Error bound for forecasts) Define the quantities

$$\begin{aligned} I_G&=G^{\frac{1}{2}}\tilde{G}^{-\frac{1}{2}},\\ \Delta _G&=\Vert I_G\Vert \Vert (I-I_G^{-1})\Vert +\Vert (I-I_G)\Vert ,\\ \Delta _A&=\Vert \mathscr {K}\Vert (1+\Vert I_G\Vert )\Vert I_G-I\Vert +\Vert I_G\Vert ^2 \Vert G^{-\frac{1}{2}}(A-\tilde{A})G^{- \frac{1}{2}}\Vert . \end{aligned}$$

Let \(g=\sum _{j=1}^N\pmb {g}_j\psi _j\in V_N\) and suppose that

$$\begin{aligned} \Vert \mathscr {K}^n_{(1)}g-\mathscr {P}_{V_N}^*(\mathscr {P}_{V_N}\mathscr {K}_{(1)}\mathscr {P}_{V_N}^*)^ng\Vert \le \delta _n(g)\Vert g\Vert . \end{aligned}$$

Then

$$\begin{aligned} \Vert \Psi \tilde{K}^n\pmb {g}-\mathscr {K}^n_{(1)}g\Vert \le C_n\Vert g\Vert , \end{aligned}$$

where

$$\begin{aligned} C_n=\left[ \frac{\Vert \mathscr {K}\Vert ^n-\Delta _A^n}{\Vert \mathscr {K}\Vert -\Delta _A}\Delta _A(\Delta _G+1)+\Vert \mathscr {K}\Vert ^n\Delta _G+\delta _n(g)\right] . \end{aligned}$$

Proof

We introduce the two matrices

$$\begin{aligned} T=G^{-1/2}AG^{-1/2},\quad \tilde{T}=\tilde{G}^{-1/2}\tilde{A}\tilde{G}^{-1/2}. \end{aligned}$$

Note that

$$\begin{aligned} \Vert T\Vert =\sup _{x\in \mathbb {C}^N}\frac{\Vert TG^{1/2}x\Vert }{\Vert G^{1/2}x\Vert }&=\sup _{x\in \mathbb {C}^N}\frac{\Vert G^{1/2}Kx\Vert }{\Vert G^{1/2}x\Vert }\\&=\Vert \mathscr {P}_{V_{N}}\mathscr {K}\mathscr {P}_{V_{N}}^*\Vert \le \Vert \mathscr {K}\Vert . \end{aligned}$$

We can re-write \(\tilde{T}\) as

$$\begin{aligned} \tilde{T}&=I_G^*G^{-1/2}{\tilde{A}} G^{-1/2}I_G\\&=I_G^*TI_G+I_G^*G^{-1/2}({\tilde{A}}-A)G^{-1/2}I_G\\&=T+(I_G-I)^*TI_G+T(I_G-I)\\&\quad +I_G^*G^{-1/2}({\tilde{A}}-A)G^{-1/2}I_G. \end{aligned}$$

It follows that

$$\begin{aligned} \Vert T-\tilde{T}\Vert&\le \Vert \mathscr {K}\Vert (1+\Vert I_G\Vert )\Vert I_G-I\Vert \\&\quad +\Vert I_G\Vert ^2\Vert G^{-1/2}(A-\tilde{A})G^{-1/2}\Vert \\&=\Delta _A. \end{aligned}$$

We have that

$$\begin{aligned} T^n-\tilde{T}^n=T(T^{n-1}-\tilde{T}^{n-1})+({T}-\tilde{T})\tilde{T}^{n-1}. \end{aligned}$$

A simple proof by induction now shows that

$$\begin{aligned} \Vert T^n-\tilde{T}^n\Vert&\le \Vert {T}-\tilde{T}\Vert \sum _{j=0}^{n-1}\Vert T\Vert ^j\Vert \tilde{T}\Vert ^{n-1-j}\\&\le \Delta _A\sum _{j=0}^{n-1}\Vert \mathscr {K}\Vert ^{j}(\Vert \mathscr {K}\Vert +\Delta _A)^{n-1-j}\\&= \Delta _A\frac{\Vert \mathscr {K}\Vert ^n-\Delta _A^n}{\Vert \mathscr {K}\Vert -\Delta _A}. \end{aligned}$$

We wish to bound the quantity

$$\begin{aligned}&\Vert \Psi K^n\pmb {g}-\Psi \tilde{K}^n\pmb {g}\Vert =\Vert {T}^n{G}^{1/2}\pmb {g}-I_G\tilde{T}^n\tilde{G}^{1/2}\pmb {g}\Vert \\&\quad \le \Vert {T}^n-\tilde{T}^n\Vert \Vert g\Vert +\Vert \tilde{T}^n{G}^{1/2}\pmb {g}-I_G\tilde{T}^n\tilde{G}^{1/2}\pmb {g}\Vert . \end{aligned}$$

We can express the final term on the right-hand side as

$$\begin{aligned} \tilde{T}^n{G}^{1/2}\pmb {g}-I_G{\tilde{T}}^n\tilde{G}^{1/2}\pmb {g}&=I_G\tilde{T}^n(I-I_G^{-1}){G}^{1/2}\pmb {g}\\&\quad +(I-I_G){\tilde{T}}^n{G}^{1/2}\pmb {g}. \end{aligned}$$

It follows that

$$\begin{aligned}&\Vert \tilde{T}^n{G}^{1/2}\pmb {g}-I_G\tilde{T}^n\tilde{G}^{1/2}\pmb {g}\Vert \le \Vert \tilde{T}^n\Vert \Vert {G}^{1/2}\pmb {g}\Vert \Delta _G\\&\quad \le \left( \Vert \mathscr {K}\Vert ^n+\Vert {T}^n-\tilde{T}^n\Vert \right) \Delta _G\Vert g\Vert \end{aligned}$$

and hence that

$$\begin{aligned}&\Vert \Psi K^n\pmb {g}{-}\Psi \tilde{K}^n\pmb {g}\Vert {\le } \left[ \Vert {T}^n{-}\tilde{T}^n\Vert (\Delta _G{+}1){+}\Vert \mathscr {K}\Vert ^n\Delta _G\right] \Vert g\Vert \\&\quad \le \left[ \frac{\Vert \mathscr {K}\Vert ^n-\Delta _A^n}{\Vert \mathscr {K}\Vert -\Delta _A}\Delta _A(\Delta _G+1)+\Vert \mathscr {K}\Vert ^n\Delta _G\right] \Vert g\Vert . \end{aligned}$$

The theorem now follows from the triangle inequality.

\(\square \)

This theorem explicitly tells us how much to trust the prediction using the computed Koopman matrix, compared with the true Koopman operator. The quantities \(\Delta _G\) and \(\Delta _A\) represent errors due to estimation or quadrature. They are both expected to be small. The quantity \(\delta _n(g)\) is an intrinsic invariant subspace error that depends on the dictionary and observable g. To approximate \(\delta _n(g)\), note that

$$\begin{aligned} \mathscr {K}^{n}[g]{-}\Psi K^n\pmb {g}{=}\sum _{j=1}^n\mathscr {K}^{n-j}[\mathscr {K}[\Psi K^{j-1}\pmb {g}]{-}\Psi K^j \pmb {g}] \end{aligned}$$

and hence

$$\begin{aligned} \Vert \mathscr {K}^{n}[g]{-}\Psi K^n\pmb {g}\Vert {\le }\sum _{j=1}^n\Vert \mathscr {K}\Vert ^{n{-}j}\Vert \mathscr {K}[\Psi K^{j{-}1}\pmb {g}]{-}\Psi K^j \pmb {g}\Vert . \end{aligned}$$
(29)

To bound the term on the right-hand side, we can use the matrix H in (24) and the fact that

$$\begin{aligned} \Vert \mathscr {K}\Psi \pmb {v}{-}\Psi Kv\Vert =\sqrt{\pmb {v}^*H\pmb {v}{-}2\textrm{Re}(\pmb {v}^*K^*A\pmb {v}){+}\pmb {v}^*K^*GK\pmb {v}} \end{aligned}$$
(30)

for any \(\pmb {v}\in \mathbb {C}^N\).
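
With data-driven estimates of G, A, and H in hand, the right-hand sides of (29) and (30) are directly computable. The following sketch evaluates them for a given coefficient vector; it assumes a bound on \(\Vert \mathscr {K}_{(1)}\Vert \) is supplied as the parameter K_norm (for instance, 1 when \(\omega \) is an invariant measure).

```python
import numpy as np

def dictionary_error(v, K, G, A, H):
    """|| K_(1)[Psi v] - Psi K v ||, evaluated via (30) from the matrices G, A, H."""
    Kv = K @ v
    val = (v.conj() @ H @ v
           - 2.0 * np.real(v.conj() @ K.conj().T @ A @ v)
           + Kv.conj() @ G @ Kv)
    return float(np.sqrt(max(np.real(val), 0.0)))

def subspace_forecast_bound(g, K, G, A, H, n, K_norm=1.0):
    """Right-hand side of (29): a bound on || K_(1)^n[g] - Psi K^n g ||."""
    total, v = 0.0, np.array(g, dtype=complex)
    for j in range(1, n + 1):            # v holds K^(j-1) g at the start of iteration j
        total += K_norm ** (n - j) * dictionary_error(v, K, G, A, H)
        v = K @ v
    return total
```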

4.3 Estimation error for computation of A, G, and L

To effectively estimate \(\mathscr {K}_{(1)}g\) and \(\textrm{Sp}_\epsilon ^\textrm{var}(\mathscr {K}_{(1)})\) in practical applications, it is imperative to have reliable approximations of A, G, and L. We provide a justification for our ability to construct such approximations from trajectory data with high probability, employing concentration bounds. The subsequent result delineates the requisite number of samples and basis functions needed to achieve a desired level of accuracy with high probability. To ensure this level of accuracy, several reasonable assumptions about the stochastic dynamical system are necessary.

Assumption 1

We suppose that \(\pmb {x}^{(m)}\) in the snapshot data are sampled at random according to \(\omega \), independent of \(\tau \), and for simplicity, assume that \(\omega \) is a probability measure. We assume that \(\tau :\Omega _s\rightarrow \mathscr {H}\) for some Hilbert space \(\mathscr {H}\) and let \(\kappa =(\pmb {x},\tau )\). In this section, \(\mathbb {E}\) and \(\mathbb {P}\) are with respect to the joint distribution of \(\kappa \). We assume that

  • The random variable \(\kappa \) is sub-Gaussian, meaning that there exists some \(a>0\) such that

    $$\begin{aligned} \mathbb {E}\left[ e^{\Vert \kappa -\mathbb {E}(\kappa )\Vert ^2/a^2}\right] <\infty . \end{aligned}$$

    This allows us to define the following finite quantity:

    $$\begin{aligned} \Upsilon =\inf \left\{ s>0:e^{\frac{\mathbb {E}[\Vert \kappa -\mathbb {E}(\kappa )\Vert ^2]}{s^2}}\mathbb {E}\left[ e^{\frac{1}{s^2}\Vert \kappa -\mathbb {E}(\kappa )\Vert ^2}\right] \le 2\right\} . \end{aligned}$$
  • The dictionary functions are uniformly bounded and satisfy the following Lipschitz condition:

    $$\begin{aligned} |\psi _k(\pmb {x})-\psi _k(\pmb {x}')|\le c_k\Vert \pmb {x}-\pmb {x}'\Vert . \end{aligned}$$
  • The function F is Lipschitz with

    $$\begin{aligned} \Vert F(\kappa )-F(\kappa ')\Vert \le c\Vert \kappa -\kappa '\Vert . \end{aligned}$$

With these assumptions, we can show that our approximations of A, G, and L are good with high probability.

Theorem 3

(Concentration bound on estimation errors) Under Assumption 1 we have, for any \(t>0\),

$$\begin{aligned}&\mathbb {P}\left( \Vert \tilde{A}{-} A\Vert _{\textrm{Fr}}< t\right) {\ge }1{-}\exp \left( 2\log (2N){-}\frac{Mt^2}{24\Upsilon ^2(c^2{+}1)\alpha ^2\beta ^2}\right) \\&\mathbb {P}\left( \Vert \tilde{G}{-} G\Vert _{\textrm{Fr}}< t\right) {\ge }1{-}\exp \left( 2\log (2N){-}\frac{Mt^2}{48\Upsilon ^2\alpha ^2\beta ^2}\right) \\&\mathbb {P}\left( \Vert \tilde{L}{-} L\Vert _{\textrm{Fr}}< t\right) {\ge }1{-}\exp \left( 2\log (2N){-}\frac{Mt^2}{48\Upsilon ^2c^2\alpha ^2\beta ^2}\right) , \end{aligned}$$

where \(\Vert \cdot \Vert _{\textrm{Fr}}\) denotes the Frobenius norm, and \(\alpha \) and \(\beta \) are given by

$$\begin{aligned} \alpha =\sqrt{\sum _{k=1}^Nc_k^2},\quad \beta =\sqrt{\sum _{k=1}^N\Vert \psi _k\Vert _{L^\infty }^2}. \end{aligned}$$

Proof

We first argue for \(\Vert \tilde{A}-A\Vert _{\textrm{Fr}}\). Fix \(j,k\in \{1,\ldots ,N\}\) and define the random variable

$$\begin{aligned} X=\psi _k(F(\pmb {x},\tau ))\overline{\psi _j(\pmb {x})}. \end{aligned}$$

Then

$$\begin{aligned} \left| X(\kappa )-X(\kappa ')\right| \le (c_kc\Vert \psi _j\Vert _{L^\infty }+c_j\Vert \psi _k\Vert _{L^\infty })\Vert \kappa -\kappa '\Vert . \end{aligned}$$

Let \(c_{j,k}=c_kc\Vert \psi _j\Vert _{L^\infty }+c_j\Vert \psi _k\Vert _{L^\infty }\). The above Lipschitz bound for X implies that

$$\begin{aligned} \left| \mathbb {E}[X]-X(\kappa ')\right|&\le c_{j,k}\int _{\Omega \times \Omega _s}\Vert \kappa -\kappa '\Vert \,\textrm{d} \mathbb {P}(\kappa )\\&\le c_{j,k}\sqrt{\Vert \kappa '-\mathbb {E}(\kappa )\Vert ^2+\mathbb {E}(\Vert \kappa -\mathbb {E}(\kappa )\Vert ^2)}, \end{aligned}$$

where we have used Hölder’s inequality to derive the last line. It follows that

$$\begin{aligned} \mathbb {E}\left[ \exp \left( \frac{\left| \mathbb {E}[X]-X\right| ^2}{\Upsilon ^2c_{j,k}^2}\right) \right] \le 2. \end{aligned}$$

Let \(Y=\textrm{Re}\left( \mathbb {E}\left[ X\right] -X\right) \) and \(\lambda \ge 0\). Since \(\mathbb {E}[Y]=0\), we have

$$\begin{aligned} \mathbb {E}\left[ \exp \left( \lambda Y\right) \right]= & {} 1+\sum _{l=2}^\infty \frac{\lambda ^l\mathbb {E}[Y^l]}{l!}\\\le & {} 1+\frac{\lambda ^2}{2}\mathbb {E}\left[ Y^2\exp (\lambda |Y|)\right] . \end{aligned}$$

For any \(b>0\), we have \(\lambda |Y|\le \lambda ^2/(2b)+b|Y|^2/2\). We also have \(bY^2\le \exp (bY^2/2)\). It follows that

$$\begin{aligned} \mathbb {E}\left[ \exp \left( \lambda Y\right) \right] \le 1+\frac{\lambda ^2}{2b}e^{\lambda ^2/(2b)}\mathbb {E}\left[ \exp (bY^2)\right] . \end{aligned}$$

We select \(b=1/(\Upsilon ^2c_{j,k}^2)\) and use the fact that \(\mathbb {E}\left[ \exp (bY^2)\right] \le \mathbb {E}\left[ \exp (b|\mathbb {E}[X]-X|^2)\right] \le 2\) to obtain

$$\begin{aligned} \mathbb {E}\left[ \exp \left( \lambda Y\right) \right] \le 1+\frac{\lambda ^2}{b}e^{\frac{\lambda ^2}{2b}} \le \left( 1+\frac{\lambda ^2}{b}\right) e^{\frac{\lambda ^2}{2b}} \le e^{\frac{3\lambda ^2}{2b}}. \end{aligned}$$

Now let \(\{Y^{(m)}\}_{m=1}^{M}\) be independent copies of Y. Then

$$\begin{aligned}&\mathbb {P}\left( \frac{1}{M}\sum _{m=1}^{M}Y^{(m)}\ge t\right) \\&\quad =\mathbb {P}\left( \exp (\lambda \sum _{m=1}^{M}Y^{(m)})\ge \exp (\lambda Mt) \right) \\&\quad \le e^{-\lambda Mt}\mathbb {E}\left[ \exp \left( \lambda \sum _{m=1}^{M}Y^{(m)}\right) \right] \\&\quad = e^{-\lambda Mt}\prod _{m=1}^{M}\mathbb {E}\left[ \exp \left( \lambda Y\right) \right] \\&\quad \le \exp \left( 3M\lambda ^2/(2b)-\lambda M t\right) , \end{aligned}$$

where the first inequality is Markov’s inequality. Minimizing over \(\lambda \) (the minimum is attained at \(\lambda =bt/3\)), we obtain

$$\begin{aligned} \mathbb {P}\left( \frac{1}{M}\sum _{m=1}^{M}Y^{(m)}\ge t\right) \le \exp \left( -Mbt^2/6\right) . \end{aligned}$$

We can argue in the same manner for \(-Y\) and deduce that

$$\begin{aligned} \mathbb {P}\left( \frac{1}{M}\left| \sum _{m=1}^{M}Y^{(m)}\right| \ge t\right) \le 2\exp \left( -Mbt^2/6\right) . \end{aligned}$$

Similarly, we can argue for the imaginary part of \(\mathbb {E}[X]-X\).

We now allow j and k to vary and let \(X_{j,k}=\psi _k(F(\pmb {x},\tau ))\overline{\psi _j(\pmb {x})}\). For \(t>0\), consider the events

$$\begin{aligned} S_{j,k,1}&:\frac{1}{M}\left| \sum _{m=1}^{M}\textrm{Re}\left( \mathbb {E}[X_{j,k}]-X_{j,k}(\kappa _m)\right) \right| \\&\quad< \frac{t\Upsilon c_{j,k}}{\sqrt{2\Upsilon ^2 \sum _{l,p=1}^Nc_{l,p}^2}},\\ S_{j,k,2}&:\frac{1}{M}\left| \sum _{m=1}^{M}\textrm{Im}\left( \mathbb {E}[X_{j,k}]-X_{j,k}(\kappa _m)\right) \right| \\&\quad < \frac{t\Upsilon c_{j,k}}{\sqrt{2\Upsilon ^2 \sum _{l,p=1}^Nc_{l,p}^2}}. \end{aligned}$$

Then

$$\begin{aligned} \mathbb {P}(\cap _{j,k,i}S_{j,k,i})&\ge 1 - \sum _{j,k=1}^N (\mathbb {P}(S_{j,k,1}^c)+\mathbb {P}(S_{j,k,2}^c))\\&\ge 1-4N^2\exp \left( -\frac{Mt^2}{12\Upsilon ^2\sum _{l,p=1}^Nc_{l,p}^2}\right) . \end{aligned}$$

Moreover, the AM-GM inequality implies that

$$\begin{aligned} c_{j,k}^2\le 2c^2c_k^2\Vert \psi _j\Vert _{L^\infty }^2+2c_j^2\Vert \psi _k\Vert _{L^\infty }^2 \end{aligned}$$

and hence

$$\begin{aligned} \sum _{l,p=1}^Nc_{l,p}^2\le 2(c^2+1)\alpha ^2\beta ^2. \end{aligned}$$

It follows that

$$\begin{aligned} \mathbb {P}(\cap _{j,k,i}S_{j,k,i})\ge 1-\exp \left( 2\log (2N)-\frac{Mt^2}{24\Upsilon ^2(c^2+1)\alpha ^2\beta ^2}\right) . \end{aligned}$$

If the event \(\cap _{j,k,i}S_{j,k,i}\) occurs, then the real and imaginary parts of each entry \(\tilde{A}_{j,k}-A_{j,k}\) have magnitude less than \(tc_{j,k}/\sqrt{2\sum _{l,p=1}^Nc_{l,p}^2}\), and summing \(|\tilde{A}_{j,k}-A_{j,k}|^2\) over j and k gives \(\Vert \tilde{A}-A\Vert _{\textrm{Fr}}< t\). We can argue in the same manner, without the function F, to deduce that

$$\begin{aligned} \mathbb {P}(\Vert \tilde{G}-G\Vert _{\textrm{Fr}}< t)\ge 1-\exp \left( 2\log (2N)-\frac{Mt^2}{48\Upsilon ^2\alpha ^2\beta ^2}\right) . \end{aligned}$$

Finally, for the matrix L and its estimate \(\tilde{L}\), we derive similar concentration bounds for \(\psi _k(F(\pmb {x},\tau ))\overline{\psi _j(F(\pmb {x},\tau ))}\) to see that

$$\begin{aligned} \mathbb {P}(\Vert \tilde{L}-L\Vert _{\textrm{Fr}}< t)\ge 1-\exp \left( 2\log (2N)-\frac{Mt^2}{48\Upsilon ^2c^2\alpha ^2\beta ^2}\right) . \end{aligned}$$

The statement of the theorem now follows. \(\square \)
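For concreteness, the matrices estimated in Theorem 3 can be assembled from snapshot pairs as plain Monte Carlo averages of the quantities appearing in the proof, e.g. \(A_{j,k}=\mathbb {E}[\psi _k(F(\pmb {x},\tau ))\overline{\psi _j(\pmb {x})}]\), and analogously for G and L. The sketch below is schematic (equal quadrature weights and the helper name are our assumptions, not the algorithm used in the examples).

```python
import numpy as np

def estimate_matrices(psi, X, Y):
    """Monte Carlo estimates of A, G and L from M snapshot pairs.

    psi : callable mapping an (M, d) array of states to the (M, N) array
          of dictionary evaluations [psi_1, ..., psi_N].
    X   : (M, d) array of states x^{(m)} drawn from omega.
    Y   : (M, d) array of images y^{(m)} = F(x^{(m)}, tau_m).
    """
    PsiX, PsiY = psi(X), psi(Y)
    M = X.shape[0]
    A = PsiX.conj().T @ PsiY / M   # A_{jk} ~ mean of psi_k(F(x,tau)) * conj(psi_j(x))
    G = PsiX.conj().T @ PsiX / M   # G_{jk} ~ mean of psi_k(x)        * conj(psi_j(x))
    L = PsiY.conj().T @ PsiY / M   # L_{jk} ~ mean of psi_k(F(x,tau)) * conj(psi_j(F(x,tau)))
    return A, G, L
```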

This theorem explicitly spells out the number of basis functions and samples required to approximate the three matrices appearing in Theorem 2. Roughly speaking, if we set

$$\begin{aligned} \exp \left( 2\log (2N)-{Mt^2}\right) \sim N^2\exp \left( -Mt^2\right) \le \delta , \end{aligned}$$

then

$$\begin{aligned} M\sim |\ln {\delta }-2\ln {N}|/{t^2}. \end{aligned}$$

For any fixed tolerance t, the failure probability decreases exponentially as the number of samples M increases. The idea is the same as for other concentration-inequality bounds: repeated sampling from the same distribution drives the sample mean toward the true mean, and the bound quantifies the tail probability. Increasing N, on the other hand, means that more matrix entries must be approximated; this enters the bound only logarithmically, and a modest increase in M compensates for it.
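As a worked instance of this scaling, the first bound in Theorem 3 gives \(M\ge 24\Upsilon ^2(c^2+1)\alpha ^2\beta ^2\bigl (2\ln (2N)+\ln (1/\delta )\bigr )/t^2\) for a failure probability of at most \(\delta \). A small helper makes the dependence explicit (the numerical values passed at the end are placeholders, not quantities from the examples):

```python
import math

def samples_needed(N, t, delta, Upsilon, c, alpha, beta):
    """Smallest M with exp(2 log(2N) - M t^2 / (24 Upsilon^2 (c^2+1) alpha^2 beta^2)) <= delta,
    i.e. the bound for A in Theorem 3 holds with confidence at least 1 - delta."""
    const = 24.0 * Upsilon**2 * (c**2 + 1.0) * alpha**2 * beta**2
    return math.ceil(const * (2.0 * math.log(2.0 * N) + math.log(1.0 / delta)) / t**2)

# Placeholder values: N = 41 dictionary functions, tolerance t = 0.1, confidence 99%.
print(samples_needed(N=41, t=0.1, delta=0.01, Upsilon=1.0, c=1.0, alpha=1.0, beta=1.0))
```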

5 Examples

We now present three examples. The first two are based on numerically sampled trajectory data, while the final example utilizes collected experimental data.

5.1 Arnold’s circle map

For our first example, we revisit the circle map discussed in Example 1, setting \(c=1/5\), \(\rho \) as the uniform distribution on [0, 1], and defining

$$\begin{aligned} f(\pmb {x})=\frac{1}{4\pi }\sin (2\pi \pmb {x}). \end{aligned}$$

Our dictionary consists of Fourier modes \(\{\exp (ij\pmb {x}):j=-n,\ldots ,n\}\) with \(n=20\) (yielding \(N=41\)), and we use batched trajectory data with \(M_1=100\) equally spaced \(\{\pmb {x}^{(j)}\}\), and \(M_2=2\times 10^4\). Figure 2 illustrates the convergence of the matrices \(\tilde{A},\tilde{L}\), and \(\tilde{H}\). We do not display the convergence of \(\tilde{G}\) as its error was on the order of machine precision, a result of the exponential convergence achieved by the trapezoidal quadrature rule across different batches. Figure 3 shows the residuals computed using Algorithm 2. The quantity \(\textrm{res}^{\textrm{var}}(\lambda ,g)\) deviates from (18) (the formula for \(f=0\)), particularly when \(|\lambda |\) is small. As n increases, the residuals \(\textrm{res}(\lambda ,g)\) converge to zero, indicating more accurate computation of the spectral content of \(\mathscr {K}_{(1)}\). However, the residuals \(\textrm{res}^{\textrm{var}}(\lambda ,g)\) converge to finite positive values, except for the trivial eigenvalue 1, which satisfies \(\lim _{M\rightarrow \infty }\textrm{res}^{\textrm{var}}(\lambda ,g)=0\).
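As an illustration of how the dictionary and the batched estimates above can be assembled, consider the following sketch. It is not the code behind Figs. 2–5: the Fourier modes are written here as \(e^{2\pi ij\pmb {x}}\) on [0, 1), and the placeholder map F (including the additive way the noise enters it) is an assumption standing in for the precise map of Example 1.

```python
import numpy as np

c, n = 0.2, 20
js = np.arange(-n, n + 1)                  # Fourier modes j = -n, ..., n  (N = 41)

def f(x):
    return np.sin(2 * np.pi * x) / (4 * np.pi)

def F(x, tau):
    # Hypothetical stand-in for the stochastic circle map of Example 1:
    # the additive form of the noise below is an assumption, not the paper's map.
    return (x + c * tau + f(x)) % 1.0

def psi(x):
    return np.exp(2j * np.pi * np.outer(x, js))   # (len(x), N) dictionary matrix

M1, M2 = 100, 20_000
x = (np.arange(M1) + 0.5) / M1             # M1 equally spaced states on [0, 1)
taus = np.random.default_rng(0).random(M2) # M2 i.i.d. samples of tau ~ Uniform[0, 1]

PsiX = psi(x)
G_tilde = PsiX.conj().T @ PsiX / M1        # trapezoidal rule on the equispaced grid
A_tilde = np.zeros((2 * n + 1, 2 * n + 1), dtype=complex)
for tau in taus:                           # average over the noise batch for every state
    A_tilde += PsiX.conj().T @ psi(F(x, tau)) / (M1 * M2)

K = np.linalg.solve(G_tilde, A_tilde)      # EDMD matrix approximating the action of K_(1)
eigvals = np.linalg.eigvals(K)
```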

To underscore the significance of variance in our analysis, Fig. 4 displays the absolute value of the matrix \(\tilde{L}-\tilde{H}\), which approximates the covariance matrix defined in (16). Notably, the covariance vanishes for the constant function \(\exp (ij\pmb {x})\) with \(j=0\), and the matrix is diagonally dominant. Figure 5 presents the results obtained from applying Algorithms 3 and 4. These results align in areas where the variance is minimal (large \(|\lambda |\)). However, in regions where \(|\lambda |\) is small, the variance component in (27) becomes significant. This observation suggests that only about seven eigenpairs are meaningful in a statistically coherent sense.

Fig. 2

Estimation error for the matrices \(\tilde{A},\tilde{L}\) and \(\tilde{H}\) for the circle map. The solid line shows the expected Monte-Carlo convergence rate

Fig. 3

Residuals for the circle map computed using Algorithm 2

Fig. 4

Absolute values of the matrix \(\tilde{L}-\tilde{H}\) for the circle map. This difference corresponds to the covariance matrix in (16)

Fig. 5

Pseudospectra versus variance pseudospectra. Left: Output of Algorithm 3 for the circle map. Right: Output of Algorithm 4 for the circle map. We have shown the minimized residuals over a contour plot of \(\epsilon \) in both cases. The red dots correspond to the EDMD eigenvalues

5.2 Stochastic Van der Pol oscillator

We now consider the stochastic differential equation

$$\begin{aligned} \textrm{d} X_1&= X_2 \textrm{d}t\\ \textrm{d}X_2&= \left[ \mu (1-X_1^2)X_2-X_1\right] \textrm{d}t +\sqrt{2\delta }\textrm{d} B_t, \end{aligned}$$

where \(B_t\) denotes standard one-dimensional Brownian motion, \(\delta >0\), and \(\mu >0\).Footnote 7 This equation represents a noisy version of the Van der Pol oscillator. In the absence of noise, the Van der Pol oscillator exhibits a limit cycle to which all initial conditions converge, except for the unstable fixed point at the origin. The introduction of noise transforms the system, resulting in a global attractor that forms a band around the deterministic system’s limit cycle.

Fig. 6

Pseudospectra versus variance pseudospectra. Left: Output of Algorithm 3 for the stochastic Van der Pol oscillator. Right: Output of Algorithm 4 for the stochastic Van der Pol oscillator. We have shown the minimized residuals over a contour plot of \(\epsilon \) in both cases. The red dots correspond to the EDMD eigenvalues

Table 1 Computed eigenvalues of the stochastic Van der Pol oscillator, and the residuals computed using Algorithm 2. We have ordered them according to perturbations of \(\hat{\lambda }_{m,k}\). Due to conjugate symmetry, we have only shown eigenvalues with non-negative imaginary parts

The generator of the stochastic solutions, known as the backward Kolmogorov operator, is described in [25, Section 9.3]. It is a second-order elliptic type differential operator \(\mathscr {L}\), defined by

$$\begin{aligned} {[}\mathscr {L}g](X_1,X_2)&= \begin{pmatrix} X_2\\ \mu (1-X_1^2)X_2-X_1 \end{pmatrix} \cdot \nabla g(X_1,X_2)\\&\quad +\delta \frac{\partial ^2 g}{\partial X_2^2}(X_1,X_2). \end{aligned}$$

For a discrete time step \(\Delta _t\), the Koopman operator is given by \(\exp (\Delta _t \mathscr {L})\). In the absence of noise (\(\delta =0\)), the Koopman operator has eigenvalues forming a lattice [53, Theorem 13]:

$$\begin{aligned} \left\{ \hat{\lambda }_{m,k}=\exp ([-m\mu + ik\omega _0]\Delta _t):k\in \mathbb {Z},m\in \mathbb {N}\cup \{0\}\right\} , \end{aligned}$$

where \(\omega _0\approx 1-\mu ^2/16\) is the base frequency of the limit cycle [74]. When \(\delta \) is moderate, the base frequency of the averaged limit cycle remains similar to that in the deterministic case [45].
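For later comparison with the computed eigenvalues, this reference lattice is straightforward to tabulate. A minimal sketch with the parameter values used below (\(\mu =0.5\), \(\Delta _t=0.3\)) and a truncated range of m and k, chosen here only for illustration:

```python
import numpy as np

mu, Dt = 0.5, 0.3
omega0 = 1.0 - mu**2 / 16.0                          # base frequency of the limit cycle
m, k = np.meshgrid(np.arange(0, 4), np.arange(-4, 5))
lattice = np.exp((-m * mu + 1j * k * omega0) * Dt)   # reference eigenvalues lambda_{m,k}
```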

We simulate the dynamics using the Euler–Maruyama method [65] with a time step of \(3\times 10^{-3}\). Data are collected along a single trajectory of length \(M_1=10^6\) with \(M_2=2\), starting the sampling after the trajectory reaches the global attractor. We employ 318 Laplacian radial basis functions with centers on the attractor as our dictionary. The parameters are set to \(\mu =0.5\), \(\delta =0.02\), and \(\Delta _t=0.3\).
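A minimal sketch of this data collection is given below. A shorter trajectory and a fixed seed are used so the snippet runs quickly, and since the role of the \(M_2=2\) realizations per state is not restated in this section, the sketch simply records consecutive snapshot pairs along the trajectory.

```python
import numpy as np

mu, delta = 0.5, 0.02
dt, Dt = 3e-3, 0.3
steps_per_snapshot = round(Dt / dt)        # 100 Euler-Maruyama steps per Koopman step
M1 = 10_000                                # the example uses 10^6 snapshots; fewer here
rng = np.random.default_rng(0)

def em_step(x1, x2):
    # One Euler-Maruyama step of dX1 = X2 dt, dX2 = [mu(1-X1^2)X2 - X1] dt + sqrt(2 delta) dB.
    dB = rng.normal(scale=np.sqrt(dt))
    return x1 + x2 * dt, x2 + (mu * (1.0 - x1**2) * x2 - x1) * dt + np.sqrt(2.0 * delta) * dB

x1, x2 = 2.0, 0.0
for _ in range(50 * steps_per_snapshot):   # burn-in so sampling starts near the attractor
    x1, x2 = em_step(x1, x2)

snapshots = np.empty((M1 + 1, 2))
snapshots[0] = x1, x2
for m in range(1, M1 + 1):
    for _ in range(steps_per_snapshot):
        x1, x2 = em_step(x1, x2)
    snapshots[m] = x1, x2
# Consecutive rows of `snapshots` are pairs separated by Dt and feed the
# estimators of A, G, and L from Theorem 3.
```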

Figure 6 displays the results obtained using Algorithms 3 and 4. Similar to observations from the circle map example, \(\textrm{Sp}_\epsilon (\mathscr {K}_{(1)})\) and \(\textrm{Sp}_\epsilon ^\textrm{var}(\mathscr {K}_{(1)})\) exhibit greater similarity near the unit circle. The lattice-like structure in the eigenvalues is also evident, with the EDMD-computed eigenvalues appearing as perturbations of the set \(\{\hat{\lambda }_{m,k}\}\). Table 1 lists some of these eigenvalues alongside the residuals calculated using Algorithm 2. We observe that as |k| increases, \(\textrm{res}(\lambda ,g)\) also increases, and similarly, \(\textrm{res}^{\textrm{var}}(\lambda ,g)\) increases with m. For any given eigenvalue, \(\textrm{res}(\lambda ,g)\) decreases to zero with larger dictionaries. In contrast, \(\textrm{res}^{\textrm{var}}(\lambda ,g)\) approaches a finite nonzero value, except for the trivial eigenvalue, which has a constant eigenfunction exhibiting zero variance. Figure 7 illustrates the corresponding eigenfunctions on the attractor, showcasing their beautiful modal structure.

In this example, the norm of the Koopman operator \(\Vert \mathscr {K}\Vert \) is approximately 1, and the subspace error \(\delta _n(g)\) predominantly contributes to the bound established in Theorem 2. We analyze the two observables \(X_1\) and \(X_2\), each starting from a point randomly selected on the attractor. Figure 8 presents the calculated values of \(\delta _n(X_1)\) and \(\delta _n(X_2)\) as per (29) and (30), along with the variance of the trajectory. Additionally, Fig. 9 compares the values computed using \(K^nX_i\) with the actual values of \(\mathscr {K}^nX_i\), obtained by integrating the generator \(\mathscr {L}\). Together, these figures demonstrate the convergence of the mean trajectories toward the dominant subspace of \(\mathscr {K}\).

5.3 Neuronal population dynamics

As a final example, we apply our approach to experimental neuroscience data. Recent technological advancements in this field now allow for the simultaneous monitoring of large neuronal populations in the brains of awake, behaving animals. This development has spurred significant interest in employing data-driven methods to derive physically meaningful insights from high-dimensional neural measurements [62].

To analyze complex neural data, researchers have employed a variety of analytical tools to uncover features like low-dimensional manifolds, latent population dynamics, within-trial variance, and trial-to-trial variability. However, existing methods often examine these features in isolation [16, 29, 61, 73]. From a dynamical systems perspective, a unified model that captures these distinct aspects of neural data would be highly advantageous. In this context, the Koopman operator framework offers a compelling approach to analyzing high-dimensional neural observables [47]. DMD has emerged as a prominent method for the spatiotemporal decomposition of diverse datasets [9, 14]. Nevertheless, a limitation of DMD is its lack of explicit uncertainty quantification regarding the modes and forecasts it uncovers. This aspect is particularly vital in neural time series analysis, where it is challenging to identify physically meaningful spectral components [28].

Our framework offers a unified, data-driven solution to uncover validated latent dynamical modes and their associated variance in neural data. To demonstrate its efficacy, we applied it to high-dimensional neuronal recordings from the visual cortex of awake mice, as publicly shared by the Allen Brain Observatory [71], involving 400–800 neurons per mouse. Our focus was on the “Drifting Gratings” task epoch, wherein mice were presented with gratings drifting in one of eight directions (0\(^{\circ }\), 45\(^{\circ }\), etc.), modulated sinusoidally at one of five temporal frequencies. We specifically analyzed responses to gratings modulated at 15 Hz across all eight directions, as these stimuli consistently elicited an identifiable eigenvalue in the neural data corresponding to the expected frequency. This analysis encompassed 120 trials per mouse (stimulus duration of 2 s) for a total of 20 mice, as detailed in [71]. We computed distinct stochastic Koopman operators for 15 different arousal levels, categorized by the average pupil diameter measured during the 500ms before each stimulus [49]. For this analysis, DMD was employed to identify 100 dictionary functions.

Fig. 7

Computed eigenfunctions (real part shown) of the stochastic Van der Pol oscillator. Due to conjugate symmetry, we have only shown eigenfunctions corresponding to eigenvalues with non-negative imaginary parts

Fig. 8

Left: Subspace errors \(\delta _n(X_1)\) and \(\delta _n(X_2)\) for the stochastic Van der Pol oscillator, computed using (29) and (30). Right: Variance of trajectory. We have rescaled the horizontal axis in both plots to correspond to time

Fig. 9

Comparison of computed \(K^nX_i\), where \(K\in \mathbb {C}^{N\times N}\) is the EDMD matrix, and the true values of \(\mathscr {K}^nX_i\)

Our data-driven approach was effective in identifying an isolated, population-level coherent mode at the stimulus frequency. As illustrated in Fig. 10, this is evidenced by a distinct eigenvalue, highlighted in green, which consistently appears as a clear local minimum in the variance pseudospectra contour plots across various arousal states. Without the variance pseudospectra, discerning which DMD eigenvalues are reliable and indicative of coherence can be challenging. We observed that individual neurons displayed a variety of waveforms, all linked to this single linear dynamic mode. Demonstrating the diversity of these responses, Fig. 11 showcases five randomly chosen sample trajectories from the KMD. These trajectories highlight the distinct spike counts and/or timings of different neurons, all parsimoniously represented by a single latent mode.

Fig. 10

Variance pseudospectra for a single mouse in the neuronal population dynamics example. Each case corresponds to a pupil diameter of \(8\%\) (left), \(28\%\) (middle), and \(43\%\) (right). The identified mode is shown in green, and the red dots show the other DMD eigenvalues. The variance pseudospectra change considerably as the arousal state changes, but the green eigenvalue shows little variability

Importantly, neuronal responses demonstrate significant trial-to-trial variability, a phenomenon of considerable physiological interest due to its close relationship with ongoing fluctuations in an animal’s internal state. Dynamical systems approaches are adept at modeling this type of variability, which often stems from changes in the neural population’s pre-stimulus state [61]. Furthermore, the extent of this variability is heavily influenced by internal states like arousal and attention, as detailed in [50]. Our stochastic modeling approach enables us to additionally estimate this second source of trial-to-trial variability in neuronal responses.

To validate the physiological significance of our variance estimates, we analyzed the variance linked to the Koopman operators computed across each of 15 levels of pupil diameter, effectively using pupil diameter as a parameter for the Koopman operator in relation to arousal. Our hypothesis was that this analysis would reflect the well-known “U-shape” pattern described by the Yerkes–Dodson law [86], with variance minimized at intermediate arousal levels [49]. Figure 10 indicates that the identified eigenvalue (i.e., the expectation component) remains consistent across various arousal states. However, Fig. 12 reveals a notable modulation of the variance residuals with arousal level, in line with our prediction: the variance associated with the leading mode is specifically reduced at intermediate arousal levels. This pattern underscores the physiological relevance of the variance estimates yielded by our modeling approach. Consequently, our findings suggest that arousal systematically influences dynamical variance, providing both practical and physiological rationales for employing dynamical models that explicitly estimate variance. Overall, our data-driven framework offers a unified and formal representation of neural dynamics, parsimoniously capturing multiple physiologically significant features in the data.

Fig. 11

Randomly selected sample trajectories from the Koopman mode corresponding to the eigenvalue shown in green in Fig. 10. The gray reference region on the left shows the wavelength predicted by the eigenvalue

Fig. 12

The variance relative squared residual as a function of the arousal state. The red lines show the average across the mice, and the green error bounds correspond to the standard error of the mean. The “U-shape” is characteristic of the so-called Yerkes–Dodson law, which we produce in a data-driven fashion from the dynamics

6 Conclusion

We have demonstrated the role of variance in the Koopman analysis of stochastic dynamical systems. To effectively study projection errors in data-driven approaches for these systems, it is crucial to move beyond expectations and study more than just the stochastic Koopman operator. Incorporating variance into the Koopman framework enhances our understanding of spectral properties and the related projection errors. By analyzing various types of residuals, we have developed data-driven algorithms capable of computing the spectral properties of infinite-dimensional stochastic Koopman operators. Furthermore, we introduced the concept of variance pseudospectra, a tool designed to assess statistical coherency. From a computational perspective, our work includes several convergence theorems pertinent to the spectral properties of these operators. In the realm of experimental neural recordings, our framework has proven effective in extracting and compactly representing multiple data features with known physiological significance.

There are several avenues of future work related to this paper. One such direction involves an analysis of the algorithms and theorems presented in Sect. 4 in scenarios involving noisy snapshot data. Another avenue explores the trade-offs between computing the squared residual and variance terms, as outlined in (15), potentially reflecting variance-bias trade-offs in statistical analysis. Additionally, we aim to assess the robustness and generalizability of the proposed framework across further stochastic dynamical systems.