Spatiotemporal pattern extraction by spectral analysis of vector-valued observables

We present a data-driven framework for extracting complex spatiotemporal patterns generated by ergodic dynamical systems. Our approach, called Vector-valued Spectral Analysis (VSA), is based on an eigendecomposition of a kernel integral operator acting on a Hilbert space of vector-valued observables of the system, taking values in a space of functions (scalar fields) on a spatial domain. This operator is constructed by combining aspects of the theory of operator-valued kernels for machine learning with delay-coordinate maps of dynamical systems. In contrast to conventional eigendecomposition techniques, which decompose the input data into pairs of temporal and spatial modes with a separable, tensor product structure, the patterns recovered by VSA can be manifestly non-separable, requiring only a modest number of modes to represent signals with intermittency in both space and time. Moreover, the kernel construction naturally quotients out dynamical symmetries in the data, and exhibits an asymptotic commutativity property with the Koopman evolution operator of the system, enabling decomposition of multiscale signals into dynamically intrinsic patterns. Application of VSA to the Kuramoto-Sivashinsky model demonstrates significant performance gains in efficient and meaningful decomposition over eigendecomposition techniques utilizing scalar-valued kernels.


Introduction
Spatiotemporal pattern formation is ubiquitous in physical, biological, and engineered systems, ranging from molecular-scale reaction-diffusion systems, to engineering-and geophysical-scale convective flows, and astrophysical flows, among many examples [1][2][3]. The mathematical models for such systems are generally formulated by means of partial differential equations (PDEs), or coupled ordinary differential equations, with dissipation playing an important role in the development of low-dimensional effective dynamics on attracting subsets of the state space [4]. In light of this property, many pattern forming systems are amenable to analysis by empirical, data-driven techniques, complementing the scientific understanding gained from first-principles approaches.
Historically, many of the classical Proper Orthogonal Decomposition (POD) and Principal Component Analysis (PCA) techniques for spatiotemporal pattern extraction have been based on the spectral properties of temporal and spatial covariance operators estimated from snapshot data [5,6]. In Singular Spectrum Analysis (SSA) and related algorithms [7][8][9], combining this approach with delay-coordinate maps of dynamical systems [10][11][12][13][14] generally improves the representation of the information content of the data in terms of a few meaningful modes. More recently, advances in machine learning and applied harmonic analysis [15][16][17][18][19][20][21] have led to techniques for recovering temporal and spatial patterns through the eigenfunctions of kernel integral operators (e.g., heat operators) defined intrinsically in terms of a Riemannian geometric structure of the data. In particular, in a family of techniques called Nonlinear Laplacian Spectral Analysis (NLSA) [22], and independently in [23], the diffusion maps algorithm [18] was combined with delay-coordinate maps to extract spatiotemporal patterns through the eigenfunctions of a kernel integral operator adept at capturing distinct and physically meaningful timescales in individual eigenmodes from multiscale high-dimensional signals.
At the same time, spatial and temporal patterns have been extracted from eigenfunctions of Koopman [24][25][26][27][28][29][30][31] and Perron-Frobenius [32] operators governing the evolution of observables and probability measures, respectively, in dynamical systems [33,34]. Koopman eigenfunction analysis is also related to the Dynamic Mode Decomposition (DMD) algorithm [35] and Linear Inverse Model techniques [36]. An advantage of these approaches is that they target operators defined intrinsically for the dynamical system generating the data, and thus able, in principle, to recover temporal and spatial patterns of higher physical interpretability and utility in predictive modeling than POD and kernel integral operator based approaches. In practice, however, the Koopman and Perron-Frobenius operators tend to have significantly more complicated spectral properties (e.g., non-isolated eigenvalues and/or continuous spectra), hindering the stability and convergence of datadriven approximation techniques. In [27,30,31], these issues were addressed through an approximation scheme for the generator of the Koopman group with rigorous convergence guarantees, utilizing a datadriven orthonormal basis of the Hilbert space of the dynamical system acquired through diffusion maps. There, it was also shown that the eigenfunctions recovered by kernel integral operators defined on delaycoordinate mapped data (e.g., the covariance and heat operators in SSA and NLSA, respectively) in fact converge to Koopman eigenfunctions in the limit of infinitely many delays, indicating a deep connection between these two branches of data analysis algorithms.
All of the techniques described above recover from the data a set of temporal patterns and a corresponding set of spatial patterns, sometimes referred to as "chronos" and "topos" modes, respectively [5]. In particular, for a dynamical system with a state space X developing patterns in a physical domain Y , each chronos mode, ϕ j , corresponds to a scalar-(real-or complex-) valued function on X, and the corresponding topos mode, ψ j , corresponds to a scalar-valued function on Y . Spatiotemporal reconstructions of the data with these approaches thus correspond to linear combinations of tensor product patterns of the form ϕ j ⊗ ψ j , mapping pairs of points (x, y) in the product space Ω = X × Y to the number ϕ j (x)ψ j (y). For a dynamical system possessing a compact invariant set A ⊆ X (e.g., an attractor) with an ergodic invariant measure, the chronos modes effectively become scalar-valued functions on A, which may be of significantly smaller dimension than X, increasing the robustness of approximation of these modes from finite datasets.
Evidently, for spatiotemporal signals F (x, y) of high complexity, tensor product patterns, with separable dependence on x and y, can be highly inefficient in capturing the properties of the input signal. That is, the number l of such patterns needed to recover F at high accuracy via a linear superposition is generally large, with none of the individual patterns ϕ j ⊗ ψ j being representative of F . In essence, the problem is similar to that of approximating a non-separable space-time signal in a tensor product basis of temporal and spatial basis functions. Another issue with tensor product decompositions based on scalarvalued eigenfunctions is that in the presence of nontrivial spatial symmetries, the recovered patterns are oftentimes pure symmetry modes (e.g., Fourier modes in a periodic domain with translation invariance), with minimal dynamical significance and physical interpretability [6,37].
Here, we present a framework for spatiotemporal pattern extraction, called Vector-valued Spectral Analysis (VSA), designed to alleviate the shortcomings mentioned above. The fundamental underpinning of VSA is that time-evolving spatial patterns have a natural structure as vector-valued observables on the system's state space, and thus data analytical techniques operating on such spaces are likely to offer maximal descriptive efficiency and physical insight. We show that eigenfunctions of kernel integral operators on vector-valued observables, constructed by combining aspects of the theory of operator-valued kernels [38][39][40] with delaycoordinate maps of dynamical systems [10][11][12][13][14]: a) Are superior to conventional algorithms in capturing signals with intermittency in both space and time; b) Naturally incorporate any underlying dynamical symmetries, eliminating redundant modes and thus improving physical interpretability of the results; c) Have a correspondence with Koopman operators, allowing detection of intrinsic dynamical timescales; and, d) Can be stably approximated via data-driven techniques that provably converge in the asymptotic limit of large data.
The plan of this paper is as follows. Section 2 introduces the class of dynamical systems under study, and provides an overview of data analysis techniques based on scalar kernels. In Section 3, we present the VSA framework for spatiotemporal pattern extraction using operator-valued kernels, and in Section 4 discuss the behavior of the method in the presence of dynamical symmetries, as well as its correspondence with Koopman operators. Section 5 describes the data-driven implementation of VSA. In Section 6, we present applications to the Kuramoto-Sivashinsky (KS) PDE model [41,42] in chaotic regimes. Our primary conclusions are described in Section 7. Technical results and descriptions of basic properties of kernels and Koopman operators are collected in four appendices. An application of VSA to a toy spatiotemporal signal featuring spatially localized propagating disturbances can be found in [43].

Dynamical system and spaces of observables
We begin by introducing the dynamical system and the spaces of observables under study. The dynamics evolves by a C 1 flow map Φ t : X → X, t ∈ R, on a manifold X, possessing a Borel ergodic invariant probability measure µ with compact support A ⊆ X (e.g., an attractor). The system develops patterns on a spatial domain Y , which has the structure of a compact metric space equipped with a finite Borel measure (volume) ν. As a natural space of vector-valued observables, we consider the Hilbert space H = L 2 (X, µ; H Y ) of square-integrable functions with respect to the invariant measure µ, taking values in H Y = L 2 (Y, ν). That is, the elements of H are equivalence classes of functions f : X → H Y , such that for µ-almost every dynamical state x ∈ X, f (x) is a scalar (complex-valued) field on Y , square-integrable with respect to ν.
For every such observable f , the map t → f (Φ t (x)) describes a spatiotemporal pattern generated by the dynamics. Given f , f ∈ H and g, g ∈ H Y , the corresponding inner products on H and H Y are given by f , f H = X f (x), f (x) H Y dµ(x) and g, g H Y = Y g * (y)g (y) dν(y), respectively.
An important property of H is that it exhibits the isomorphisms where H X = L 2 (X, µ) and H Ω = L 2 (Ω, ρ) are Hilbert spaces of scalar-valued functions on X and the product space Ω = X × Y , square-integrable with respect to the invariant measure µ and the product measure ρ = µ × ν, respectively (the inner products of H X and H Ω have analogous definitions to the inner product of H Y ). That is, every f ∈ H can be equivalently viewed as an element of the tensor product space H X ⊗ H Y , meaning that it can be decomposed as f = ∞ j=0 ϕ j ⊗ ψ j for some ϕ j ∈ H X and ψ j ∈ H Y , or it can be represented by a scalar-valued function f ∈ H Ω such that f (x)(y) = f (x, y). Of course, not every observable f ∈ H is of pure tensor-product form, f = ϕ ⊗ ψ, for some ϕ ∈ H X and ψ ∈ H Y .
We consider that measurements F (x n ) of the system are taken along a dynamical trajectory x n = Φ nτ (x 0 ), n ∈ N, starting from a point x 0 ∈ X at a fixed sampling interval τ > 0 through a continuous vector-valued observation map F ∈ H. We assume that τ is such that µ is an ergodic invariant probability measure of the discrete-time map Φ τ .

Separable data decompositions via scalar kernel eigenfunctions
Before describing the operator-valued kernel formalism at the core of VSA, we outline the standard approach to separable decompositions of spatiotemporal data as in (1) via eigenfunctions of kernel integral operators associated with scalar-valued kernels. In this context, a kernel is a continuous bivariate function k : X ×X → R, which assigns a measure of correlation or similarity to pairs of dynamical states in X. Sometimes, but not always, we will require that k be symmetric, i.e., k(x, x ) = k(x , x) for all x, x ∈ X. Two examples of popular kernels used in applications (both symmetric) are the covariance kernels employed in POD, and radial Gaussian kernels, which are frequently used in manifold learning applications. Note that in both of the above examples the dependence of k(x, x ) on x and x is through the values of F at these points alone; this allows k(x, x ) to be computable from observed data, without explicit knowledge of the underlying dynamical states x and x .
Hereafter, we will always work with such "data-driven" kernels. Associated with every scalar-valued kernel is an integral operator K : H X → H X , acting on f ∈ H X according to the formula If k is symmetric, then by compactness of A and continuity of k, K is a compact, self-adjoint operator with an associated orthonormal basis {ϕ 0 , ϕ 1 , . . .} of H X consisting of its eigenfunctions. Moreover, the eigenfunctions ϕ j corresponding to nonzero eigenvalues are continuous. These eigenfunctions are employed as the chronos modes in (1), each inducing a continuous temporal pattern, t → ϕ j (Φ t (x)), for every state x ∈ X. The spatial pattern ψ j ∈ H Y corresponding to ϕ j is obtained by pointwise projection of the observation map onto ϕ j , namely where F y ∈ H X is the continuous scalar-valued function on X satisfying F y (x) = F (x)(y) for all x ∈ X.

Delay-coordinate maps and Koopman operators
A potential shortcoming of spatiotemporal pattern extraction via the kernels in (2) and (3) is that the corresponding integral operators depend on the dynamics only indirectly, e.g., through the geometrical structure of the set F (A) ⊂ H Y near which the data is concentrated. Indeed, a well known deficiency of POD, particularly in systems with symmetries, is failure to identify low-variance, yet dynamically important patterns [37]. As a way of addressing this issue, it has been found effective [7-9, 22, 23] to first embed the observed data in a higher-dimensional data space through the use of delay-coordinate maps, and then extract spatial and temporal patterns through a kernel operating in delay-coordinate space. For instance, analogs of the covariance and Gaussian kernels in (2) and (3) in delay-coordinate space are given by and respectively, here Q ∈ N is the number of delays. The covariance kernel in (6) is essentially equivalent to the kernel employed in multi-channel SSA [9] in an infinite-channel limit, and the Gaussian kernel in (7) is closely related to the kernel utilized in NLSA (though the NLSA kernel employs a state-dependent distance scaling akin to (19) ahead, as well as Markov normalization, and these features lead to certain technical advantages compared to unnormalized radial Gaussian kernels).
As is well known [10][11][12][13][14], delay-coordinate maps can help recover the topological structure of state space from partial measurements of the system (i.e., non-injective observation maps), but in the context of kernel algorithms they also endow the kernels, and thus the corresponding eigenfunctions, with an explicit dependence on the dynamics. In [30,31], it was established that as the number of delays Q grows, the integral operators K Q associated with a family of scalar kernels k Q operating in delay-coordinate space converge in operator norm, and thus in spectrum, to a compact kernel integral operator K ∞ on H X commuting with the Koopman evolution operators [33,34] of the dynamical system. The latter are the unitary operators U t : H X → H X , t ∈ R, acting on observables by composition with the flow map, thus governing the evolution of observables in H X under the dynamics.
In the setting of measure-preserving ergodic systems, associated with U t is a distinguished orthonormal set {z j } of observables z j ∈ H X consisting of Koopman eigenfunctions (see Appendix A). These observables have the special property of exhibiting time-periodic evolution under the dynamics at a single frequency α j ∈ R intrinsic to the dynamical system, U t z j = e iαj t z j , even if the underlying dynamical flow Φ t is aperiodic. Moreover, every Koopman eigenspaces is onedimensional by ergodicity. Because commuting operators have common eigenspaces, and the eigenspaces of compact operators corresponding to nonzero eigenvalues are finite-dimensional, it follows that as Q increases, the eigenfunctions of K Q at nonzero eigenvalues acquire increasingly coherent (periodic or quasiperiodic) time evolution associated with a finite number of Koopman eigenfrequencies α j . This property significantly enhances the physical interpretability and predictability of these patterns, providing justification for the skill of methods such as SSA and NLSA in extracting dynamically significant patterns from complex systems. Conversely, because kernel integral operators are generally more amenable to approximation from data than Koopman operators (which can have a highly complex spectral behavior, including non-isolated eigenvalues and continuous spectrum), the operators K Q provide an effective route for identifying finite-dimensional approximation spaces to stably and efficiently solve the Koopman eigenvalue problem.

Differences between covariance and Gaussian kernels
Before closing this section, it is worthwhile pointing out two differences between covariance and Gaussian kernels, indicating that the latter may be preferable to the former in applications. First, Gaussian kernels are strictly positive, and bounded below on compact sets. That is, for every compact set S ⊆ X (including S = A), there exists a constant c S > 0 such that k(x, x ) ≥ c S for all x, x ∈ S. This property allows Gaussian kernels to be normalizable to ergodic Markov diffusion kernels [18,21]. In a dynamical systems context, an important property of such kernels is that the corresponding integral operators always have an eigenspace at eigenvalue 1 containing constant functions, which turns out to be useful in establishing well-posedness of Galerkin approximation techniques for Koopman eigenfunctions [30]. Markov diffusion operators are also useful for constructing spaces of observables of higher regularity than L 2 , such as Sobolev spaces.
Second, if there exists a finite-dimensional linear subspace of H Y containing the image of A under F , then the integral operator K associated with the covariance kernel has necessarily finite rank (bounded above by the dimension of that subspace), even if F is an injective map on A. This effectively limits the richness of observables that can be stably extracted from data-driven approximations to covariance eigenfunctions. In fact, it is a well known property of covariance kernels that every eigenfunction ϕ j at nonzero corresponding eigenvalue depends linearly on the observation map; specifically, up to proportionality constants, ϕ j (x) = ψ j , F (x) H Y , and the number of such patterns is clearly finite if F (x) spans a finitedimensional linear space as x is varied. On the other hand, apart from trivial cases, the kernel integral operators associated with Gaussian kernels have infinite rank, even if F is non-injective, and moreover if F is injective they have no zero eigenvalues. In the latter case, data-driven approximations to the eigenfunctions of K provide an orthonormal basis for the full H X space. Similar arguments also motivate the use of Gaussian kernels over polynomial kernels. In effect, by invoking the Taylor series expansion of the exponential function, a Gaussian kernel can be thought of as an "infinite-order" polynomial kernel.

Vector-valued spectral analysis (VSA) formalism
The main goal of VSA is to construct a decomposition of the observation map F via an expansion of the form where the c j and φ j are real-valued coefficients and vector-valued observables in H, respectively. Along a dynamical trajectory starting at x ∈ X, every φ j gives rise to a spatiotemporal pattern t → φ j (Φ t (x)), generalizing the time series t → ϕ j (Φ t (x)) from Section 2.2. A key consideration in the VSA construction is that the recovered patterns should not necessarily be of the form φ j = ϕ j ⊗ ψ j for some ϕ j ∈ H X and ψ j ∈ H Y , as would be the case in the conventional decomposition in (1). To that end, we will determine the φ j through the vector-valued eigenfunctions of an integral operator acting on H directly, as opposed to first identifying scalar-valued eigenfunctions in H X , and then forming tensor products with the corresponding projection-based spatial patterns, as in Section 2.2. As will be described in detail below, the integral operator nominally employed by VSA is constructed using the theory of operator-valued kernels [38][39][40], combined with delay-coordinate maps and Markov normalization as in NLSA.

Operator-valued kernel and vector-valued eigenfunctions
Let B(H Y ) be the Banach space of bounded linear maps on H Y , equipped with the operator norm. For our purposes, an operator-valued kernel is a continuous map l : X × X → B(H Y ), mapping pairs of dynamical states in X to a bounded operator on H Y . Every such kernel has an associated integral operator L : H → H, acting on vector-valued observables according to the formula (cf. (4)) where the integral above is a Bochner integral (a vector-valued generalization of the Lebesgue integral). Note that operator-valued kernels and their corresponding integral operators can be viewed as generalizations of their scalar-valued counterparts from Section 2.2, in the sense that if Y only contains a single point, then H Y is isomorphic to the vector space of complex numbers (equipped with the standard operations of addition and scalar multiplication and the inner product w, z C = w * z), and B(H Y ) is isomorphic to the space of multiplication operators on C by complex numbers. In that case, the action l(x, x ) f (x ) of the linear map l(x, x ) ∈ B(H Y ) on the function f (x ) ∈ H Y becomes equivalent to multiplication of the complex number f (x), where f is a complex-valued observable in H X , by the value k(x, x ) ∈ C of a scalar-valued kernel k on X.
Consider now an operator-valued kernel l : X × X → B(H Y ), such that for every pair (x, x ) of states in X, l(x, x ) = L xx is a kernel integral operator on H Y associated with a continuous kernel l xx : Y × Y → R with the symmetry property This operator acts on a scalar-valued function g ∈ H Y on the spatial domain via an integral formula analogous to (4), viz.
Moreover, it follows from (9) that the corresponding operator L on vector-valued observables is self-adjoint and compact, and thus there exists an orthonormal basis { φ j } of H consisting of its eigenfunctions, Hereafter, we will always order the eigenvalues λ j of integral operators in decreasing order starting at j = 0. By continuity of l and l xx , every eigenfunction φ j at nonzero corresponding eigenvalue is a continuous function on X, taking values in the space of continuous functions on Y . Such eigenfunctions can be employed in the VSA decomposition in (8) with the expansion coefficients Note that, as with scalar kernel techniques, the decomposition in (8) does not include eigenfunctions at zero corresponding eigenvalue, for, to our knowledge, no data-driven approximation schemes are available for such eigenfunctions. See Section 5 and Appendix D for further details.
Because H is isomorphic as Hilbert space to the space H Ω of scalar-valued observables on the product space Ω = X × Y (see Section 2.1), every operator-valued kernel satisfying (9) can be constructed from a symmetric scalar kernel k : Ω × Ω → R by defining l(x, x ) = L xx as the integral operator associated with the kernel l xx (y, y ) = k(ω, ω ), ω = (x, y), ω = (x , y ).
In particular, the vector-valued eigenfunctions of L are in one-to-one correspondence with the scalar-valued eigenfunctions of the integral operator K : H Ω → H Ω associated with k, where That is, the eigenvalues and eigenvectors of K satisfy the equation Kφ j = λ j φ j for the same eigenvalues as those of L, and we also have It is important to note that unless k is separable as a product of kernels on X and Y , i.e., k((x, y), (x , y )) = k (X) (x, x )k (Y ) (y, y ) for some k (X) : X × X → R and k (Y ) : Y × Y → R, the φ j will not be of pure tensor product form, φ j = ϕ j ⊗ ψ j with ϕ j ∈ H X and ψ j ∈ H Y . Thus, passing to an operator-valued kernel formalism allows one to perform decompositions of significantly higher generality than the conventional approach in (1).

Operator-valued kernels with delay-coordinate maps
While the framework described in Section 3.1 can be implemented with a very broad range of kernels, VSA employs kernels leveraging the insights gained from SSA, NLSA, and related techniques on the use of kernels operating in delay-coordinate space. That is, analogously to the kernels employed by these methods that depend on the values F ((x)), F (Φ −τ (x)), . . . , F (Φ −(Q−1)τ (x)) of the observation map on dynamical trajectories, VSA is based on kernels on the product space Ω that also depend on data observed on dynamical trajectories, but with the key difference that this dependence is through the local values F y (x), F y (Φ −τ (x)), . . . , F y (Φ −(Q−1)τ (x)) of the observation map at each point y in the spatial domain Y . Specifically, defining the family of pointwise delay embedding mapsF Q : Ω → R Q with Q ∈ N and we require that the kernels k Q : Ω × Ω → R utilized in VSA have the following properties: 2. The sequence of kernels k 1 , k 2 , . . . converges in H Ω ⊗ H Ω norm to a kernel k ∞ ∈ H Ω ⊗ H Ω .
3. The limit kernel k ∞ is invariant under the dynamics, in the sense that for all t ∈ R and (ρ × ρ)-a.e.
(ω, ω ) ∈ Ω × Ω, where ω = (x, y) and ω = (x , y ), We denote the corresponding integral operator on vector-valued observables in H, determined through (11), by L Q . As we will see below, operators of this class can be highly advantageous for the analysis of signals with an intermittent spatiotemporal character, as well as signals generated in the presence of dynamical symmetries. In addition, the family L Q exhibits a commutativity with Koopman operators in the infinitedelay limit as in the case of SSA and NLSA.
Let ω = (x, y) and ω = (x , y ) with x, x ∈ X and y, y ∈ Y be arbitrary points in Ω. As concrete examples of kernels satisfying the conditions listed above, and are analogs of the covariance and Gaussian kernels in (2) and (3), respectively, defined on Ω. For the reasons stated in Section 2.4, in practice we generally prefer working with Gaussian kernels than covariance kernels.
Moreover, following the approach employed in NLSA and in [27,31,44,45], we nominally consider a more general class of Gaussian kernels than (18), namely where s Q : Ω → R is a continuous non-negative scaling function. Intuitively, the role of s Q is to adjust the bandwidth (variance) of the Gaussian kernel in order to account for variations in the sampling density and time tendency of the data. The explicit construction of this function is described in Appendix C.1. For the purposes of the present discussion, it suffices to note that s Q (ω) can be evaluated given the values of F y on the lagged trajectory Φ −qτ (x), so that, as with the covariance and radial Gaussian kernels, the class of kernels in (19) also satisfy (15). The existence of limit k ∞ for this family of kernels, as well as the covariance kernels in (17), satisfying the conditions listed above is established in Appendix C.3.

Markov normalization
As a final kernel construction step, when working with a strictly positive, symmetric kernel k Q , such as (18) and (19), we normalize it to a continuous Markov kernel p Q : Ω × Ω → R, satisfying Ω p Q (ω, ·) dρ = 1 for all ω ∈ Ω, using the normalization procedure introduced in the diffusion maps algorithm [18] and in [21]; see Appendix C.2 for a description. Due to this normalization, the corresponding integral operator P Q : H Ω → H Ω is an ergodic Markov operator having a simple eigenvalue λ 0 = 1 and a corresponding constant eigenfunction φ 0 . Moreover, the range of P Q is included in the space of continuous functions of Ω. While this operator is not necessarily self-adjoint (since the kernel p Q resulting from diffusion maps normalization is generally non-symmetric), it can be shown that it is related to a self-adjoint, compact operator by a similarity transformation. As a result, all eigenvalues of P Q are real, and admit the ordering 1 = λ 0 > λ 1 ≥ λ 2 · · · . Moreover, there exists a (non-orthogonal) basis {φ 0 , φ 1 , . . .} of H Ω consisting of eigenfunctions corresponding to these eigenvalues, as well as a dual basis {φ 0 , φ 1 , . . .} consisting of eigenfunctions of P * Q satisfying φ i , φ j HΩ = δ ij . As with their unnormalized counterparts k Q , the sequence of Markov kernels p Q has a well-defined limit p ∞ ∈ H Ω ⊗ H Ω as Q → ∞; see Appendix C.2 for further details. The eigenfunctions φ j induce vector-valued observables φ j ∈ H through (13), which are in turn eigenfunctions of an integral operator P Q : H → H associated with the operator-valued kernel determined via (11) applied to the Markov kernel p Q . Similarly, the dual eigenfunctions φ j induce vector-valued observables φ j ∈ H, which are eigenfunctions of P * Q satisfying φ i , φ j H = δ ij . Equipped with these observables, we perform the VSA decomposition in (8) with the expansion coefficients c j = φ j , F H . The latter expression can be viewed as a generalization of (10), applicable for non-orthonormal eigenbases.

Properties of the VSA decomposition
In this section, we study the properties of the operators K Q employed in VSA and their eigenfunctions in two relevant scenarios in spatiotemporal data analysis, namely data generated by systems with (i) dynamical symmetries, and (ii) non-trivial Koopman eigenfunctions. These topics will be discussed in Sections 4.2 and 4.3, respectively. We begin in Section 4.1 with some general observations on the topological structure of spatiotemporal data in delay-coordinate space, and the properties this structure imparts on the recovered eigenfunctions.

Bundle structure of spatiotemporal data
In order to gain insight on the behavior of VSA, it is useful to consider the triplet (Ω, B Q , π Q ), where B Q = F Q (Ω) is the image of the product space Ω under the delay-coordinate observation map, and π Q : Ω → B Q is the continuous surjective map defined as π Q (ω) =F Q (ω) for any ω ∈ Ω. Such a triplet forms a topological bundle with Ω, B Q , and π Q playing the role of the total space, base space, and projection map, respectively. In particular, π Q partitions Ω into equivalence classes called fibers, on which π Q (ω) attains a fixed value (i.e.,ω lies in [ω] Q if π Q (ω) = π Q (ω)). By virtue of (15), the kernel k Q is a continuous function, constant on the [·] Q equivalence classes, i.e., for all ω, ω ∈ Ω,ω ∈ Therefore, since for any f ∈ H Ω andω ∈ [ω] Q , the range of the integral operator K Q is a subspace of the continuous functions on Ω, containing constant functions on the [·] Q equivalence classes. Correspondingly, the eigenfunctions φ j corresponding to nonzero eigenvalues (which lie in ran K Q ) have the form φ j = η j •π Q , where η j are continuous functions in the Hilbert space L 2 (B Q , π Q * ρ) of scalar-valued functions on B Q , square-integrable with respect to the pushforward of the measure ρ under π Q . We can thus conclude that, viewed as a scalar-valued function on Ω, the VSAreconstructed signal F l from (8) lies in the closed subspace ran K Q = span{φ j : λ j > 0} of H Ω spanned by constant functions on the [·] Q equivalence classes. Note that ran K Q is not necessarily decomposable as a tensor product of H X and H Y subspaces.
Observe now that with the definition of the kernel in (19), the [·] Q equivalence classes consist of pairs of dynamical states x ∈ Ω and spatial points y ∈ Y for which the evolution of the observable F y is identical over Q delays. While one can certainly envision scenarios where these equivalence classes each contain only one point, in a number of cases of interest, including the presence of dynamical symmetries examined below, the [·] Q equivalence classes will be nontrivial, and as a result H Q will be a strict subspace of H Ω . In such cases, the patterns recovered by VSA naturally factor out data redundancies, which generally enhances both robustness and physical interpretability of the results. Besides spatiotemporal data, the bundle construction described above may be useful in other scenarios, e.g., analysis of data generated by dynamical systems with varying parameters [46].

Dynamical symmetries
An important class of spatiotemporal systems exhibiting nontrivial [·] Q equivalence classes is PDE models equivariant under the action of symmetry groups on the spatial domain [6]. As a concrete example, we consider a PDE for a scalar field in H Y , possessing a C 1 inertial manifold; i.e., a finite-dimensional, forwardinvariant submanifold of H Y containing the attractor of the system, and onto which every trajectory is exponentially attracted [4]. In this setting, the inertial manifold plays the role of the state space manifold X. Moreover, we assume that the full system state is observed, so that the observation map F reduces to the inclusion X → H Y .
Consider now a topological group G (the symmetry group) with a continuous left action Γ g Y : Y → Y , g ∈ G, on the spatial domain, preserving null sets with respect to ν. Suppose also that the dynamics is equivariant under the corresponding induced action Γ g X : , on the state space manifold. This means that the dynamical flow map and the symmetry group action commute, ) is a solution starting at Γ g X (x). Additional aspects of symmetry group actions and equivariance are outlined in Appendix B. Our goal for this section is to examine the implications of (22) to the properties of the operators K Q employed in VSA and their eigenfunctions.

Dynamical symmetries and VSA eigenfunctions
We begin by considering the induced action Γ g Ω : Ω → Ω of G on the product space Ω, defined as This group action partitions Ω into orbits, defined for every ω ∈ Ω as the subsets Γ Ω (ω) ⊆ Ω with As with the subsets [ω] Q ⊂ Ω from (20) associated with delay-coordinate maps, the G-orbits on Ω form equivalence classes, consisting of points linked together by symmetry group actions (as opposed to having common values under delay coordinate maps). In general, these two sets of equivalence classes are unrelated, but in the presence of dynamical symmetries, they are, in fact, compatible, as follows: If the equivariance property in (22) holds, then for every ω ∈ Ω, the G-orbit Γ Ω (ω) is a subset of the [ω] Q equivalence class. As a result, the following diagram commutes: x ∈ X, and y ∈ G, Therefore, since F is an inclusion, for every g ∈ G and ω = (x, y) ∈ Ω, we havẽ We therefore conclude from Proposition 1 and (21) that the kernel k Q is constant on G-orbits, and therefore the eigenfunctions φ j corresponding to nonzero eigenvalues of K Q are continuous functions with the invariance property φ j • Γ g Ω = φ j , ∀g ∈ G. This is one of the key properties of VSA, which we interpret as factoring the symmetry group from the recovered spatiotemporal patterns.

Spectral characterization
In order to be able to say more about the implication of the results in Section 4.2.1 at the level of operators, we now assume that the group action Γ g Ω preserves the measure ρ. Then, there exists a unitary representation of G on H Ω , whose representatives are unitary operators R g Ω : Another group of unitary operators acting on H Ω consists of the Koopman operators,Ũ t : H Ω → H Ω , which we define here via a trivial lift of the Koopman operators U t on H X , namelyŨ t = U t ⊗ I H Y , where I H Y is the identity operator on H Y ; see Appendix A for further details. In fact, the map t →Ũ t constitutes a unitary representation of the Abelian group of real numbers (playing the role of time), equipped with addition as the group operation, much like g → R g Ω is a unitary representation of the symmetry group G. The following theorem summarizes the relationship between the symmetry group representatives and the Koopman and kernel integral operators on H Ω . Theorem 2. For every g ∈ G and t ∈ R, the operator R g Ω commutes with K Q andŨ t . Moreover, every function in the range of K Q is invariant under R g Ω , i.e., R g Proof. The commutativity between R g Ω andŨ t is a direct consequence of (22). To verify the claims involving K Q , we use (23) and the fact that Γ g Ω preserves ρ to compute where g and g are arbitrary, and the equalities hold for ρ-a.e. ω ∈ Ω. Setting g = g −1 in the above, and acting on both sides by R g Ω , leads to R g On the other hand, setting g to the identity element of G leads to K Q = R g Ω K Q , completing the proof of the theorem. Because commuting operators have common eigenspaces, Theorem 2 establishes the existence of two sets of common eigenspaces associated with the symmetry group, namely common eigenspaces between Γ g Ω and K Q and those between R g Ω andŨ t . In general, these two families of eigenspaces are not compatible sinceŨ t and K Q many not commute, so for now we will focus on the common eigenspaces between R g Ω and K Q which are accessible via VSA with finitely many delays. In particular, because R g Ω K Q = K Q , and every eigenspace W l of K Q at nonzero corresponding eigenvalue λ l is finite-dimensional (by compactness of that operator), we can conclude that the W l are finite-dimensional subspaces onto which the action of R g Ω reduces to the identity. In other words, the eigenspaces of K Q at nonzero corresponding eigenvalues are finite-dimensional trivial representation spaces of G, and every VSA eigenfunction φ j is also an eigenfunction of R g Ω at eigenvalue 1. At this point, one might naturally ask to what extent these properties are shared in common between VSA and conventional eigendecomposition techniques based on scalar kernels on X. In particular, in the measure-preserving setting for the product measure ρ = µ×ν examined above, it must necessarily be the case that the group actions Γ g X and Γ g Y separately preserve µ and ν, respectively, thus inducing unitary operators Γ g X : H X → H X and Γ g Y : H Y → H Y defined analogously to R g Ω . For a variety of kernels k (X) Q : X × X → R that only depend on observed data through inner products and norms on H Y (e.g., the covariance and Gaussian kernels in Section 2.2), this implies that the invariance property holds for all g ∈ G and x, x ∈ X. Moreover, proceeding analogously to the proof of Theorem 2, one can show that Γ g X and the integral operator K commute, and thus have common ⊂ H Ω and that on the W l subspaces recovered by VSA, is that the former is generally not trivial, i.e., in general, R g Ω does not reduce to the identity map on W . A well-known consequence of this is that the corresponding spatiotemporal patterns ϕ j ⊗ ψ j from (1) become pure symmetry modes (e.g., Fourier modes in dynamical systems with translation invariance), hampering their physical interpretability.
This difference between VSA and conventional eigendecomposition techniques can be traced back to the fact that that on X there is no analog of Proposition 1 relating equivalence classes of points with respect to delay-coordinate maps and group orbits on that space. Indeed, Proposition 1 plays an essential role in establishing the kernel invariance property in (23), which is stronger than (24) as it allows action by two independent group elements. Equation 23 is in turn necessary to determine that K Q R g Ω = K Q in Theorem 2. In summary, these considerations highlight the importance of taking into account the bundle structure of spatiotemporal data when dealing with systems with dynamical symmetries.

Behavior of kernel integral operators in the infinite-delay limit
As discussed in Section 4.2, in general, the kernel integral operators K Q do not commute with the Koopman operatorsŨ t , and thus these families of operators do not share common eigenspaces. Nevertheless, as we establish in this section, under the conditions on kernels stated in Section 3.2, the sequence of operators K Q has an asymptotic commutativity property withŨ t as Q → ∞, allowing the kernel integral operators from VSA to approximate eigenspaces of Koopman operators.
In order to place our results in context, we begin by noting that an immediate consequence of the bundle construction described in Section 4.1 is that if the support of the measure ρ, denotedÃ ⊆ Ω, is connected as a topological space, then in the limit of no delays, Q = 1, the image ofÃ under the delay coordinate mapF 1 is a closed interval in R, and correspondingly the eigenfunctions φ j are pullbacks of orthogonal functions on that interval. In particular, becauseF 1 is equivalent to the vector-valued observation map, in the sense that F (x)(y) =F 1 ((x, y)), the eigenfunctions φ j of the Q = 1 operator corresponding to nonzero eigenvalues are continuous functions, constant on the level sets of the input signal. Therefore, in this limit, the recovered eigenfunctions will generally have comparable complexity to the input data, and thus be of limited utility for the purpose of decomposing complex signals into simpler patterns. Nevertheless, besides the strict Q = 1 limit, the φ j should remain approximately constant on the level sets of the input signal for moderately small values Q > 1, and this property should be useful in a number of applications, such as signal denoising and level set estimation (note that data-driven approximations to φ j become increasingly robust to noise with increasing Q [31]). Mathematically, in this small-Q regime VSA has some common aspects with nonlocal averaging techniques in image processing [47].
We now focus on the behavior of VSA in the infinite-delay limit, where the following is found to hold.
Theorem 3. Under the conditions on the kernels k Q stated in Section 3.2, the associated integral operators K Q converge as Q → ∞ in operator norm, and thus in spectrum, to the integral operator K ∞ associated with the kernel k ∞ . Moreover, K ∞ commutes with the Koopman operatorŨ t for all t ∈ R.
Proof. Since k Q and k ∞ all lie in H Ω ⊗ H Ω , K Q and K ∞ are Hilbert-Schmidt integral operators. As a result, the operator norm K Q − K ∞ is bounded above by k Q − k ∞ HΩ⊗HΩ , and the convergence of K Q − K ∞ to zero follows from the fact that lim Q→∞ k Q − k ∞ HΩ⊗HΩ = 0, as stated in the conditions in Section 3.2.
To verify that K ∞ andŨ t commute, we proceed analogously to the proof of Theorem 2, using the shift invariance of k ∞ in (16) and the fact thatΦ t = Φ t ⊗ I Y preserves the measure ρ to compute where the equalities hold for ρ-a.e. ω ∈ Ω. Pre-multiplying these expressions byŨ t leads to Theorem 3 generalizes the results in [30,31], where analogous commutativity properties between Koopman and kernel integral operators were established for scalar-valued observables. By virtue of the commutativity between K ∞ andŨ t , at large numbers of delays Q, VSA decomposes the signal into patterns with a coherent temporal evolution associated with intrinsic frequencies of the dynamical system. In particular, being a compact operator, K ∞ has finite-dimensional eigenspaces, W l , corresponding to nonzero eigenvalues, whereas the eigenspaces ofŨ t are infinite-dimensional, yet are spanned by eigenfunctions with a highly coherent (periodic) time evolution at the corresponding eigenfrequencies α j ∈ R, see Appendix A.1 for further details. The commutativity between K ∞ andŨ t allows us to identify finitedimensional subspaces W l of H Ω containing distinguished observables which are simultaneous eigenfunctions of K ∞ andŨ t . As shown in Appendix A.2, these eigenfunctions have the form where z jl is an eigenfunction of the Koopman operator U t on H X at eigenfrequency α jl , and ψ jl a spatial pattern in H Y . Note that here we use a two-index notation, z jl and α jl , for Koopman eigenvalues and eigenfrequencies, respectively, to indicate the fact that they are associated with the W l eigenspace of K ∞ . We therefore deduce from (25) that in the infinite-delay limit, the spatiotemporal patterns recovered by VSA have a separable, tensor-product structure similar to the conventional decomposition in (1) based on scalar kernel algorithms. It is important to note, however, that unlike (1), the spatial patterns ψ jl in (25) are not necessarily given by linear projections of the observation map onto the corresponding scalar Koopman eigenfunctions z jl ∈ H X (called Koopman modes in the Koopman operator literature [25]). In effect, taking into account the intrinsic structure of spatiotemporal data as vector-valued observables, allows VSA to recover more general spatial patterns than those associated with linear projections of observed data. Another consideration to keep in mind (which applies for many techniques utilizing delay-coordinate maps besides VSA) is that K ∞ can only recover patterns in a subspace D Ω of H Ω associated with the point spectrum of the dynamical system generating the data (i.e., the Koopman eigenfrequencies; see Appendix A.1). Dynamical systems of sufficient complexity will exhibit a non-trivial subspace D ⊥ Ω associated with the continuous spectrum, which does not admit a basis associated with Koopman eigenfunctions. One can show via analogous arguments to [30] that D ⊥ Ω is, in fact, contained in the nullspace of K ∞ , which is a potentially infinite-dimensional space not accessible from data. Of course, in practice, one always works with finitely many delays Q, which in principle allows recovery of patterns in D ⊥ Ω through eigenfunctions of K Q , and these patterns will not have an asymptotically separable behavior as Q → ∞ analogous to (25).
In light of the above considerations, we can therefore conclude that increasing Q from small values will impart changes to the topology of the base space B Q , and in particular the image of the supportÃ of ρ under π Q , but also the spectral properties of the operators K Q . On the basis of classical delay-embedding theorems [12], one would expect the topology of π Q (Ã) to eventually stabilize, in the sense that for every spatial point y ∈ Y the set A y = A × {y} ⊆Ã will map homeomorphically under π Q for Q greater than a finite number (that is, topologically, π Q (A y ) will be a "copy" of A). However, apart from special cases, K Q will continue changing all the way to the asymptotic limit Q → ∞ where Theorem 3 holds.
Before closing this section, we also note that while VSA does not directly provide estimates of Koopman eigenfrequencies, such estimates could be computed through Galerkin approximation techniques utilizing the eigenspaces of K Q at large Q as trial and test spaces, as done in [27,30,31] for scalar-valued Koopman eigenfunctions. A study of such techniques in the context of vector-valued Koopman eigenfunctions (equivalently, eigenfunctions in H Ω ) is beyond the scope of this work, though it is expected that their well-posedness and convergence properties should follow from fairly straightforward modification of the approach in [27,30,31].

Infinitely many delays with dynamical symmetries
As a final asymptotic limit of interest, we consider the limit Q → ∞ under the assumption that a symmetry group G acts on H Ω via unitary operators R g Ω , as described in Section 4.2. In that case, the commutation relations = 0 imply that there exist finite-dimensional subspaces of H Ω spanned by simultaneous eigenfunctions of R g Ω ,Ũ t , and K ∞ . We know from (25) that these eigenfunctions,z jl , are given by a tensor product between a Koopman eigenfunction z jl ∈ H X and a spatial pattern ψ jl ∈ H Y . It can further be shown (see Appendix B.2) that z jl and ψ jl are eigenfunctions of the unitary operators R g X and R g Y , i.e., R g X z jl = γ g X,jl z jl , R g Y ψ jl = γ g Y,jl ψ jl , |γ g X,jl | = |γ g Y,jl | = 1, and moreover the eigenvalues γ g X,jl and γ g Y,jl satisfy γ g X,jl γ g Y,jl = 1. In particular, we have R g Ω = R g X ⊗ R g Y , and the quantity γ g Ω,jl = γ g X,jl γ g Y,jl is equal to the eigenvalue of R g Ω corresponding toz jl , which is equal to 1 by Theorem 2.
In summary, every simultaneous eigenfunctionz jl of K ∞ ,Ũ t , and R g Ω is characterized by three eigenvalues, namely (i) a kernel eigenvalue λ l associated with K ∞ ; (ii) a Koopman eigenfrequency α jl associated with U t ; and (iii) a spatial symmetry eigenvalue γ g Y,jl (which can be thought of as a "wavenumber" on Y ).

Data-driven approximation
In this section, we consider the problem of approximating the eigenvalues and eigenfunctions of the kernel integral operators employed in VSA from a finite dataset consisting of time-ordered measurements of the vectorvalued observable F . Specifically, we assume that available to us are measurements F (x 0 ), F (x 1 ), . . . , F (x N −1 ) taken along an (unknown) orbit x n = Φ nτ (x 0 ) of the dynamics at the sampling interval τ , starting from an initial state x 0 ∈ X. We also consider that each scalar field F (x n ) ∈ H Y is sampled at a finite collection of points y 0 , y 1 , . . . , y S−1 in Y . Given such data, and without assuming knowledge of the underlying dynamical flow and/or state space geometry, our goal is to construct a family of operators, whose eigenvalues and eigenfunctions converge, in a suitable sense, to those of K Q , in an asymptotic limit of large data, N, S → ∞.
In essence, we seek to address a problem on spectral approximation of kernel integral operators from an unstructured grid of points (x n , y s ) in Ω.

Data-driven Hilbert spaces and kernel integral operators
An immediate consequence of the fact that the dynamics is unknown is that the invariant measure µ defining the Hilbert space H X = L 2 (X, µ) is also unknown (arguably, apart from special cases, µ would be difficult to explicitly determine even if Φ t were known). This means that instead of H X we only have access to an N -dimensional Hilbert space H X,N = L 2 (X, µ N ) associated with the sampling measure µ N = N −1 n=0 δ xn /N on the trajectory X N = {x 0 , . . . , x N −1 }, where δ xn is the Dirac probability measure supported at x n ∈ X. This space consists of equivalence classes of functions on X having common values on the finite set X N ⊂ X, and is equipped with the inner product f, g H X,N = N −1 n=0 f * (x n )g(x n )/N . Because every such equivalence class f is uniquely characterized by N complex numbers, f (x 0 ), . . . , f (x N −1 ), corresponding to the values of one of its representatives on X N , H X,N is isomorphic to C N , equipped with a normalized Euclidean inner product. Thus, we can represent every f ∈ H X,N by an N -dimensional column vector f = (f (x 0 ), . . . , f (x N −1 )) ∈ C N , and every linear operator A : H X,N → H X,N by an N × N matrix A such that Af is equal to the column-vector representation of Af . In particular, associated with every scalar kernel k : X × X → R is a kernel integral operator K N : H X,N → H X,N , acting on f ∈ H X,N according to the formula (cf. (4))

This operator is represented by an
In the setting of spatiotemporal data analysis, one has to also take into account the finite sampling of the spatial domain, replacing H Y = L 2 (Y, ν) by the S-dimensional Hilbert space H Y,S = L 2 (Y, ν S ) associated with a discrete measure ν S = S−1 s=0 w s,S δ ys . Here, the w s,S are positive quadrature weights such that given any continuous function f : Y → C, the quantity and represented by the (N S) × (N S) matrix K = [k Q (ω mr , ω ns )/N ]. Solving the eigenvalue problem for K Q,N S (which is equivalent to the matrix eigenvalue problem for K) leads to eigenvalues λ j,N S ∈ R and eigenfunctions φ j,N S ∈ H Ω,N S . We consider λ j,N S and φ j,N S as data-driven approximations to their eigenvalues and eigenfunctions of K Q from (12), respectively. The convergence properties of this approximation will be made precise in Section 5.2.
A similar data-driven approximation can be performed for operators based on the Markov kernels p Q from Section 3.3, which is our preferred class of kernels for VSA. However, in this case the kernels p Q,N S : Ω×Ω → R associated with the approximating operators P Q,N S on H Ω,N S are Markov-normalized with respect to the measure ρ N S , i.e., Ω p Q,N S (ω, ·) dρ N S = 1, so they acquire a dependence on N and S. Further details on this construction and its convergence properties can be found in Appendix D.

Spectral convergence
For a spectrally-consistent data-driven approximation scheme, we would like to able to establish that, as N and S increase, the sequence of eigenvalues λ j,N S of K N S converges to eigenvalue λ j of K, and for an eigenfunction φ j of K corresponding to λ j there exists a sequence of eigenfunctions φ j,N S of K N S converging to it. While convergence of eigenvalues can be unambiguously understood in terms of convergence of real numbers, in the setting of interest here a suitable notion of convergence of eigenfunctions (or, more generally, eigenspaces) is not obvious, since φ j,N S and φ j lie in fundamentally different spaces. That is, there is no natural way of mapping equivalence classes of functions with respect to ρ N S (i.e., elements of H Ω,N S ) to equivalence classes of functions with respect to ρ (i.e., elements of H Ω ), allowing one, e.g., to establish convergence of eigenfunctions in H Ω norm. This issue is further complicated by the fact, that in many cases of interest, the support A of the invariant measure µ is a non-smooth subset of X of zero Lebesgue measure (e.g., a fractal attractor), and the sampled states x n do not lie exactly on A (as that would require starting states x 0 drawn from a measure zero subset of X, which is not feasible experimentally). In fact, the issues outlined above are common to many other data-driven techniques for analysis of dynamical systems besides VSA (e.g., POD and DMD), yet to our knowledge have not received sufficient attention in the literature.
Here, following [30], we take advantage of the fact that, by the assumed continuity of VSA kernels, every kernel integral operator K Q : H Ω → H Ω from Section 3.2 can be also be viewed as an integral operator on the space C(V) of continuous functions on any compact subset V ⊂ Ω containing the support of ρ. This integral operator, denoted byK Q : C(V) → C(V), acts on continuous functions through the same integral formula as (12), although the domains of K Q andK Q are different. It is straightforward to verify that every eigenfunction φ j ∈ H Ω of K Q at nonzero eigenvalue λ j has a unique continuous representativeφ j ∈ C(V), given byφ andφ j is an eigenfunction ofK Q at the same eigenvalue λ j . Assuming further that V also contains the supports of the measures ρ N S for all N, S ≥ 1, we can defineK Q,N S : C(V) → C(V) analogously to (27). Then, every eigenfunction φ j,N S ∈ H Ω,N S of K Q,N S at nonzero corresponding eigenvalue λ j,N S has a unique continuous representativeφ j,N S , with which is an eigenfunction ofK Q,N S at the same eigenvalue λ j,N S . As is well known, the space C(V) equipped with the uniform norm f C(V) = max ω∈V |f (ω)| becomes a Banach space, and it can further be shown thatK Q,N S andK Q are compact operators on this space. In other words, C(V) can be used as a universal space to establish spectral convergence ofK Q,N S toK Q , using approximation techniques for compact operators on Banach spaces [48]. In [20], Von Luxburg et al. use this approximation framework to establish convergence results for spectral clustering techniques, and their approach can naturally be adapted to show that, under natural assumptions,K Q,N S indeed converges in spectrum toK Q .
In Appendix D, we prove the following result: Theorem 4. Suppose that V ⊆ Ω is a compact set containing the supports of ρ and the family of measures ρ N S , and assume that ρ N S converges weakly to ρ, in the sense that Then, for every nonzero eigenvalue λ j of K Q , including multiplicities, there exist positive integers N 0 , S 0 such that the eigenvalues λ j,N S of K Q,N S with N ≥ N 0 and S ≥ S 0 converge, as N, S → ∞, to λ j . Moreover, for every eigenfunction φ j ∈ H Ω of K corresponding to λ j , there exist eigenfunctions φ j,N S of K Q,N S corresponding to λ j,N S , whose continuous representativesφ j,N S from (29) converge uniformly on V toφ j from (28). Moreover, analogous results hold for the eigenvalues and eigenfunctions of the Markov operators P Q,N S and P Q .
A natural setting where the conditions stated in Theorem 4 are satisfied are dynamical systems with compact absorbing sets and associated physical measures. Specifically, for such systems we shall assume that there exists a Lebesgue measurable subset U of the state space manifold X, such that (i) U is forwardinvariant, i.e., Φ t (U) ⊆ U for all t ≥ 1; (ii) the topological closure U is a compact set containing the support of µ; (iii) U has positive Lebesgue measure in X; and (iv) for any starting state x 0 ∈ U, the corresponding sampling measures µ N converge weakly to µ, i.e., lim N →∞ X f dµ N = X f dµ for all f ∈ C(X). Invariant measures exhibiting Properties (iii) and (iv) are known as physical measures [49]; in such cases the set U is called a basin of µ. Clearly, Properties (i)-(iv) are satisfied if Φ t : X → X is flow on a compact manifold with an ergodic invariant measure supported on the whole of X, but are also satisfied in more general settings, such as certain dissipative flows on noncompact manifolds (e.g., the Lorenz 63 system on X = R 3 [50]). Assuming further that the measures ν S associated with the sampling points y 0 , . . . , y S−1 and the corresponding quadrature weights w 0,S , . . . , w S−1,S on the spatial domain Y converge weakly to ν, i.e., lim S→∞ Y g dν S = Y g dν for every g ∈ C(Y ), the conditions in Theorem 7 are met with V = U × Y , and the measures ρ N S constructed as described in Section 5.1 for any starting state x 0 ∈ U. Under these conditions, the data-driven spatiotemporal patterns φ j,N S recovered by VSA converge for an experimentally accessible set of initial states in X.
6 Application to the Kuramoto-Sivashinsky Model

Overview of the Kuramoto Sivashinsky model
The KS model, originally introduced as a model for wave propagation in a dissipative medium [41], or laminar flame propagation [42], is one of the most widely studied dissipative PDEs displaying spatiotemporal chaos. On a one-dimensional spatial domain Y = [0, L], L ≥ 0, the governing evolution equation for the real-valued scalar field u(t, ·) : Y → R, t ≥ 0, is given bẏ where ∇ and ∆ = −∇ 2 are the derivative and (positive definite) Laplace operators on Y , respectively. In what follows, we always work with periodic boundary conditions, u(t, 0) = u(t, L), ∇u(t, 0) = ∇u(t, L), . . . , for all t ≥ 0. The domain size parameter L controls the dynamical complexity of the system. At small values of this parameter, the trivial solution u = 0 is globally asymptotically stable, but as L increases, the system undergoes a sequence of bifurcations, marked by the appearance of steady spatially periodic modes (fixed points), then traveling waves (periodic orbits), and progressively more complicated solutions leading to chaotic behavior for L 4 × 2π [51][52][53][54][55].
A fundamental property of the KS system is that it possesses a global compact attractor, embedded within a finite-dimensional inertial manifold of class C r , r ≥ 1 [4,[56][57][58][59][60]. That is, there exists a C r submanifold X of the Hilbert space H Y = L 2 (Y, ν) with ν set to the Lebesgue measure, which is invariant under the dynamics, and to which the solutions u(t, ·) are exponentially attracted. This means that after the decay of initial transients, the effective degrees of freedom of the KS system, bounded above by the dimension of X , is finite. Dimension estimates of inertial manifolds [60,61] and attractors [62] of the KS system as a function of L indicate that the system exhibits extensive chaos, i.e., unbounded growth of the attractor dimension with L. As is well known, analogous results to those outlined above are not available for many other important models of complex spatiotemporal dynamics such as the Navier-Stokes equations.
For our purposes, the availability of strong theoretical results and rich spatiotemporal dynamics makes the KS model particularly well-suited to test the VSA framework. In our notation, an inertial manifold X of the KS system will act as the state space manifold X, which is embedded in this case in H Y . Moreover, the compact invariant set A will be a subset of the global attractor supporting an ergodic probability measure, µ. On X, the dynamics is described by a C r flow map Φ t : X → X, t ∈ R, as in Section 2.1. In particular, for every initial condition x 0 ∈ X, the orbit t → x(t) = Φ t (x 0 ) with t ≥ 0 is the unique solution to (31) with initial condition x 0 . While in practice the initial data will likely not lie on X, the exponential tracking property of the dynamics ensures that for any admissible initial condition u ∈ H Y there exists a trajectory x(t) on X to which the evolution starting from u converges exponentially fast.
As stated in Section 5.2, for data-driven approximation purposes, we will formally assume that the measure µ is physical. While, to our knowledge, there are no results in the literature addressing the existence of physical measures (with appropriate modifications to account for the infinite state space dimension) specifically for the KS system, recent results [63,64] on infinite-dimensional dynamical systems that include the class of dissipative systems in which the KS system belongs indicate that analogs of the assumptions made in Section 5.2 should hold.
Another important feature of the KS system is that it admits nontrivial symmetry group actions on the spatial domain Y , which have played a central role in bifurcation studies of this system [51][52][53][54]. In particular, it is a direct consequence of the structure of the governing equation (31) and the periodic boundary conditions that if u(x, t) is a solution, then so are u(x + α, t) and u(−x, t), where α ∈ R. As discussed in Section 4.2, this implies that the dynamics on the inertial manifold is equivariant under the actions induced by the orthogonal group O(2) and the reflection group on the circle. In particular, under the assumption that the O(2) action preserves µ, the theoretical spatial patterns recovered by POD and comparable eigendecomposition techniques would be linear combinations of finitely many Fourier modes [6], which are arguably non-representative of the complex spatiotemporal patterns generated by the KS system. We emphasize that the existence of symmetries does not necessarily imply that they are inherited by data-driven operators for extracting spatial and temporal patterns constructed from a single orbit of the dynamics, since, e.g., the ergodic measure sampled by that orbit may evolve singularly under the symmetry group action. While studies have determined that this type of symmetry breaking indeed occurs at certain dynamical regimes of the KS system [37], the presence of symmetries still dominates the leading spatial patterns recovered by POD and comparable eigendecomposition techniques utilizing scalar-valued kernels.

Analysis datasets
In what follows, we present applications of VSA to data generated by the KS model at the chaotic regimes L = 22 and L = 94; these "standard" regimes have been investigated extensively in the literature (e.g., [54,55]). We have integrated the model using the publicly available Matlab code accompanying Ref. [65]. This code is based on a Fourier pseudospectral discretization, and utilizes a fourth-order exponential timedifferencing Runge-Kutta integrator appropriate for stiff problems. Throughout, we use 65 Fourier modes (which is equivalent to a uniform grid on Y with S = 65 gridpoints and uniform quadrature weights, w s,S = 1/S), and a timestep of τ = 0.25 natural time units.
Each of the experiments described below starts from initial conditions given by setting the first four Fourier coefficients to 0.6 and the remaining 61 to zero. Before collecting data for analysis, we let the system equilibrate near its attractor for a time interval of 2500 natural time units. We compute spatiotemporal patterns using the eigenfunctions φ j,N S of the data-driven Markov operator P Q,N S as described in Section 5. For simplicity, for the rest of this section we will drop the N and S subscripts from our notation for φ j,N S .
In one of our L = 22 experiments, we also compare our results with spatiotemporal patterns computed via POD/PCA and NLSA (see Section 2). The PCA and NLSA methods are applied to the same KS data as VSA, and in the case of NLSA we use the same number of delays. The POD patterns are computed via (1), whereas those from NLSA are obtained via a procedure originally introduced in the context of SSA [9]. This procedure involves first reconstructing in delay-coordinate space through (1) applied to the observation mapF Q from (14), and then projecting down to physical data space by averaging over consecutive delay windows; see [9,22] for additional details. Empirically, this reconstruction approach is known to be more adept at capturing propagating signals than direct reconstruction of the observation map F via (1), though in the KS experiments discussed below the results from the two-step NLSA/SSA reconstruction and direct reconstruction are very similar.

Results and discussion
We begin by presenting VSA results obtained from dataset of N = 1000 samples taken at timestep τ = 0.25 natural time units, using a small number of delays, Q = 15. According to Section 4.3.1, at this small Q value VSA is expected to yield eigenfunctions φ j , which are approximately constant on the level sets of the input signal, and, with increasing j, capture smaller-scale variations in the directions transverse to the level sets. As is evident in Fig. 1, the leading three nonconstant eigenfunctions, φ 1 , φ 2 , and φ 3 , indeed display this behavior, featuring wavenumbers 2, 3, and 4, respectively, in the directions transverse to the level sets. This behavior continues for eigenfunctions φ j with higher j.
To assess the efficacy of these patterns in reconstructing the input signal, we compute their fractional explained variances where φ j H Ω,N S = 1 since we use normalized eigenfunctions. For the φ 1 , φ 2 , and φ 3 eigenfunctions in Fig. 1, these quantities are 0.91, 5.2 × 10 −4 , and 0.016, respectively, which demonstrates that even the one-term (l = 1) reconstruction via (8) captures most of the signal variance. Next, we consider longer datasets with N = 10,000 samples (2500 natural time units), at L = 22 and 94, analyzed using Q = 500 delays. The raw data and representative VSA eigenfunctions from these analyses, as well as NLSA results for L = 22, are displayed in Figures 2 and 3, respectively. Figure 4 highlights a portions of the raw data and VSA eigenfunctions for L = 22 over an interval spanning 1000 time units. Figure 5 shows the L = 22 raw data and the leading five (in ordered of explained variance) spatiotemporal patterns recovered from this dataset by POD. As is customary, we order the recovered POD patterns in order of decreasing explained variance of the input data. In the case of NLSA, the patterns are ordered in decreasing order of the corresponding eigenvalue of the Markov operator associated with the NLSA kernel (analogous to the P Q operator in VSA, but acting on the H X space of scalar-valued observables). Note that since the vector-valued eigenfunctions from VSA are directly interpretable as spatiotemporal patterns (see Section 3), and the VSA decomposition from (8) is given by linear combinations of eigenfunctions with scalar-valued coefficients c j , comparing VSA eigenfunctions with PCA and NLSA spatiotemporal patterns (which are formed by products of scalar-valued eigenfunctions of the corresponding kernel integral operators with spatial patterns) is meaningful, but since our depicted VSA patterns do not include multiplication by c j , these comparisons are to be made only up to scale.
At large Q, we expect the eigenfunctions from VSA to lie approximately in finite-dimensional subspaces of the Hilbert space H of vector-valued observables associated with the point spectrum of the Koopman operator, thus acquiring timescale separation. This is clearly the case in the L = 22 eigenfunctions in Figs. 2 and 4, where φ 1 is seen to capture the evolution of the wavenumber L/2 structures, whereas φ 10 and φ 15 recover smaller-scale traveling waves embedded within the large-scale structures with a general direction of propagation either to the right (φ 10 ) or left (φ 15 ). The fractional explained variances associated with eigenfunctions φ 1 , φ 10 , and φ 15 are 0.23, 0.047, and 0.017, respectively. As expected, these values are smaller than the 0.91 value due to eigenfunction φ 1 for Q = 15, but are still fairly high despite the intermittent nature of the input signal. Ranked with respect to fractional explained variance, φ 1 , φ 10 , and φ 15 are the first, fourth, and fifth among the Q = 200 VSA eigenfunctions.
In contrast, while the patterns from NLSA successfully separate the slow and fast timescales in the input signal (as expected theoretically at large Q [30,31]), they are significantly less efficient in capturing its salient spatial features. Consider, for example, the leading two NLSA patterns shown in Fig. 2(e,f). These patterns are clearly associated with the O(2) family of wavenumber L/2 structures in the raw data, but because they have a low rank, they are unable to represent the intermittent spatial translations of these patterns produced by chaotic dynamics in this regime. Their fractional explained variances are 0.13 and 0.15, respectively. Qualitatively, it appears that the NLSA patterns in Fig. 2(e, f) isolate periods during which   Fig. 2(a-d), but for the time interval [1625,2625], highlighting the features of the small-scale waves in (c, d).  Fig. 2, but for spatiotemporal patterns recovered by POD/PCA. the wavenumber L/2 structures are quasistationary, and translated relative to each other by L/4. In other words, it appears that NLSA captures the unstable equilibria that the system visits in the analysis time period through individual patterns, but does not adequately represent the transitory behavior associated with heteroclinic orbits connecting this family of equilibria. Moreover, due to the presence of the continuous O(2) symmetry, a complete description of the spatiotemporal signal associated with the wavenumber L/2 structures would require several modes. In contrast, VSA effectively captures this dynamics through a small set of leading eigenfunctions.
As can be seen in Fig. 5, POD would also require several modes to capture the wavenumber L/2 unstable equilibria, but in this case the recovered patterns also exhibit an appreciable amount of mixing of the slow timescale characteristic of this family with faster timescales. Modulo this high-frequency mixing, the first (second) POD pattern appears to resemble the first (second) NLSA pattern. The fractional explained variance of the leading two POD patterns, amounting to 0.23 and 0.22, respectively, is higher than the corresponding variances from NLSA, but this is not too surprising given their additional frequency content.
To summarize, these results demonstrate that NLSA improves upon PCA in that it achieves timescale separation through the use of delay-coordinate maps, and VSA further improves upon NLSA in that it quotients out the O(2) symmetry of the system, allowing efficient representation of intermittent space-time signals associated with heteroclinic dynamics in the presence of this symmetry. In separate calculations, we have verified that the L = 22 VSA patterns are robust under corruption of the data by i.i.d. Gaussian noise of variance up to 40% of the raw signal variance.
Turning now to the L = 94 experiments, it is evident from Fig. 3(b) that the dynamical complexity in this regime is markedly higher than for L = 22, as multiple traveling and quasistationary waves can now be accommodated in the domain, resulting in a spatiotemporal signal with high intermittency in both space and time. Despite this complexity, the recovered eigenfunctions ( Fig. 3(b-f)) decompose the signal into a pattern φ 1 that captures the evolution of unstable fixed points and the heteroclinic connections between them, and other patterns, φ 3 , φ 5 , φ 8 , and φ 11 , dominated by traveling waves. The fractional explained variances associated with these patterns are 6.0 × 10 −3 (φ 1 ), 0.020 (φ 3 ), 0.040 (φ 5 ), 0.067 (φ 8 ), and 0.066 (φ 11 ); that is, in this regime the traveling wave patterns are dominant in terms of explained variance. In general, the variance explained by individual eigenfunctions at L = 94 is smaller than those identified for L = 22, consistent with the higher dynamical complexity of the former regime. It is worthwhile noting that L = 94 eigenfunction φ 1 bears some qualitative similarities with the covariant Lyapunov vector (CLV) patterns identified at a nearby (L = 96) KS regime in [55] (see Fig. 2 of that reference). Other VSA patterns also display qualitatively similar features to φ 1 and to CLVs. While such similarities are intriguing, they should be interpreted with caution as the existence of connections between VSA and CLV techniques is an open question.

Conclusions
We have presented a method for extracting spatiotemporal patterns from complex dynamical systems, which combines aspects of the theory of operator-valued kernels for machine learning with delay-coordinate maps of dynamical systems. A key element of this approach is that it operates directly on spaces of vector-valued observables appropriate for dynamical systems generating spatially extended patterns. This allows the extraction of spatiotemporal patterns through eigenfunctions of kernel integral operators with far more general structure than those captured by pairs of temporal and spatial modes in conventional eigendecomposition techniques utilizing scalar-valued kernels. In particular, our approach enables efficient and physically meaningful decomposition of signals with intermittency in both space and time, while naturally factoring out dynamical symmetries present in the data. By incorporating delay-coordinate maps, the recovered patterns lie, in the asymptotic limit of infinitely many delays, in finite-dimensional invariant subspaces of observables associated with the point spectrum of the Koopman operator of the system. This endows these patterns with high dynamical significance and the ability to decompose multiscale signals into distinct coherent modes. We demonstrated with applications to the KS model in chaotic regimes that VSA recovers intermittent patterns, such as heteroclinic orbits associated with translation families of unstable fixed points and traveling waves, with significantly higher skill than comparable eigendecomposition techniques operating on spaces of scalar-valued observables. We anticipate this framework to be applicable across a broad range of disciplines dealing with complex spatiotemporal data.

A.1 Basic properties of Koopman operators and their eigenfunctions
In this appendix, we outline some of the basic properties of the Koopman operator U t acting on scalarvalued observables in H X and its liftŨ t acting on scalar-valued observables in H Ω (and, by the isomorphism H H Ω , on vector-valued observables in H). Additional details on these topics can be found in one of the many references in the literature on ergodic theory, e.g., [33,34,66].
We begin by noting that for the class of C 1 measure-preserving dynamical systems on manifolds studied here (see Section 2.1), the group U = {U t } t∈R of Koopman operators is a strongly continuous unitary group. This means that for every f ∈ H X , the map t → U t f is continuous with respect to the H X norm at every t ∈ R. By Stone's theorem, strong continuity of U implies that there exists an unbounded, skew-adjoint operator V : D(V ) → H X with dense domain D(V ) ⊂ H X , called the generator of U , such that U t = e tV . This operator completely characterizes Koopman group. Its action on an observable f ∈ D(V ) is given by where the limit is taken with respect to the H X norm. If f is a differentiable function in C 1 (X), then where v is the vector field of the dynamics. A distinguished class of observables in H X are the eigenfunctions of the generator of the Koopman group. Every such eigenfunction, z j , satisfies the equation where α j is a real frequency, intrinsic to the dynamical system. In the presence of ergodicity (assumed here), all eigenvalues of V are simple, and eigenfunctions corresponding to distinct eigenvalues are orthogonal. Moreover, the eigenfunctions can be normalized so that |z j (x)| = 1 for µ-almost every x ∈ X. That is, Koopman eigenfunctions of ergodic dynamical systems can be normalized to take values on the unit circle in the complex plane, much like the functions e iωt in Fourier analysis.
Every eigenfunction z j of V at eigenvalue iα j is also an eigenfunction of U t , corresponding to the eigenvalue Λ t j = e iαj t . This means that along an orbit of the dynamical system, z j evolves purely by multiplication by a periodic phase factor, viz.
where the equality holds for µ-almost every x ∈ X. This property makes Koopman eigenfunctions highly predictable observables, which warrant identification from data. In general, the evolution of any observable f lying in the closed subspace D X = span{z j } of H X spanned by Koopman eigenfunctions has the closed-form expansion This shows that the evolution of observables in D X can be characterized as a countable sum of Koopman eigenfunctions with time-periodic phase factors. Koopman eigenvalues and eigenfunctions of ergodic systems also have an important group property; namely, if z j and z k are eigenfunctions of V at eigenvalue iα j and iα k , respectively, then the product z j z k is also an eigenfunction, corresponding to the eigenvalue i(α j + α k ). Thus, the eigenvalues and eigenfunctions of the Koopman generator form groups, with addition of complex numbers and multiplication of complex-valued functions acting as the group operations, respectively. If, in addition, the eigenfunctions are continuous (which is assumed here), these groups are finitely generated. Specifically, in that case there exists a finite collection of rationally-independent eigenfrequenciesα 1 , . . . ,α l , such that every eigenfrequency has the form α j = l k=1 j kαk , where j = (j 1 , . . . , j l ) is a vector of integers. Moreover, the Koopman eigenfunction corresponding to eigenfrequency α j is given by z j =z j1 1 · · ·z j l l , wherez 1 , . . . ,z l are the eigenfunctions corresponding toα 1 , . . . ,α l , respectively. It follows from these facts in conjunction with (33) that the evolution of every observable in D X can be uniquely determined given knowledge of finitely many Koopman eigenfunctions and their corresponding eigenfrequencies.
Yet, despite these attractive properties, in typical systems, not every observable will admit a Koopman eigenfunction expansion as in (33); that is, D X will generally be a strict subspace of H X . In such cases, we have the orthogonal decomposition which is invariant under the action of U t for all t ∈ R. For observables in the orthogonal complement D ⊥ X of D X dynamical evolution is not determined by (33), but rather by a spectral expansion involving a continuous spectral density (intuitively, an uncountable set of frequencies). This evolution can exhibit the characteristic behaviors associated with chaotic dynamics, such as decay of temporal correlations. In particular, it can be shown that for any f ∈ D ⊥ X and g ∈ H X , the quantity t 0 | g, U s f H X | ds/t vanishes as |t| → ∞. We now turn to the unitary groupŨ = {Ũ t } t∈R associated with the Koopman operatorsŨ t on H Ω . As stated in Section 4.2.2, these operators are obtained by a trivial liftŨ t = U t ⊗ I H Y of the Koopman operators on H X ; equivalently, we haveŨ t f = U tΦt , whereΦ t = Φ t ⊗ I Y , and I Y is the identity map on Y . The groupŨ is generated by the densely-defined, skew-adjoint operatorṼ : D(Ṽ ) → H Ω , which is an extension of V ⊗ I H Y . Moreover, analogously to the decomposition in (34), there exists an orthogonal decomposition which is invariant underŨ t for all t ∈ R. It is straightforward to verify that the eigenvalues ofŨ t are identical to those of U t , i.e., Λ t = e αt for some eigenfrequency α ∈ R, and every eigenfunctionz at eigenvalue Λ t has the formz = z ⊗ f, where z ∈ H X is an eigenfunction of U t at the same eigenvalue (unique up to normalization by ergodicity), and f an arbitrary spatial pattern in H Y .

A.2 Common eigenfunctions with kernel integral operators
We now examine the properties of common eigenfunctions between the Koopman operatorsŨ t on H Ω and the kernel integral operators K ∞ from Theorem 3 with which they commute. In particular, letz ∈ H Ω be an eigenfunction ofŨ t at eigenvalue Λ t . Then, which implies that K ∞ z is also an eigenfunction ofŨ t at the same eigenvalue. As stated in Appendix A.1, the eigenvalues ofŨ t are identical to those of the Koopman operator U t on H A . However, unlike those of U t , the eigenvalues ofŨ t are not simple, and we cannot conclude that K ∞z = λz for some number λ; i.e., it is not necessarily the case thatz is also an eigenfunction of K ∞ (despite the fact that (37) implies that every eigenspace ofŨ t is invariant under K ∞ ). In fact, the eigenspaces ofŨ t are infinite-dimensional, and there is no a priori distinguished set of spatiotemporal patterns in each eigenspace.
To identify a distinguished set of spatiotemporal patterns associated with Koopman eigenfunctions, we take advantage of the fact that K ∞ is a compact operator with finite-dimensional eigenspaces corresponding to nonzero eigenvalues. For each such eigenspace, there exists an orthonormal basis consisting of simultaneous eigenfunctions of K ∞ andŨ t . To verify this explicitly, let W l ⊂ H Ω be the eigenspace of K ∞ corresponding to eigenvalue λ l = 0, and f an arbitrary element of W l . Since we can conclude thatŨ t f ∈ W l ; i.e., that W l is a finite-dimensional invariant subspace of H Ω underŨ t . Choosing an orthonormal basis {φ 1l , . . . , φ m l l } for this space, where m l = dim W l , we can expand f = m l j=1 c j φ jl with c j = φ jl , f HΩ , and computẽ By unitarity ofŨ t , the m l × m l matrix U with elementsŨ ij is unitary, and therefore unitarily diagonalizable. Let then {v j } m l j=1 with v j = (v 1j , . . . , v m l j ) be an orthonormal basis of C m l consisting of eigenvectors of U, and Λ t 1l , . . . , Λ t m l l be the corresponding eigenvalues. It is a direct consequence of (38) that the set {z 1l , . . . ,z m l l } withz jl = m l k=1 v kj φ kl is an orthonormal basis of W l consisting of Koopman eigenfunctions corresponding to the eigenvalues Λ t jl , which much thus be given by Λ jl = e iα jl t for some Koopman eigenfrequency α jl ∈ R. Since every element of W l is an eigenfunction of K ∞ , we conclude that thez jl are simultaneous eigenfunctions ofŨ t and K ∞ .

B.1 Basic definitions
Let G be a topological group with a left action on the spatial domain Y . By that, we mean that there exists a map Γ Y : G × Y → Y with the following properties:

C Construction and properties of VSA kernels
In this appendix, we describe in detail some aspects of the VSA kernel construction, namely the choice of distance scaling function for the kernels k Q in (19) (Appendix C.1), and the normalization procedure to obtain the Markov kernels p Q (Appendix C.2). We also establish some results on the behavior of these kernels in the limit of infinite delays, Q → ∞. Throughout, M = A × Y ⊆ Ω will denote the compact support of the measure ρ.

C.1 Choice of scaling function
The distance scaling function s Q is based on the corresponding function introduced in [31], which has dependencies on both the local sampling density and time tendency of the data.

C.1.1 Local density function
Letk Q : Ω × Ω → R denote the unscaled Gaussian kernel from (18), andK Q : H Ω → H Ω the corresponding integral operator. Following [44], we define the function which is continuous, strictly positive, and bounded away from zero on compact sets. It is a standard result from the theory of kernel density estimation that if the support of the measure ρ is a smooth, m-dimensional Riemannian manifold, denoted M , and the delay-coordinate observation mapF Q : Ω → R Q from (14) is an embedding of M into R Q , then, as → 0, the quantityσ Q (ω) = σ Q (ω)/(2π m/2 ) converges for every ω ∈ M to the density dρ dvol (ω) of the measure ρ with respect to the Riemannian measure vol on M . Moreover,σ Q has physical dimension (units) of length −m , and as a resultσ −1/m Q assigns a characteristic length at each point in M . Here, we do not assume that M has the structure of a smooth manifold, so we will not be taking → 0 limits. However, due to the exponential decay ofk Q (ω, ω ) with respect to distance in delay-coordinate space R Q , we can still interpret σ Q from (41) as a local density-like function.

C.1.2 Phase space velocity
Throughout this section, we will assume that the pointwise observation map F y : X → R is of class C 2 for every y ∈ Y , which is equivalent to assuming that the vector-valued observation map F lies in the space C 2 (X; C(Y )). This assumption is natural for a wide class of observation maps encountered in applications. In particular, it implies that the "energy" of the signal, expressed in terms of the Koopman generatorṼ from Appendix A as Ω |Ṽ ( F )| 2 dρ is finite. Under this condition, the phase space speed function ξ Q : Ω → C, defined as where is continuously differentiable with respect to x. Note that ξ Q (x, y) may vanish, e.g., if y lies in the boundary of Y and F obeys time-independent boundary conditions. In the special case Q = 1, ξ Q will also vanish at local maxima/minima of the signal with respect to time. Phase space speed functions analogous to ξ Q were previously employed in NLSA [22] and related kernel [67] and Koopman operator techniques [31]. For reasons that will be made clear below, we will adopt the approach introduced in [31, Section 6], which utilizes ξ Q in such a way so that if ξ Q (ω) is zero, then s Q (ω) vanishes too.

C.1.3 Scaling function
With the density and phase space speed functions from Appendices C.1.1 and C.1.2, respectively, we define the continuous scaling function where γ is a positive parameter. This definition is motivated by [31], where it was shown that an analogous scaling function employed in scalar-valued kernels for ergodic dynamical systems on compact Riemannian manifolds can be interpreted, for an appropriate choice of γ and in a suitable limit of vanishing kernel bandwidth parameter , as a conformal change of Riemannian metric that depends on the vector field of the dynamics. More specifically, a Markov operator analogous to P Q from Section 3.3 was shown to approximate the heat semigroup generated by a Laplace-Beltrami operator associated with this conformally transformed metric. In [31], this change of geometry was associated with a rescaling of the vector field of the dynamics (i.e., a time change of the dynamical system [68]) that was found to significantly improve the conditioning of kernel algorithms if the system has fixed points. In particular, for a dataset consisting of finitely many samples, the sampling of the state space manifold near a fixed point will become highly anisotropic, as most of the near neighbors of datapoints close to the fixed point will lie along the sampled orbit of the dynamics (which is a one-dimensional set), and the directions transverse to the orbit will be comparatively undersampled. The latter is because the phase speed of the system becomes arbitrarily small near a fixed point, meaning that most geometrical nearest neighbors of a data point in its vicinity will lie on a single orbit. Choosing the scaling function (analogous to s Q ) such that it vanishes at the fixed point is tantamount to increasing the bandwidth 1/s Q of the kernel by arbitrarily large amounts, thus improving sampling in directions transverse to the orbit. While the arguments above are strictly valid only in the smooth manifold case (as they rely on → 0 limits), s Q in (43) should behave similarly in regions of the product space Ω where the rate-of-change of the observed data (measured by ξ Q in (42)) is small. As stated in Section 3.1, ξ Q can vanish or be small not only at fixed points of the dynamics on X, but also at points y ∈ Y where the observable F y is constant or nearly constant (e.g., near domain boundaries).
What remains for a complete specification of s Q is to set the exponent parameter γ. According to [31], if M is an m-dimensional manifold embedded in R Q , the Riemannian metric associated with P Q for the choice γ = 1/m has compatible volume form with the invariant measure of the dynamics, in the sense that the corresponding density dρ dvol is a constant. Moreover, the induced metric is also invariant under a class of conformal changes of observation map F Q . In practice, M will not be a smooth manifold, but we can still assign to it an effective dimension by examining the dependence of the kernel integrals κ = Ω×Ωk Q dρ × dρ (or the corresponding data-driven quantity κ N S = Ω×Ωk Q dρ N S × dρ N S , where the measures ρ N S are defined in Section 5) as a function of the bandwidth parameter . As shown in [44,69], d log κ/d log can be interpreted as an effective dimension for at the scale associated by the bandwidth parameter . This motivates an automatic bandwidth tuning procedure where is chosen as the maximizer of that function, and the corresponding maximum valuem provides an estimate of M 's dimension.
Here, we nominally set γ = 1/m withm determined via the method just described. The results presented in Section 6 are not too sensitive with respect to changes of γ around that value. In fact, for the systems studied here, the results remain qualitatively robust even if the velocity-dependent terms are not included in s Q and s Q,N . That is, qualitatively similar results can also be obtained using the scaling function s Q = σ Q , which is continuous even if F is not continuously differentiable.

C.2 Markov normalization
Following the approach taken in NLSA algorithms and in [23,27,30,31], we will construct a Markov kernel p Q : Ω × Ω → R from a strictly positive, symmetric kernel k Q : Ω × Ω → R, meeting the conditions in Section 3.2, by applying the normalization procedure introduced in the diffusion maps algorithm [18] and further developed in the context of general exponentially decaying kernels in [21]. For that, we first compute the functions where 1 Ω is the function on Ω equal to 1 at every point. By the properties of k Q and compactness of the support of ρ, both r Q and l Q are continuous, positive functions on Ω, bounded away from zero on compact sets. We then define the kernel p Q by and the Markov property follows by construction. In [21], the division of k Q (ω, ω ) by l Q (ω) and r Q (ω ) to form p Q (ω, ω ) is referred to as left and right normalization, respectively. Because r Q and l Q are positive and bounded away from zero on compact sets, p Q is continuous. In general, a kernel of the class in (44), is not symmetric, and as a result the corresponding integral operator P Q : H Ω → H Ω is not self-adjoint. Nevertheless, by symmetry of k Q , P Q is related to a self-adjoint compact operatorP Q : H Ω → H Ω by a similarity transformation. In particular, let f be a bounded function in L ∞ (Ω, ρ), and T f : H Ω → H Ω the corresponding multiplication operator by f . That is, for g ∈ H Ω , T f g is the function equal to f (ω)g(ω) for ρ-a.e. ω ∈ Ω. DefiningP Q : H Ω → H Ω as the self-adjoint kernel integral operator associated with the symmetric kernel one can verify thatP Q can be obtained from P Q through the similarity transformation Due to (45), P Q andP Q have the same eigenvalues λ j , which are real by self-adjointness ofP Q , and thus admit the ordering 1 = λ 0 > λ 1 ≥ λ 2 ≥ · · · since P Q is ergodic and Markov. Moreover, by compactness of P Q andP Q , the eigenvalues have a single accumulation point at zero, and the eigenspaces corresponding to the nonzero eigenvalues are finite-dimensional. SinceP Q is self-adjoint, there exists an orthonormal basis {φ j } ∞ j=0 of H Ω consisting of real eigenfunctionsφ j ofP Q corresponding to λ j , which are continuous by the assumed continuity of kernels. Moreover, due to (45), for every elementφ j of this basis, the continuous functions φ j = Tl Qφ j =l Qφj and φ j = T −1 l Qφ j =φ j /l Q are eigenfunctions of P Q and P * Q , respectively, corresponding to the same eigenvalue λ j . The sets {φ j } ∞ j=0 and {φ j } ∞ j=0 are then (non-orthogonal) bases of H Ω satisfying the bi-orthogonality relation φ i , φ j HΩ = δ ij . In particular, every f ∈ H Ω can be uniquely expanded as f = ∞ j=0 c j φ j with c j = φ j , f HΩ , and we have P Q f = ∞ j=0 λ j c j φ j .

C.3 Behavior in the infinite-delay limit
In this section we establish that the covariance and Gaussian kernels k Q in (17)- (19), as well as the Markov kernels in Section 3.3, converge to well-defined, shift-invariant limits in the infinite-delay (Q → ∞) limit, in accordance with the conditions for VSA kernels listed in Section 3.2. Since all of these kernels are based on distances between datapoints in delay-coordinate space under the mapsF Q from (14), we begin by considering the family of distance-like functions d Q : Ω × Ω → R, with Q ∈ N and This family of functions has the following important property: Proposition 5. Suppose that the sampling interval τ is such that there exists no eigenfrequency α j of the generator V such that e iαj τ = 1. Then, the sequence d 1 , d 2 , . . . converges in H Ω ×H Ω norm to a τ -independent Proof. Let ω = (x, y) and ω = (x , y ) with x, x ∈ X and y, y ∈ Y be arbitrary points in Ω. The H Ω ⊗ H Ω convergence of d Q to d ∞ follows from the Von Neumann mean ergodic theorem and the fact that is a Birkhoff average under the product dynamical systemΦ qτ ⊗Φ qτ on Ω×Ω of the function d 2 1 : Ω×Ω → R, which is bounded on the compact support of the corresponding invariant measure ρ × ρ.
Next, let {Ũ t ⊗U t } t∈R be the strongly continuous unitary group induced byΦ t ×Φ t . Denote the generator of this group byV . To establish τ -independence and invariance of d ∞ underŨ t ⊗Ũ t , it suffices to show that d ∞ lies in the nullspace ofV . Indeed, by invariance of infinite Birkhoff averages, we haveŨ τ ⊗Ũ τ d ∞ = d ∞ , i.e., d ∞ is an eigenfunction ofŨ τ ⊗Ũ τ at eigenvalue 1, and by the condition on τ stated in the proposition and the fact that U t ,Ũ t , andŨ t ⊗Ũ t have the same eigenvalues (see Appendix A), this implies that d ∞ is an eigenfunction ofV at eigenvalue zero.
Note that the condition on τ in Proposition 5 holds for Lebesgue almost every τ ∈ R, as the set of Koopman eigenfrequencies α j is countable.
An immediate consequence of Proposition 5 is that given any continuous shape function h : R → R, the kernel k Q : Ω × Ω → R with k Q (ω, ω ) = h(d Q (ω, ω )) satisfies the conditions listed in Section 3.2. In particular, setting h to a Gaussian shape function, h(s) = e −s 2 / , with > 0, shows that the Gaussian kernel in (18) has the desired properties. That the covariance kernel in (17) also has these properties follows from an analogous result to Proposition 5 applied directly to the kernel k Q , which in this case is equal to a Birkhoff average of k 1 .
Next, we turn to the family of kernels in (19) utilizing scaled distances. These kernels have the general form where h : R → R is continuous, so by Proposition 5, the required properties will follow if it can be shown that the sequence of scaling functions s 1 , s 2 , . . . converges in H Ω norm to a bounded function s ∞ ∈ L ∞ (Ω, ρ). That this is indeed the case for the choice of scaling functions described in Appendix C.1 follows from the facts that (i) the local density function σ Q in (41) is itself derived from an unscaled Gaussian kernelk Q , which was previously shown to meet the required conditions, and (ii) the phase space velocity function ξ Q from (42) is equal to a Birkhoff average of a continuous function. Finally, the class of Markov kernels from Section 3.3 meets the necessary conditions because the diffusion maps normalization function r Q is a continuous function determined by action of K Q on a constant function, thus converging as Q → ∞ to aŨ t -invariant function by the previous results on k Q , and similarly l Q is determined by action of K Q on 1/r Q (see Appendix C.2).

D Data-driven approximation
In this appendix, we present a proof of Theorem 4 for the most general class of integral operators on H Ω studied in this paper, namely the Markov operators P Q associated with the kernels k Q : Ω×Ω → R from (19) utilizing the distance scaling functions s Q in Appendix C.1, followed by Markov normalization as described in Section 3.3 and Appendix C.2. The convergence results for the operators not employing distance scaling and/or Markov normalization follow by straightforward modification of the arguments below.

D.1 Data-driven Markov kernels
Because the distance scaling functions s Q involve integrals with respect to ρ and time derivatives, in a data-driven setting we must work with kernels k Q,N S : Ω × Ω → R + approximating k Q , where and s Q,N S ∈ C(Ω) are scaling functions approximating s Q . Our construction of s Q,N S follows closely that of s Q in (43); that is, we set where γ is the same exponent parameter as in (43), and σ Q,N S and ξ Q,N are continuous functions approximating σ Q and ξ Q , respectively. To construct σ Q,N S , we introduce the integral operatorK Q,N S : and define Moreover, following [22,31,67], in the data-driven setting we approximate the function ζ used in the definition of ξ Q in (42) by a continuous function ζ τ : Ω → R that provides a finite-difference approximation of ζ with respect to the sampling interval τ . As a concrete example, we consider a backward difference scheme, which, under the assumed differentiability properties of F y (see Appendix C.1.2), converges as τ → 0 to ζ, uniformly on compact sets (in particular, V). We will also consider that the sampling interval is specified as a function τ (N ) such that τ (N ) → 0 and N τ (N ) → ∞ as N → ∞. With this assumption, the limit N → ∞ corresponds to infinitely short sampling interval (required for convergence of finite different schemes) and infinitely long total sampling time (required for convergence of ergodic averages). We define ξ Q,N as the function resulting by substituting ζ by ζ τ (N ) in (42). As we will establish in Appendix D.5, with these definitions, and under weak convergence of the measures ρ N S in (30), s Q,N S converges to s Q uniformly on the compact set V. Next, we construct the Markov kernels p Q,N S : Ω × Ω → R of the operators P Q,N S : H Ω,N S → H Ω,N S in the statement of the theorem by applying diffusion maps normalization to k Q,N S as in Appendix C.2, viz.
where K Q,N S is the kernel integral operator on H Ω,N S associated with k Q,N S . Note that r Q,N S and l Q,N S are positive, continuous functions on Ω, bounded away from zero on compact sets.

D.2 Proof of Theorem 4
We will need the important notion of compact convergence of operators on Banach spaces [20,48].
Definition 6. A sequence of bounded operators T n : E → E on a Banach space E is said to converge compactly to a bounded operator T : E → E if T n converges to T pointwise (i.e., T n f → T f for all f ∈ E), and for every bounded sequence of vectors f n ∈ E, the sequence g n = (T n − T )f n has compact closure (equivalently, g n has a convergent subsequence).
Compact convergence is stronger than pointwise convergence, but weaker than convergence in operator norm. For our purposes, it is useful as it is sufficient to imply convergence of isolated eigenvalues of bounded operators [48], and hence convergence of nonzero eigenvalues of compact operators and their corresponding eigenspaces. In particular, Theorem 4 is a corollary of the following theorem, proved in Appendix D.3: Under the assumptions of Theorem 4, the following hold: (a)P Q,N S andP Q are both compact operators on C(V). As a result, their nonzero eigenvalues have finite multiplicities, and accumulate only at zero.
(b) As N, S → ∞,P Q,N S converges compactly toP Q .
(c) λ j is a nonzero eigenvalue ofP Q if and only if it is a nonzero eigenvalue of P Q . Moreover, if φ j is an eigenfunction of P Q corresponding to that eigenvalue, thenφ j ∈ C(V) with is an eigenfunction ofP Q corresponding to the same eigenvalue. Analogous results hold for every nonzero eigenvalue of λ j,N S , ofP Q,N S , and corresponding eigenfunctions φ j,N S andφ j,N S of P Q,N S andP Q,N S , respectively, whereφ j,N S (ω) = Ω p Q,N S (ω, ω ) dρ N S (ω ). (49) To verify that Theorem 7 indeed implies Theorem 4 (with K Q replaced by P Q and K Q,N S by P Q,N S ), note that since the eigenvalue λ j in the statement of Theorem 4 is nonzero, it follows from Theorem 7(c) that it is an eigenvalue ofP Q , and thatφ j from (48) is a corresponding eigenfunction. Moreover, since, by Theorem 7(b),P Q,N S converges toP Q compactly (and thus in spectrum for nonzero eigenvalues [48]), there exist N 0 , S 0 ∈ N such that the j-th eigenvalues λ j,N S ofP Q,N S are all nonzero for N ≥ N 0 and S ≥ S 0 , and thus, by Theorem 7(c), they are eigenvalues of P Q,N S converging to λ j , as claimed in Theorem 4. The existence of eigenfunctions φ j,N S of P Q,N S corresponding to λ j,N S , such thatφ j,N S from (49) converges uniformly to φ j is shown in [30]. This completes our proof of Theorem 4.

D.3 Proof of Theorem 7
Our proof of Theorem 7 draws heavily on the spectral convergence results on data-driven kernel integral operators established in [20,30], though it requires certain modifications appropriate for the class of kernels in (19) utilizing scaled distances, which, to our knowledge, have not been previously discussed. In what follows, we provide explicit proofs of Claims (a) and (b) of the theorem; Claim (c) is a direct consequence of the definition ofφ j,N S in (49). Throughout this section, all operators will act on the Banach space of continuous functions on V equipped with the uniform norm, · C(V) . Therefore, for notational simplicity, we will drop the tildes from our notation forP Q andP Q,N S . We will also drop S subscripts representing the number of sampled points in the spatial domain Y ,with the understanding that N → ∞ limits correspond to S → ∞ followed by N → ∞ limits.
We now return to the proof of Claim (a). First, that P Q,N is compact follows immediately from the fact that it has finite rank. Showing that P Q is compact is equivalent to showing that for any bounded sequence f n ∈ C(V), the sequence g n = P Q f n has a limit point in the uniform norm topology. Since V is compact, it suffices to show that g n is equicontinuous and bounded; in that case, the existence of a limit point of g n is a consequence of the Arzelà-Ascoli theorem. Indeed, for any ω ∈ V, we have where B = sup n f n C(V) . This shows that g n is uniformly bounded. Similarly, we have |g n (ω) − g n (ω )| ≤ p Q (ω, ·) − p Q (ω , ·) C(V) f n C(V) , and the equicontinuity of {g n } follows from Lemma 8. It therefore follows from the Arzelà-Ascoli theorem that g n has a limit point, and thus that P Q is compact, as claimed.

D.5 Proof of Claim (b)
According to Definition 6, we must first show that for every f ∈ C(V), P Q,N f converges to P Q f in the uniform norm; that is, we must show that lim N →∞ η N = 0, where Defining the Markov kernelp Q,N : Ω × Ω → R + , and the operatorsP Q,N : C(V) → C(V) with That is, we can bound η N by the sum of contributions due to (i) errors in approximating integrals with respect to the invariant measure ρ by the sampling measure ρ N (the first term in the right-hand side); (ii) errors in approximating the left and right normalization functions l Q and r Q by their data-driven counterparts, l Q,N and r Q,N , respectively (the second term in the right-hand side); and (iii) errors in approximating the kernel k Q by the data-driven kernel k Q,N (the third term in the right-hand side). We first consider the first term, By the weak convergence of the measures ρ N to ρ (see (30)) in conjunction with the continuity of p Q , it follows thatP Q,N f (ω) converges to P N f (ω), pointwise with respect to ω ∈ V; however, it is not necessarily the case that the convergence is uniform. For the latter, we need the stronger notion of a Glivenko-Cantelli class.
Definition 9. Let E : C(V) → C and E N : C(V) → C, be the expectation operators with respect to the measures ρ N and ρ, respectively, i.e., Then, a set of functions F ∈ C(V) is said to be a Glivenko-Cantelli class if Note, in particular, that if the set can be shown to be a Glivenko-Cantelli class, then it will follow that P Q,N f −P Q f C(V) vanishes as N → ∞. That this is indeed the case follows from Proposition 11 in [20]. Next, we turn to the second and third terms in (51). To bound these terms, we first establish convergence of the data-driven distance scaling functions s Q,N to s Q . Lemma 10. Restricted to V, the scaling functions s Q,N from Appendix C.1 converge uniformly as N → ∞ to s Q .
Thus, since ξ Q,N converges uniformly to ξ Q by continuous differentiability of the observation map F on the compact set V (see Appendix C.1.2), s Q,N will converge uniformly to s Q if σ Q,N converges uniformly to σ Q . Indeed, because |σ Q,N (ω) − σ Q (ω)| = |Ek Q (ω, ·) − E NkQ (ω, ·)|, this will be the case if the set F 2 = {k Q (ω, ·) | ω ∈ V} is a Glivenko-Cantelli class. This follows from similar arguments as those used to establish that F 1 is Glivenko-Cantelli.
Lemma 10, in conjunction with the continuity of the kernel shape function used throughout this work (see Section 3.1) implies in the following: Corollary 11. The data-driven kernel k Q,N converges uniformly to k Q ; that is, lim N →∞ k Q,N − k Q,N C(V×V) = 0.
We now proceed to bound the second term in (51), P Q,N f −P Q,N f C(V) . It follows from the definition of the kernels p Q,N andp Q,N via (47) and (50), respectively, that .
By our assumptions on kernels stated in Section 3.2, the functions l Q , r Q , l Q,N , and r Q,N are bounded away from zero on V, uniformly with respect to N . Therefore, there exists a constant c > 0, independent of N , such that P Q,N f −P Q,N f C(V) ≤ c k Q C(V×V) f C(V) l Q ⊗ r Q − l Q,N ⊗ r Q,N C(V×V) = c k Q C(V×V) f C(V) l Q − l Q,N C(V) r Q − r Q,N C(V) .
Observe now that if it can be shown that F 3 = {k Q (ω, ·) | ω ∈ V} is a Glivenko-Cantelli class. The latter can be verified by means of similar arguments as those used to establish that F 1 is Glivenko-Cantelli. Equation (52), in conjunction with the fact that l Q − l Q,N C(V) is bounded, is sufficient to deduce that lim N →∞ P Q,N f −P Q,N f C(V) = 0.
We now turn to the third term in (51), P Q,N f − P Q,N f C(V) . We have P Q,N f − P Q,N f C(V×V) ≤ p Q,N − p Q,N C(V×V) f C(V) , and it follows from the definitions ofp Q,N and p Q,N , in conjunction with the fact that the normalization functions l Q,N and r Q,N are both uniformly bounded away from zero, that there exists a constant c such that P Q,N f − P Q,N f C(V×V) ≤ c k Q − k Q,N C(V×V) f C(V) .
Thus, the convergence of P Q,N f − P Q,N f V to zero follows from Corollary 11. In summary, we have shown that P N f −P Q,N f C(V) , P Q,N f −P Q,N f C(V) , and P Q,N f − P Q,N f C(V) all converge to zero, which is sufficient to conclude that lim N →∞ η N = 0, and that that P Q,N f converges to P Q f . According to Definition 6, it remains to show that for any bounded sequence f N ∈ C(V), the sequence g N = (P Q,N − P Q )f N has a limit point. This can be proved by an Arzelà-Ascoli argument as in the proof of Claim (a) in conjunction with Glivenko-Cantelli arguments as in the proof of pointwise convergence above. We refer the reader to Proposition 13 in [20] for more details. This completes our proof of Claim (b).