tgEDMD: Approximation of the Kolmogorov Operator in Tensor Train Format

Extracting information about dynamical systems from models learned from simulation data has become an increasingly important research topic in the natural and engineering sciences. Modeling the Koopman operator semigroup has played a central role in this context. As the approximation quality of any such model critically depends on the basis set, recent work has focused on deriving data-efficient representations of the Koopman operator in low-rank tensor formats, enabling the use of powerful model classes while avoiding over-fitting. On the other hand, detailed information about the system at hand can be extracted from models for the infinitesimal generator, also called the Kolmogorov backward operator for stochastic differential equations. In this work, we present a data-driven method to efficiently approximate the generator using the tensor train (TT) format. The centerpiece of the method is a TT representation of the tensor of generator evaluations at all data sites. We analyze consistency and complexity of the method, present extensions to practically relevant settings, and demonstrate its applicability to benchmark numerical examples.


Introduction
Learning models for complex dynamical systems based on simulation or measurement data has become a vibrant and successful research field, with applications in fluid dynamics, engineering, biophysics, economics, and many others. A significant body of work revolves around the statistical description of a dynamical system using the Koopman operator, see [1,2,3,4,5,6,7,8]. Major algorithmic frameworks originating from this line of research include, for example, (extended) dynamic mode decomposition, or (E)DMD [9,10], Markov state modeling [2,3,11,12], the variational approach to conformational dynamics (VAC) [13,14] and its generalization, the variational approach to Markov processes (VAMP) [15]. In this work, we:

1. derive a TT representation formula for tensors which are given as a certain structured sum of rank-one tensors, extending a result from [40]. This representation is then used to deduce the TT structure of the tensor of generator evaluations at all data sites, which is the missing piece when translating existing methods to the approximation of the generator. We derive separate representations for reversible and general SDE systems.
3. show that basis sets defined on reduced variables can be used to approximate the generator of an effective dynamics in TT format, building on results from Refs. [41,34].

4. illustrate the capabilities of the proposed methods by approximating the generator of a four-dimensional model system, and the effective generator for molecular dynamics of the deca-alanine peptide on a set of torsion angle coordinates.
The rest of the paper is structured as follows: basic concepts regarding the Koopman framework, data-driven approximation, the AMUSE algorithm, and tensor trains are introduced in Section 2. The general TT representation formula and the TT structures of the generator data tensor are derived in Section 3. The main algorithm, its analysis, and the extensions are presented in Section 4. Numerical examples follow in Section 5.

Kolmogorov Backward Operator
In this paper, we are concerned with the numerical approximation of the Kolmogorov backward operator

    Lφ(x) = b(x)ᵀ ∇φ(x) + (1/2) a(x) : ∇²φ(x),    (1)

where b ∈ R^d is a vector field and a ∈ R^{d×d} is a field of symmetric positive semi-definite matrices (that is, vᵀ a(x) v ≥ 0 for all x and v ∈ R^d). Moreover, ∇φ(x) ∈ R^d is the gradient of the scalar function φ, ∇²φ(x) is the Hessian matrix of its second-order derivatives, and the colon : denotes the Frobenius inner product, A : B = tr(Aᵀ B), between matrices. The differential operator L is closely linked to the statistical description of dynamical systems which are driven by a stochastic differential equation (SDE). For t ≥ 0, let X_t be a stochastic process on a state space X ⊂ R^d, subject to the governing equation

    dX_t = b(X_t) dt + σ(X_t) dW_t.    (2)

In this context, the field b is called the drift, while the matrix field σ is called the diffusion, which must be such that σσᵀ = a. Furthermore, W_t in (2) is the Brownian motion in d-dimensional space. Equation (2) can be understood in the sense that the infinitesimal change of the process X_t is proportional to the drift field b, but also subject to the random fluctuations of the Brownian motion W_t, modulated by σ. See Ref. [42] for a rigorous introduction to the dynamics of type (2). The statistics of the process X_t are then captured by a family of linear operators, called the Koopman semigroup [1,4]. For t ≥ 0 and a function φ on X (usually called an observable of the system), the linear operator K_t acts as

    K_t φ(x) = E[φ(X_t) | X_0 = x].

The Kolmogorov operator L arises as the infinitesimal generator of this semigroup, that is, for suitable functions φ, we have

    Lφ = lim_{t→0} (1/t) (K_t φ − φ).    (3)

As described in the introduction, knowledge of the Koopman semigroup, and especially its generator, can be utilized for a thorough statistical analysis of the dynamical system X_t. Before discussing its numerical approximation, let us point out that the precise mathematical properties of the differential operator L and its domain of definition can be quite intricate.
As these technical details are mostly irrelevant for the present study, we just briefly lay out a standard setting to serve as a guideline for the following, see [42,43] for an in-depth discussion of this setting.
(A1) The coefficients b, σ of the SDE (2) are such that global existence and uniqueness of solutions with continuous paths are guaranteed.
(A2) There is a unique, everywhere positive invariant density ρ for (2), with associated probability measure µ. Invariance of the measure µ can be characterized by the identity, for all φ and t ≥ 0:

    ∫_X K_t φ dµ = ∫_X φ dµ.

It follows that the operators K_t form a strongly continuous contraction semigroup on the Hilbert space of square-integrable functions with respect to the invariant measure µ:

    L²_µ(X) = {φ : X → R : ∫_X |φ|² dµ < ∞}.

The inner product for φ, φ̃ ∈ L²_µ(X) is given by ⟨φ, φ̃⟩_µ = ∫_X φ φ̃ dµ. The domain D(L) of functions where the derivative (3) exists is a dense subspace of L²_µ(X). If in addition the dynamics (2) are reversible with respect to the invariant measure µ, that is, the detailed balance condition P_µ(X_0 ∈ A, X_t ∈ B) = P_µ(X_0 ∈ B, X_t ∈ A) holds for all A, B ⊂ X and t ≥ 0, then the generator L is self-adjoint on its domain. Moreover, the following important relation holds for all φ, φ̃ ∈ D(L):

    ⟨Lφ, φ̃⟩_µ = −(1/2) ∫_X (σᵀ∇φ)ᵀ (σᵀ∇φ̃) dµ = −(1/2) ∫_X ∇φᵀ a ∇φ̃ dµ.    (4)

Galerkin Projection and Data-driven Estimation
Next, we consider the approximation of L on a finite-dimensional subspace V, spanned by twice continuously differentiable functions contained in the domain, i.e. V ⊂ D(L) ∩ C²(X). Choosing V in this way ensures that Lφ(x) can be evaluated in the form (1) for all φ ∈ V. If a basis set ψ = [ψ_1, ..., ψ_n]ᵀ for V is given, we will briefly write ψ(x) = [ψ_1(x), ..., ψ_n(x)]ᵀ ∈ R^n for the evaluation of all basis functions at a point x, and similarly Lψ(x) ∈ R^n for the application of L to all basis functions at once. Moreover, if X = [x_1, ..., x_m] ∈ R^{d×m} is a collection of data points in X, we write

    Ψ(X) = [ψ(x_1) ··· ψ(x_m)] ∈ R^{n×m},    LΨ(X) = [Lψ(x_1) ··· Lψ(x_m)] ∈ R^{n×m},

and call these the data matrix and generator data matrix, respectively.

The orthogonal projection of L onto V can be represented by the matrix (w.r.t. the basis ψ):

    M = C(ψ)⁻¹ A(ψ),    C(ψ)_{u,v} = ⟨ψ_u, ψ_v⟩_µ,    A(ψ)_{u,v} = ⟨ψ_u, Lψ_v⟩_µ.    (5)

As the integrals appearing in (5) are expectation values with respect to the probability measure µ, they can be approximated by Monte Carlo methods. From now on, we assume that x_l, l ∈ N, is a sequence of random variables such that for all integrable functions φ ∈ L¹_µ(X) and almost all realizations of the x_l we have:

    lim_{m→∞} (1/m) Σ_{l=1}^m φ(x_l) = ∫_X φ dµ.    (6)

Equation (6) is of course satisfied if all x_l are i.i.d. with respect to µ. By the ergodic theorem [44], the same conclusion holds if x_l is taken as the l-th step of any discretized trajectory of the dynamics (2), provided the initial condition is drawn from µ. As a consequence, for any finite realization x_1, ..., x_m of the random variables x_l, we obtain an empirical Galerkin matrix M̂ = Ĉ(ψ)⁻¹ Â(ψ), where

    Ĉ(ψ) = (1/m) Ψ(X) Ψ(X)ᵀ,    (7)
    Â(ψ) = (1/m) Ψ(X) LΨ(X)ᵀ.    (8)

For the reversible case, the matrix Â(ψ) can be calculated in a simpler form based on (4), requiring only first-order derivatives of the basis set. Define dψ(x) = [∇ψ(x) σ(x)] ∈ R^{n×d}, using the Jacobian ∇ψ ∈ R^{n×d}. Then, introduce a modified generator data matrix dΨ(X) = [dψ(x_1) ··· dψ(x_m)] ∈ R^{n×dm}. It follows from (4) that A(ψ) can be approximated by

    Â^rev(ψ) = −(1/2m) dΨ(X) dΨ(X)ᵀ,    (9)

noting that the summation extends over both the data size and the state space dimension. This conceptually simple way of approximating the Kolmogorov operator based on data using (7 -9) has been labeled generator extended dynamic mode decomposition (gEDMD) in [34].
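To make the empirical estimators (7 -9) concrete, the following minimal sketch (an illustration chosen here, not an example from the paper) applies gEDMD to a one-dimensional Ornstein-Uhlenbeck process with drift b(x) = −x and diffusion σ = √2, whose invariant measure is µ = N(0,1) and whose generator acts as Lφ = −xφ' + φ''. The two Hermite polynomials x and x² − 1 are exact eigenfunctions with eigenvalues −1 and −2, which the empirical Galerkin matrix should recover:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
# i.i.d. samples from the invariant measure mu = N(0, 1)
X = rng.standard_normal(m)

# basis: probabilists' Hermite polynomials H1 = x, H2 = x^2 - 1
Psi = np.vstack([X, X**2 - 1.0])                 # data matrix
# generator applied to the basis: L phi = -x phi' + phi''
LPsi = np.vstack([-X, -2.0 * (X**2 - 1.0)])      # generator data matrix

C = Psi @ Psi.T / m           # empirical Gramian, cf. Eq. (7)
A = Psi @ LPsi.T / m          # empirical stiffness matrix, cf. Eq. (8)
M = np.linalg.solve(C, A)     # empirical Galerkin matrix of L

# eigenvalues should be close to -1 and -2
print(np.sort(np.linalg.eigvals(M).real))
```

With this amount of data, the Monte Carlo error in the estimated eigenvalues is on the order of one percent.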

AMUSE Algorithm
The calculation of the Gramian matrix Ĉ(ψ) and its inverse may be avoided by means of the AMUSE Algorithm 1 [17,7], which also serves as a tool for dimension reduction. The rationale behind AMUSE is that, using the truncated SVD Ψ(X) ≈ Û_r Σ̂_r V̂_rᵀ computed in line 1, the transformed basis η̂_r(x)ᵀ := ψ(x)ᵀ Û_r Σ̂_r⁻¹ is empirically orthonormal, hence computation of the Gramian matrix is no longer necessary. In order to compute a representation of the generator on the linear span of η̂_r, the same transformation just needs to be applied to Â(ψ), which is what happens in lines 2 and 3 of Algorithm 1.

Algorithm 1 AMUSE
Input: data matrix Ψ(X), generator data matrix LΨ(X) or dΨ(X).
Output: reduced matrix representation M̂_r ∈ R^{r×r} or M̂_r^rev ∈ R^{r×r} of L, where r ≤ n.
1: Compute rank-r SVD of Ψ(X), i.e., Ψ(X) ≈ Û_r Σ̂_r V̂_rᵀ.
2: (Non-reversible case): Compute reduced matrix M̂_r by (10).
3: (Reversible case): Compute reduced matrix M̂_r^rev by (11).

The reduced matrices M̂_r and M̂_r^rev are given by the following expressions, which do not require assembly of the full Galerkin matrices:

    M̂_r = V̂_rᵀ LΨ(X)ᵀ Û_r Σ̂_r⁻¹,    (10)
    M̂_r^rev = −(1/2) Σ̂_r⁻¹ Û_rᵀ dΨ(X) dΨ(X)ᵀ Û_r Σ̂_r⁻¹.    (11)

If one is interested in consistency of the reduced matrix in the infinite data limit, the spectral decomposition of the analytical Gramian needs to be considered. Denote this spectral decomposition by C(ψ) = U Σ² Uᵀ, and for any r ≤ n write C(ψ) = U_r Σ_r² U_rᵀ + E_r for its truncation using r terms, with error term E_r. Now, if the reduced dimension r in Algorithm 1 is fixed, it can be shown that, unless the eigenvalue σ_r² of C(ψ) is degenerate, the linear span of η̂_r converges almost surely to that of η_r := ψ(x)ᵀ U_r Σ_r⁻¹, along with the projected generator on this space (see [33, Proposition 3] for a proof). Alternatively, one can start with a fixed truncation parameter ε > 0. Assume there is an r(ε) such that σ_{r(ε)} > ε > σ_{r(ε)+1} in the spectral decomposition of C(ψ). Then, if the reduced rank in Algorithm 1 is chosen such that σ̂_r ≥ √m ε > σ̂_{r+1}, it follows by the same arguments that r → r(ε) almost surely as m → ∞, recalling that Ĉ(ψ) = (1/m) Ψ(X) Ψ(X)ᵀ. Hence, M̂_r, M̂_r^rev provide consistent estimates of the projected generator on the linear span of η_{r(ε)}. This data-dependent truncation strategy will also be used in the multi-linear case, see the discussion in Section 4.1 below.
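The AMUSE steps can be sketched in a few lines of numpy. As a hypothetical illustration (again assuming a one-dimensional Ornstein-Uhlenbeck process with Lφ = −xφ' + φ'' and µ = N(0,1), which is not an example from the paper), the basis below is deliberately made linearly dependent, so the SVD truncation reduces the dimension from n = 3 to r = 2 while preserving the spectrum of the reduced matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100_000
X = rng.standard_normal(m)        # i.i.d. samples from mu = N(0, 1)

# deliberately redundant basis: the third function is a multiple of the first
Psi = np.vstack([X, X**2 - 1.0, 2.0 * X])
LPsi = np.vstack([-X, -2.0 * (X**2 - 1.0), -2.0 * X])   # L phi = -x phi' + phi''

# line 1: truncated SVD, data-dependent threshold sigma_r >= sqrt(m) * eps
U, s, Vt = np.linalg.svd(Psi, full_matrices=False)
eps = 1e-6
r = int(np.sum(s >= np.sqrt(m) * eps))        # here r = 2 < n = 3
U_r, s_r, V_r = U[:, :r], s[:r], Vt[:r, :].T

# line 2 (non-reversible case): reduced matrix as in Eq. (10)
M_r = V_r.T @ LPsi.T @ U_r @ np.diag(1.0 / s_r)
print(r, np.sort(np.linalg.eigvals(M_r).real))
```

The eigenvalues of the reduced matrix again approximate the generator eigenvalues −1 and −2, even though the Gramian of the redundant basis is singular.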

Tensor Train Format
The choice of the finite-dimensional subspace V in Section 2.2 is of critical importance to the quality of the gEDMD approximation. A widely used approach to arrive at a large space with rich approximation properties is to consider all products of functions contained in a selection of elementary subspaces of moderate dimension. Many of the quantities introduced above will be endowed with a tensor structure in this case. For our purposes, it is sufficient to think of tensors as multi-dimensional arrays, that is, array entries are labeled by p different indices:

    T = (T_{u_1,...,u_p}) ∈ R^{n_1×···×n_p},    u_k = 1, ..., n_k.

The number of indices p is called the order of the tensor, and the individual indices are often called modes, with mode sizes n_k. A tensor can be unfolded into a matrix by grouping any selection I ⊂ {1, ..., p} of tensor indices along rows and all remaining tensor indices along columns. We denote this matrix by T|_I. Specifically, if I consists of the first k indices, we just write T|_k and call the resulting matrix the mode-k unfolding.
The tensor product of p vectors v^(k) ∈ R^{n_k} is defined by generalizing the standard outer product between two vectors:

    (v^(1) ⊗ ··· ⊗ v^(p))_{u_1,...,u_p} = v^(1)_{u_1} ··· v^(p)_{u_p}.

A tensor of this type is a rank-one tensor. Every tensor can be written as a linear combination of finitely many rank-one tensors, but their number may be exponentially large. However, many tensors appearing in applications can be (approximately) represented by a much smaller number of parameters. Many different parametrizations, typically called tensor formats, have been put forward.
Here, we focus exclusively on the tensor train format (TT format), introduced in [27,28], where a tensor is represented as a contraction of multiple lower-order tensors:

    T_{u_1,...,u_p} = Σ_{s_0=1}^{r_0} ··· Σ_{s_p=1}^{r_p} T^(1)_{s_0,u_1,s_1} T^(2)_{s_1,u_2,s_2} ··· T^(p)_{s_{p−1},u_p,s_p}.    (12)

The tensors T^(k) ∈ R^{r_{k−1}×n_k×r_k} of order 3 are called TT cores and the numbers r_k are called TT ranks. It holds that r_0 = r_p = 1 and r_k ≥ 1 for k = 1, ..., p − 1. A TT core is left-orthonormal if

    Σ_{u_k=1}^{n_k} (T^(k)_{:,u_k,:})ᵀ T^(k)_{:,u_k,:} = Id_{r_k}.

Remark 2.2. Sometimes, we will also allow the first or final rank to be greater than one, i.e. r_0 > 1 or r_p > 1. In this case, we understand T to be encoding multiple tensors, one for each pair of values of s_0 and s_p in (12).
The following formal representation for tensor trains has proven particularly useful. For a given tensor train T with cores T^(k) ∈ R^{r_{k−1}×n_k×r_k}, a single core is written as a two-dimensional matrix containing vectors as elements:

    T^(k) = [ T^(k)_{s_{k−1},:,s_k} ]_{s_{k−1}=1,...,r_{k−1}; s_k=1,...,r_k}.    (13)

We then represent T by a formal matrix product, called the rank-core product in [45]:

    T = T^(1) ⊗ T^(2) ⊗ ··· ⊗ T^(p).

In this representation, we compute the tensor products of the matrix elements (which are vectors instead of scalars) and then sum over the columns and rows, just as for an ordinary matrix product. It is readily verified that the result is the same as (12). Further below, we will also require a tensor product for matrices, which is known as the Kronecker product and denoted by the same symbol. For matrices A ∈ R^{m×n}, B ∈ R^{r×q}, its block-wise definition is

    A ⊗ B = [ a_{i,j} B ]_{i=1,...,m; j=1,...,n} ∈ R^{mr×nq}.    (14)

The Kronecker product is equivalent to the standard tensor product applied to vectorized versions of the matrices A, B, but using a non-standard ordering of their indices, see [46] for details.
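The contraction (12) and the rank bound on unfoldings are easy to verify numerically. The sketch below (illustrative only, not part of any reference implementation) builds a random tensor train, contracts it to a full tensor with einsum, and checks that the TT rank r_k bounds the rank of the mode-k unfolding:

```python
import numpy as np

rng = np.random.default_rng(2)
n = [3, 4, 5]                     # mode sizes n_1, n_2, n_3
r = [1, 2, 3, 1]                  # TT ranks r_0, ..., r_3
cores = [rng.standard_normal((r[k], n[k], r[k + 1])) for k in range(3)]

# contract the tensor train, cf. Eq. (12); the boundary rank indices
# a (= s_0) and d (= s_3) have size one and are summed out
T = np.einsum('aib,bjc,ckd->ijk', *cores)

# the TT ranks bound the matrix ranks of the unfoldings T|_k
assert np.linalg.matrix_rank(T.reshape(n[0], -1)) <= r[1]          # mode-1
assert np.linalg.matrix_rank(T.reshape(n[0] * n[1], -1)) <= r[2]   # mode-2

# storage: 45 core entries represent all 60 entries of the full tensor
print(sum(c.size for c in cores), T.size)
```

For larger orders p the gap between core storage and full storage grows exponentially, which is the point of the format.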

Basis Decomposition and Global SVD
We are now prepared to formulate the tensor-based approximation framework for the Kolmogorov operator. Suppose that for each k = 1, ..., p an elementary subspace V_k = span{ψ^(k)_1, ..., ψ^(k)_{n_k}} ⊂ C²(X) is given, and that the space V spanned by all product functions

    Ψ_{u_1,...,u_p}(x) = ψ^(1)_{u_1}(x) ··· ψ^(p)_{u_p}(x),    u_k = 1, ..., n_k,

is also contained in D(L). We denote the basis formed by all the products above by Ψ (assuming their linear independence). The evaluation of Ψ at x ∈ X is a rank-one tensor of order p,

    Ψ(x) = ψ^(1)(x) ⊗ ··· ⊗ ψ^(p)(x),

where ψ^(k)(x) = [ψ^(k)_1(x), ..., ψ^(k)_{n_k}(x)]ᵀ, while the analogue of the data matrix Ψ(X) is a tensor of order p + 1, which we call the data tensor:

    Ψ(X) = Σ_{l=1}^m Ψ(x_l) ⊗ e_l ∈ R^{n_1×···×n_p×m}.

Tensor-based counterparts LΨ(x), LΨ(X), and dΨ(x), dΨ(X) of the generator data matrices can be defined the same way. Unless the n_k and p are very small, these data tensors cannot even be held in memory, let alone be used to compute the Galerkin matrices in (7 -9). Therefore, we follow the lines of the AMUSE approach in Section 2.3, but notice that we cannot compute an SVD of the unfolded data tensor Ψ(X)|_p either, necessitating further modifications. As shown in [47], the data tensor is a sum of m rank-one tensors, and hence possesses a TT representation with all ranks equal to m, and the following TT cores (using the rank-core notation introduced in (13)):

    Ψ(X)^(1) = [ψ^(1)(x_1) ··· ψ^(1)(x_m)],
    Ψ(X)^(k) = diag(ψ^(k)(x_1), ..., ψ^(k)(x_m)),    k = 2, ..., p,    (15)
    Ψ(X)^(p+1) = [e_1 ··· e_m]ᵀ.

In the final core above, the e_l denote the canonical unit vectors in R^m. Moreover, by a sequence of SVDs applied to the cores of (15), a multi-linear approximation of the true SVD for Ψ(X)|_p can be computed, called the global SVD [32], replacing the first step in Algorithm 1.
The result is an SVD-like decomposition

    Ψ(X)|_p ≈ Û|_p Σ̂ V̂ᵀ.

Here, Û ∈ R^{n_1×···×n_p×r_p} is a tensor train with TT ranks r_1, ..., r_p, where r_p > 1 is allowed (see Remark 2.2), and left-orthonormal cores. Moreover, Σ̂ ∈ R^{r_p×r_p} is diagonal and non-negative, and V̂ ∈ R^{m×r_p} is orthogonal. As shown in [33], the columns of Û|_p Σ̂⁻¹ encode an empirically orthonormal basis of an r_p-dimensional subspace of V:

    η̂_{r_p}(x) := Σ̂⁻¹ (Û|_p)ᵀ Ψ(x)|_p ∈ R^{r_p}.

Analogous to the standard AMUSE algorithm, we can then compute an empirical representation of the generator on the linear span of η̂_{r_p}, by forming the products (see (10 -11)):

    M̂_{r_p} = V̂ᵀ (LΨ(X)|_p)ᵀ Û|_p Σ̂⁻¹,    (16)
    M̂_{r_p}^rev = −(1/2) Σ̂⁻¹ Σ_{l=1}^m [(Û|_p)ᵀ ∇Ψ(x_l)|_p] a(x_l) [(Û|_p)ᵀ ∇Ψ(x_l)|_p]ᵀ Σ̂⁻¹.    (17)

The computationally demanding part of (16 -17) is the multiplication of the unfolded tensors LΨ(X)|_p and ∇Ψ(x_l)|_p with Û|_p. The orthonormal factor Û is always given in tensor train format by construction. The main topic of this study is therefore the derivation of a tensor train representation of the first factor in each case, containing the action of the generator L (see Section 3), and the efficient calculation of the reduced matrix by contraction of a suitable tensor network (see Section 4).
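The TT structure (15) of the data tensor can be checked on a tiny example. The sketch below (with hypothetical monomial bases, chosen for illustration) assembles the three cores for p = 2 and m = 4 and verifies that their rank-core product reproduces Σ_l ψ^(1)(x_l) ⊗ ψ^(2)(x_l) ⊗ e_l:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n1, n2 = 4, 3, 2
X = rng.standard_normal(m)
psi1 = lambda x: np.array([1.0, x, x**2])     # elementary basis of V_1
psi2 = lambda x: np.array([1.0, x])           # elementary basis of V_2

# direct construction: Psi(X) = sum_l psi1(x_l) (x) psi2(x_l) (x) e_l
ref = np.zeros((n1, n2, m))
for l, x in enumerate(X):
    ref[:, :, l] = np.outer(psi1(x), psi2(x))   # the factor e_l is the index l

# TT cores of Eq. (15): all ranks equal to m
core1 = np.stack([psi1(x) for x in X], axis=-1)[None, :, :]   # 1 x n1 x m
core2 = np.zeros((m, n2, m))
for l, x in enumerate(X):
    core2[l, :, l] = psi2(x)                                  # diagonal core
core3 = np.eye(m)[:, :, None]                                 # rows are e_l

T = np.einsum('aib,bjc,ckd->ijk', core1, core2, core3)
assert np.allclose(T, ref)
```

The same pattern extends to arbitrary p, with one diagonal core per additional elementary subspace.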

Tensor Representations of the Generator Data Tensor
The goal of this section is to determine a tensor train representation of the generator data tensors LΨ(x) ∈ R^{n_1×···×n_p} and ∇Ψ(x) ∈ R^{n_1×···×n_p×d}, given entry-wise by

    LΨ(x)_{u_1,...,u_p} = L(ψ^(1)_{u_1} ··· ψ^(p)_{u_p})(x),    ∇Ψ(x)_{u_1,...,u_p,i} = ∂/∂x_i [ψ^(1)_{u_1} ··· ψ^(p)_{u_p}](x).    (18)

A General Representation Formula
As the Kolmogorov operator acts as a differential operator, its point-wise action on a product basis will be a sum of rank-one tensors, by the product rule. In anticipation of later results, we prove the following general representation formula, which is a straightforward extension of results shown in [40]:

Proposition 3.1. Let T ∈ R^{n_1×···×n_p} be given by

    T = Σ_{k=1}^p e_1 ⊗ ··· ⊗ e_{k−1} ⊗ f_k ⊗ e_{k+1} ⊗ ··· ⊗ e_p
      + Σ_{i=1}^d Σ_{k_1=1}^{p−1} Σ_{k_2=k_1+1}^p e_1 ⊗ ··· ⊗ g_{k_1,i} ⊗ ··· ⊗ h_{k_2,i} ⊗ ··· ⊗ e_p,    (19)

with vectors e_k, f_k, g_{k,i}, h_{k,i} ∈ R^{n_k} for k = 1, ..., p and i = 1, ..., d. Then T can be written as a tensor train T = T^(1) ⊗ ··· ⊗ T^(p), with all ranks equal to d + 2, and cores:

    T^(1) = [ e_1  f_1  g_{1,1} ··· g_{1,d} ],    (20)

    T^(k) = [ e_k   f_k       g_{k,1} ···  g_{k,d} ]
            [       e_k                            ]
            [       h_{k,1}   e_k                  ]    for k = 2, ..., p − 1,    (21)
            [        ⋮               ⋱            ]
            [       h_{k,d}              e_k       ]

    T^(p) = [ f_p  e_p  h_{p,1} ··· h_{p,d} ]ᵀ,    (22)

where all entries not shown in (21) are zero.

Proof. Define auxiliary tensors A^(k), B^(k)_i ∈ R^{n_k×···×n_p} for k = 1, ..., (p−1) and i = 1, ..., d by:

    A^(k) = Σ_{k'=k}^p e_k ⊗ ··· ⊗ f_{k'} ⊗ ··· ⊗ e_p + Σ_{i=1}^d Σ_{k≤k_1<k_2≤p} e_k ⊗ ··· ⊗ g_{k_1,i} ⊗ ··· ⊗ h_{k_2,i} ⊗ ··· ⊗ e_p,
    B^(k)_i = Σ_{k'=k}^p e_k ⊗ ··· ⊗ h_{k',i} ⊗ ··· ⊗ e_p.

We also define A^(p) = f_p and B^(p)_i = h_{p,i}. Note that A^(k) is similar to T, but all sums start from k instead of 1. In particular, we have that A^(1) = T. The following relation holds for k = 1, ..., (p−1), and is central for the proof:

    A^(k) = f_k ⊗ e_{k+1} ⊗ ··· ⊗ e_p + Σ_{i=1}^d g_{k,i} ⊗ B^(k+1)_i + e_k ⊗ A^(k+1).    (23)

The recursion (23) can be verified directly by expanding e_k ⊗ A^(k+1), and noting that the remaining terms in (23) are exactly what is missing to complete the expression for A^(k).
We also obtain a recursive relation for B^(k)_i:

    B^(k)_i = h_{k,i} ⊗ e_{k+1} ⊗ ··· ⊗ e_p + e_k ⊗ B^(k+1)_i.    (24)

Using (23 -24), we can verify that for k = 2, ..., p − 1:

    (A^(k), e_k ⊗ ··· ⊗ e_p, B^(k)_1, ..., B^(k)_d)ᵀ = T^(k) ⊗ (A^(k+1), e_{k+1} ⊗ ··· ⊗ e_p, B^(k+1)_1, ..., B^(k+1)_d)ᵀ.    (25)

By noting that

    T = A^(1) = T^(1) ⊗ (A^(2), e_2 ⊗ ··· ⊗ e_p, B^(2)_1, ..., B^(2)_d)ᵀ,    (26)

and repeated application of (25), we find T = T^(1) ⊗ T^(2) ⊗ ··· ⊗ T^(p).

Remark 3.2. Proposition 3.1 differs from the setting in [40] by adding the sums over i in the second line of (19). On the other hand, additional weights a_{k_1,k_2} in each term were considered in [40], as well as an additional sum over the terms in the second line, but with g_{k,i} and h_{k,i} swapped. The latter modification, without weights, is easily incorporated into (20 -22). However, this result is not needed in the following: If the definition of T additionally contains the terms

    Σ_{i=1}^d Σ_{k_1=1}^{p−1} Σ_{k_2=k_1+1}^p e_1 ⊗ ··· ⊗ h_{k_1,i} ⊗ ··· ⊗ g_{k_2,i} ⊗ ··· ⊗ e_p,

then T can be represented with cores of ranks 2d + 2 similar to (20 -22). In detail, T^(1) is extended by appending all terms h_{1,i}, while all g_{p,i} are appended to T^(p). For 2 ≤ k ≤ p − 1, all terms h_{k,i} are added to the first row of T^(k), while all g_{k,i} are added to the second column. The diagonal is just filled up with copies of the vector e_k.
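The representation of Proposition 3.1 can be validated numerically. The sketch below builds cores in the spirit of (20 -22) for random vectors e_k, f_k, g_{k,i}, h_{k,i} (internal states: "no special vector placed yet", "finished", and "pairwise term with index i open") and compares the resulting tensor train against the explicit structured sum (19):

```python
import numpy as np

rng = np.random.default_rng(4)
p, d, n = 3, 2, 4
e = [rng.standard_normal(n) for _ in range(p)]
f = [rng.standard_normal(n) for _ in range(p)]
g = [[rng.standard_normal(n) for _ in range(d)] for _ in range(p)]
h = [[rng.standard_normal(n) for _ in range(d)] for _ in range(p)]

def rank1(vectors):
    t = vectors[0]
    for v in vectors[1:]:
        t = np.multiply.outer(t, v)
    return t

# explicit structured sum of rank-one tensors, Eq. (19)
ref = np.zeros((n,) * p)
for k in range(p):
    ref += rank1([f[k] if j == k else e[j] for j in range(p)])
for i in range(d):
    for k1 in range(p):
        for k2 in range(k1 + 1, p):
            ref += rank1([g[k1][i] if j == k1 else h[k2][i] if j == k2
                          else e[j] for j in range(p)])

# TT cores with rank d + 2; state 0 = nothing placed, 1 = finished, 2+i = open
r = d + 2
def core(k):
    M = np.zeros((r, n, r))
    M[0, :, 0] = e[k]
    M[0, :, 1] = f[k]                 # place the single f_k term
    M[1, :, 1] = e[k]                 # stay finished
    for i in range(d):
        M[0, :, i + 2] = g[k][i]      # open a pairwise term with g_{k,i}
        M[i + 2, :, 1] = h[k][i]      # close it with h_{k,i}
        M[i + 2, :, i + 2] = e[k]     # keep it open
    return M

cores = [core(k) for k in range(p)]
cores[0] = cores[0][:1]               # first core: a single row, cf. (20)
cores[-1] = cores[-1][:, :, 1:2]      # last core: a single column, cf. (22)

T = np.einsum('aib,bjc,ckd->ijk', *cores)
assert np.allclose(T, ref)
```

The check passes for any p, d, and vector entries, since the state machine encoded in the cores enumerates exactly the terms of (19).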

Representations of the Generator Data Tensor
Based on Proposition 3.1, we can now derive tensor train representations of the data tensors required for generator EDMD in TT format. We introduce the following notation: for a vector-valued function ψ : X → R^n, we write ∂ψ/∂x_i for the vector of partial derivatives of all functions ψ_u with respect to coordinate x_i. The associated Jacobian matrix is denoted by ∇ψ = [∂ψ_u/∂x_i]_{u,i} ∈ R^{n×d}.

Reversible Case
We consider the reversible generator L first, since only first order derivatives are required for its data-driven approximation (11).
Lemma 3.4. For any x ∈ X, the tensor ∇Ψ(x) ∈ R^{n_1×···×n_p×d} can be written in the form (19), with p + 1 modes and parameters

    e_k = ψ^(k)(x),  f_k = 0,  g_{k,i} = ∂ψ^(k)/∂x_i (x),  h_{k,i} = 0,    for k = 1, ..., p,
    e_{p+1} = 0,  f_{p+1} = 0,  g_{p+1,i} = 0,  h_{p+1,i} = ê_i,

where ê_i is the i-th canonical unit vector in R^d.

Proof. The representation follows directly from the product rule:

    ∇Ψ(x) = Σ_{i=1}^d Σ_{k=1}^p ψ^(1)(x) ⊗ ··· ⊗ ∂ψ^(k)/∂x_i (x) ⊗ ··· ⊗ ψ^(p)(x) ⊗ ê_i,

which is already (19) with the parameters given by the Lemma.
Corollary 3.5. For any x ∈ X, the tensor ∇Ψ(x) possesses the TT representation ∇Ψ(x) = ∇Ψ^(1)(x) ⊗ ··· ⊗ ∇Ψ^(p+1)(x), with TT ranks r_1 = ··· = r_{p−1} = d + 1, r_p = d, and cores

    ∇Ψ^(1)(x) = [ ψ^(1)(x)  ∂ψ^(1)/∂x_1 (x) ··· ∂ψ^(1)/∂x_d (x) ],

    ∇Ψ^(k)(x) = [ ψ^(k)(x)  ∂ψ^(k)/∂x_1 (x) ···  ∂ψ^(k)/∂x_d (x) ]
                [           ψ^(k)(x)                              ]    for k = 2, ..., p − 1,
                [                          ⋱                     ]
                [                               ψ^(k)(x)          ]

    ∇Ψ^(p)(x) = [ ∂ψ^(p)/∂x_1 (x) ···  ∂ψ^(p)/∂x_d (x) ]
                [ ψ^(p)(x)                              ]
                [              ⋱                       ]
                [                        ψ^(p)(x)       ]

    ∇Ψ^(p+1)(x) = [ ê_1 ··· ê_d ]ᵀ,

where the ê_i are the canonical unit vectors in R^d and all entries not shown are zero.

Proof. The result follows from Lemma 3.4, and by observing that the second row and column of each core in Proposition 3.1 can be omitted in this case, as they do not contribute to the final tensor. Moreover, the first column of the pre-last core ∇Ψ^(p)(x) can also be omitted, as it would be multiplied by a zero in ∇Ψ^(p+1)(x).
Remark 3.6. A TT representation of dΨ(x) can be obtained based on Corollary 3.5 by absorbing the additional factor σ into the final core. As dΨ(X) = Σ_{l=1}^m dΨ(x_l) ⊗ e_l, where e_l is the canonical unit vector in R^m, we can also obtain a TT decomposition of dΨ(X) using the summation rule for tensor trains [28].

Non-reversible Case
Next, we consider the generator L for a general (not necessarily reversible) SDE. The pointwise evaluation of L can be written as a sum of rank-one tensors as follows:

Lemma 3.7. For any x ∈ X, the tensor LΨ(x) ∈ R^{n_1×···×n_p} can be written in the form (19) with parameters

    e_k = ψ^(k)(x),  f_k = Lψ^(k)(x),  g_{k,i} = h_{k,i} = Σ_{j=1}^d σ_{j,i}(x) ∂ψ^(k)/∂x_j (x),

where σ_{:,i} denotes the i-th column of the diffusion σ.
Proof. Applying the product rule to the second-order differential operator (1) yields, for each entry of LΨ(x),

    L(ψ^(1)_{u_1} ··· ψ^(p)_{u_p}) = Σ_{k=1}^p [Π_{j≠k} ψ^(j)_{u_j}] Lψ^(k)_{u_k} + (1/2) Σ_{k_1≠k_2} [Π_{j≠k_1,k_2} ψ^(j)_{u_j}] (∇ψ^(k_1)_{u_{k_1}})ᵀ a (∇ψ^(k_2)_{u_{k_2}})
    = Σ_{k=1}^p [Π_{j≠k} ψ^(j)_{u_j}] Lψ^(k)_{u_k} + Σ_{k_1<k_2} Σ_{i=1}^d [Π_{j≠k_1,k_2} ψ^(j)_{u_j}] (σ_{:,i}ᵀ ∇ψ^(k_1)_{u_{k_1}}) (σ_{:,i}ᵀ ∇ψ^(k_2)_{u_{k_2}}),

using a = σσᵀ. In the last step, we used that the second term is symmetric in k_1 and k_2. This is the desired result.

The tensor train representation of LΨ(x) is then obtained from Proposition 3.1:

Corollary 3.8. The tensor train representation of LΨ(x), with ranks all equal to (d + 2), is given by the cores (20 -22), upon inserting e_k = ψ^(k)(x), f_k = Lψ^(k)(x), and g_{k,i} = h_{k,i} = dψ^(k)_{:,i}(x), where dψ^(k)(x) = ∇ψ^(k)(x) σ(x) ∈ R^{n_k×d}. In particular,

    LΨ^(1)(x) = [ ψ^(1)(x)  Lψ^(1)(x)  dψ^(1)_{:,1}(x) ··· dψ^(1)_{:,d}(x) ],
    LΨ^(p)(x) = [ Lψ^(p)(x)  ψ^(p)(x)  dψ^(p)_{:,1}(x) ··· dψ^(p)_{:,d}(x) ]ᵀ.

As LΨ(X) = Σ_{l=1}^m LΨ(x_l) ⊗ e_l, where e_l is the canonical unit vector in R^m, we can also obtain a TT decomposition of LΨ(X) using the summation rule for tensor trains [28].

Algorithmic Realization
In this section, we discuss the algorithmic realization of the tensor-based version of gEDMD. We first present the tgEDMD algorithm in Section 4.1 and briefly analyze its complexity and consistency for infinite data. Subsequently, we present two extensions of the method in Section 4.2.

The tgEDMD Algorithm
Based on the results of Section 3, the tensor-based counterpart of the AMUSE Algorithm 1, which we call tgEDMD, is given in Algorithm 2 below. Its output is a matrix representation M̂_{r_p} or M̂_{r_p}^rev of the generator, on the linear span of the functions η̂_{r_p}. This matrix representation can then be used for further analysis of the generator, e.g. for spectral analysis, prediction, or model reduction.

Algorithm 2 tgEDMD
Input: data points x_1, ..., x_m; elementary basis sets ψ^(k), k = 1, ..., p; coefficients b, σ of the SDE (2); truncation threshold ε (or fixed TT ranks).
Output: reduced matrix representation M̂_{r_p} or M̂_{r_p}^rev of L.
1: Form the TT representation (15) of the data tensor Ψ(X).
2: Compute the global SVD Ψ(X)|_p ≈ Û|_p Σ̂ V̂ᵀ.
3: for l = 1, ..., m do
4:    Contract Û with LΨ(x_l) (non-reversible case) or ∇Ψ(x_l) (reversible case) using Algorithm 3.
5: end for
6: Assemble M̂_{r_p} via (16), or M̂_{r_p}^rev via (17).
Computational Effort The computational effort required by Algorithm 2 can be broken down into three major contributions. The first is the global SVD of the data tensor Ψ(X). The method consists of a series of matrix SVDs, each matrix being of size r_{k−1} n_k × m. As is well known, the cost for these can be estimated as O(min{r_{k−1}² n_k² m, r_{k−1} n_k m²}). The cost of updating the next core in each step can be neglected in comparison. Secondly, the reduced matrices M̂_{r_p} or M̂_{r_p}^rev need to be computed. To this end, for each data point x_l, the contraction of the orthonormal tensor train Û with either LΨ(x_l) or ∇Ψ(x_l) in (16 -17) must be carried out. This can be achieved by Algorithm 3 below, see [28] for details. The tensor product in lines 1 and 3 is the Kronecker product between matrices defined in (14). Of course, Algorithm 3 can be applied to any pair of general tensors of compatible dimensions.

Algorithm 3 Tensor Network Contraction
Input: tensor trains T and Û of order p with identical mode sizes n_k and TT ranks s_k and r_k, respectively.
Output: contraction v of T and Û over all physical modes.
1: v = Σ_{u_1=1}^{n_1} T^(1)_{:,u_1,:} ⊗ Û^(1)_{:,u_1,:}.
2: for k = 2, ..., p do
3:    v = v · [Σ_{u_k=1}^{n_k} T^(k)_{:,u_k,:} ⊗ Û^(k)_{:,u_k,:}].
4: end for
5: Re-shape v to shape s_p × r_p.
6: return v

As the cores of LΨ(x_l) and ∇Ψ(x_l) are structured and sparse, we obtain the following result: the contraction of Û with LΨ(x_l) or ∇Ψ(x_l) by Algorithm 3 requires only O(Σ_{k=1}^p n_k d r_{k−1} r_k) operations.

Proof. We obtain the result by recalling that the TT ranks of LΨ(x_l) and ∇Ψ(x_l) are essentially equal to d by the results obtained in Sec. 3, and by observing that the sparsity patterns of their cores are invariant under Kronecker products. Taking LΨ(x_l) as an example, we have by (14) for all 2 ≤ k ≤ p − 1 that only O(d) of the (d + 2)² blocks of size r_{k−1} × r_k in LΨ(x_l)^(k)_{:,u_k,:} ⊗ Û^(k)_{:,u_k,:} are non-zero. The Kronecker product can therefore be evaluated in O(d r_{k−1} r_k) operations, while multiplication by a row vector from the left also costs O(d r_{k−1} r_k) operations. The full cost of applying Algorithm 3 is therefore O(Σ_{k=1}^p n_k d r_{k−1} r_k).
The remaining matrix multiplications appearing in the computation of the reduced matrices are again not relevant in comparison, leaving us with the estimate O(m Σ_{k=1}^p n_k d r_{k−1} r_k). Finally, if eigenvalues of the Kolmogorov operator are required, an additional cost of O(r_p³) for the diagonalization of the reduced matrix must be allowed. In summary, the total cost for the approximation of eigenvalues of the Kolmogorov operator can be estimated as

    O( Σ_{k=1}^p min{r_{k−1}² n_k² m, r_{k−1} n_k m²} + m d Σ_{k=1}^p n_k r_{k−1} r_k + r_p³ ).    (27)

It is instructive to compare these findings to the cost of applying the standard AMUSE Algorithm 1 in conjunction with the full product basis Ψ. Denoting the product of all mode sizes by N = Π_{k=1}^p n_k, and adding up the operation counts for the SVD of the data tensor Ψ(X), the assembly of the reduced matrix, and diagonalization of the latter, we end up with the estimate

    O( min{N² m, N m²} + N m r + r³ ).    (28)

In all but the smallest examples, the cost is dominated by the exponential dependence on the dimension p through N.
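A compact numpy version of the contraction in Algorithm 3 (a generic sketch without the sparsity optimizations discussed above, written here for illustration) can be checked against brute-force contraction of the full tensors:

```python
import numpy as np

def tt_contract(A, B):
    """Contract two tensor trains with identical mode sizes over all physical
    modes; returns a matrix of shape (last rank of A) x (last rank of B)."""
    v = None
    for Ak, Bk in zip(A, B):
        # sum of Kronecker products over the current physical index, cf. (14)
        step = sum(np.kron(Ak[:, u, :], Bk[:, u, :]) for u in range(Ak.shape[1]))
        v = step if v is None else v @ step
    return v.reshape(A[-1].shape[2], B[-1].shape[2])

rng = np.random.default_rng(5)
n = [3, 4, 2]
rA, rB = [1, 2, 3, 2], [1, 3, 2, 1]       # final rank of A is 2 (Remark 2.2)
A = [rng.standard_normal((rA[k], n[k], rA[k + 1])) for k in range(3)]
B = [rng.standard_normal((rB[k], n[k], rB[k + 1])) for k in range(3)]

# reference: contract both trains to full tensors, then sum over all modes
TA = np.einsum('aib,bjc,ckd->ijkd', *A)   # keep the final rank index of A
TB = np.einsum('aib,bjc,ckd->ijk', *B)
ref = np.einsum('ijkd,ijk->d', TA, TB)

assert np.allclose(tt_contract(A, B).ravel(), ref)
```

In this generic form, each step costs O(n_k s_{k−1} s_k r_{k−1} r_k); exploiting the sparsity of the generator cores reduces this to the O(n_k d r_{k−1} r_k) per-core cost stated above.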

Consistency
The discussion on consistency of the AMUSE algorithm, see Section 2.3, carries over to the tensor case, based on the analysis in Ref. [33, Section 5]. Therein, it was shown that if all SVDs in the global SVD are truncated according to a fixed vector of TT ranks r = [r_1, ..., r_p], one can construct a recursive sequence of subspaces G_{:k}, 1 ≤ k ≤ p (called multi-linear spectral subspaces), such that the linear span of the functions η̂_{r_p} consistently approximates the last of these subspaces G_{:p}, along with the projected Koopman generator. This result required non-degeneracy of a series of eigenvalues of recursively constructed Gramian matrices. Analogous to the matrix case, we can also impose a vector of truncation parameters [ε_1, ..., ε_p] instead and choose the reduced rank of each SVD in the global SVD method such that σ̂^k_{r_k} ≥ √m ε_k > σ̂^k_{r_k+1}, 1 ≤ k ≤ p. Given similar non-degeneracy conditions, the same arguments as in [33] can be used to show that there are limiting ranks r_k(ε_k) such that r_k → r_k(ε_k), 1 ≤ k ≤ p, and the linear span of η̂_{r_p} consistently approximates G_{:p} for the TT ranks [r_1(ε_1), ..., r_p(ε_p)].

Extensions
We conclude the methodological part of this paper by briefly discussing two practically relevant settings which require modifications of the proposed methods.
Importance Sampling First, obtaining data sampling the invariant measure µ is often a major challenge, for instance due to metastability of the dynamics (2). In principle, this difficulty can be circumvented as long as appropriate re-weighting factors can be calculated. Assume the data are generated according to a probability distribution with density ϑ on X, and that we are able to calculate the importance sampling ratio w(x) = Z ρ(x)/ϑ(x) at any point x ∈ X, where Z is a possibly unknown constant. This setting reflects a rather typical scenario where both the invariant density ρ and the sampling density ϑ can be evaluated up to an unknown normalization constant. Note, however, that a poor choice of the sampling density ϑ can drastically increase the variance of all estimators involved, but this problem is beyond the scope of the present work. To see how this affects the proposed methods, consider the matrix case first. Introducing the diagonal weighting matrix W = diag[w(x_1), ..., w(x_m)], the empirical estimators (7 -8) change according to:

    Ĉ(ψ) = (1/m) Ψ(X) W Ψ(X)ᵀ,    Â(ψ) = (1/m) Ψ(X) W LΨ(X)ᵀ.

We need to modify line 1 of Algorithm 1 to the computation of the rank-r SVD Ψ(X) W^{1/2} ≈ Û_r Σ̂_r V̂_rᵀ. The reduced matrices are then obtained as

    M̂_r = V̂_rᵀ W^{1/2} LΨ(X)ᵀ Û_r Σ̂_r⁻¹,    M̂_r^rev = −(1/2) Σ̂_r⁻¹ Û_rᵀ dΨ(X) (W ⊗ Id_d) dΨ(X)ᵀ Û_r Σ̂_r⁻¹.

Note that both estimators are indeed invariant with respect to the constant Z. In the tensor case, the following modifications must be made as a result: 1. The TT representation (15) of Ψ(X) is modified by scaling each unit vector in the last core by the square root of the corresponding re-weighting factor:

    Ψ(X)^(p+1) = [ w(x_1)^{1/2} e_1 ··· w(x_m)^{1/2} e_m ]ᵀ.

2. Application of the global SVD of Ψ(X) must be preceded by right-orthonormalization of Ψ(X)^(p+1), see [32].

Reduced matrices are calculated according to

    M̂_{r_p} = V̂ᵀ W^{1/2} (LΨ(X)|_p)ᵀ Û|_p Σ̂⁻¹,    M̂_{r_p}^rev = −(1/2) Σ̂⁻¹ Σ_{l=1}^m w(x_l) [(Û|_p)ᵀ ∇Ψ(x_l)|_p] a(x_l) [(Û|_p)ᵀ ∇Ψ(x_l)|_p]ᵀ Σ̂⁻¹,

in analogy to (16 -17).
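The re-weighting scheme can be illustrated on a one-dimensional toy problem (a hypothetical setting chosen for illustration, not from the paper): data are drawn from a broader Gaussian proposal ϑ = N(0, 2) instead of the invariant measure µ = N(0, 1) of an Ornstein-Uhlenbeck process with generator Lφ = −xφ' + φ''. Weighting with w ∝ ρ/ϑ recovers the generator spectrum:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 200_000
# proposal density theta = N(0, 2) instead of the invariant mu = N(0, 1)
X = rng.standard_normal(m) * np.sqrt(2.0)
# w proportional to rho/theta = exp(-x^2/2)/exp(-x^2/4); Z drops out
w = np.exp(-X**2 / 4.0)

Psi = np.vstack([X, X**2 - 1.0])                 # Hermite basis H1, H2
LPsi = np.vstack([-X, -2.0 * (X**2 - 1.0)])      # L phi = -x phi' + phi''

sqw = np.sqrt(w)
# weighted AMUSE: SVD of Psi(X) W^(1/2)
U, s, Vt = np.linalg.svd(Psi * sqw, full_matrices=False)
U_r, s_r, V_r = U[:, :2], s[:2], Vt[:2, :].T

# weighted reduced matrix: LPsi(X) is re-weighted by W^(1/2) as well
M_r = V_r.T @ (LPsi * sqw).T @ U_r @ np.diag(1.0 / s_r)
print(np.sort(np.linalg.eigvals(M_r).real))      # close to [-2, -1]
```

Multiplying all weights by an arbitrary positive constant leaves M_r unchanged, which mirrors the invariance with respect to Z noted above.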
Projection onto Reduced Variables Secondly, we discuss the use of basis functions defined on a set of reduced variables. For high-dimensional systems, it is often convenient to employ basis functions which are not directly defined on the state variables x_i, 1 ≤ i ≤ d, but on the image of some (possibly) non-linear descriptors ξ : X → Y, mapping into a lower-dimensional space Y. If V is a finite-dimensional space of functions defined on the lower-dimensional state space Y, the resulting Galerkin approximation to the generator possesses a useful interpretation, as discussed in great detail in Ref. [41]. Consider the space of functions in L²_µ which are in fact functions of ξ, i.e.

    V_ξ = {φ ∈ L²_µ(X) : φ = φ̃ ∘ ξ for some function φ̃ defined on Y}.
After identification of φ̃ with φ, V_ξ equals the weighted Lebesgue space L²_ν(Y), where ν is the marginal probability distribution of µ along ξ. Using the µ-orthogonal projector P_ξ onto L²_ν(Y), one can introduce the projected generator L_ξ = P_ξ L P_ξ. Given some technical assumptions [41], the projected generator L_ξ is again the generator of a stochastic differential equation (2), with effective drift and diffusion coefficients. Importantly, the following identities relating the generators L, L_ξ and their invariant measures µ, ν hold for all φ, φ̃ ∈ L²_ν(Y) (with φ in the domain of L_ξ):

    ⟨φ ∘ ξ, φ̃ ∘ ξ⟩_µ = ⟨φ, φ̃⟩_ν,    ⟨L(φ ∘ ξ), φ̃ ∘ ξ⟩_µ = ⟨L_ξ φ, φ̃⟩_ν.

As a consequence, any Galerkin approximation to L on a subspace V ⊂ L²_ν(Y) is simultaneously an approximation to L_ξ. Conversely, the empirical estimators (7 -9) can be used in conjunction with simulation data of the full process, sampling µ, in order to approximate the reduced generator L_ξ [34]. When applying the full generator L to a basis set on Y as above, derivatives of ξ with respect to the full state variables x_i must be taken using the chain rule. If V ⊂ D(L_ξ) is a tensor subspace, the following adjustments need to be made when applying tgEDMD: 1. In the reversible case, it suffices to replace a(x_l) in (17) by a_ξ(x_l) = ∇ξ(x_l) a(x_l) ∇ξ(x_l)ᵀ.
All derivatives appearing in Corollary 3.5 can then be taken simply with respect to the reduced variables y.
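The modified reversible estimator only requires the projected diffusion a_ξ. A minimal sketch with a hypothetical reduced variable ξ(x) = x_1² + x_2² (chosen for illustration) and constant full-state diffusion a = 2 Id:

```python
import numpy as np

# full-state diffusion a(x) = sigma sigma^T = 2 Id, as for sigma = sqrt(2) Id
a = 2.0 * np.eye(2)

# hypothetical reduced variable xi(x) = x_1^2 + x_2^2, so q = 1
def grad_xi(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]]])   # Jacobian, shape 1 x 2

x = np.array([1.0, 2.0])
a_xi = grad_xi(x) @ a @ grad_xi(x).T              # projected diffusion, 1 x 1
assert a_xi.shape == (1, 1)
assert np.isclose(a_xi[0, 0], 8.0 * (x @ x))      # = 2 * |2x|^2 = 40 here
```

Note that a_ξ generally depends on the full state x_l at which it is evaluated, even when a itself is constant.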

Numerical Results
Below, we illustrate the capabilities of the proposed methods by means of two examples. The first is a diffusion in a four-dimensional model potential (Section 5.1), the second is a data set of molecular dynamics simulations of the deca-alanine peptide (Section 5.2). For both examples, we mainly focus on reproducing the dominant spectrum of the generator. In the setting outlined in Sec. 2.1, the spectrum of the generator L is confined to the left complex half-plane, hence any eigenvalue κ_k must satisfy R(κ_k) ≤ 0, where R(·) is the real part. If the system is in addition reversible, all eigenvalues lie on the negative real half-line. For a broad class of systems, the largest eigenvalue κ_0 = 0 is non-degenerate, possibly followed by a number of real eigenvalues close to zero and a subsequent spectral gap, that is

    0 = κ_0 > R(κ_1) ≥ ··· ≥ R(κ_M) ≫ R,

where R < 0 is an upper bound for the remaining spectrum. This pattern, which we will encounter in both examples, is the signature of metastability, i.e. the existence of long-lived states such that interstate transitions are rare events [48].
By the spectral mapping theorem [49], any generator eigenvalue κ_k corresponds to an eigenvalue λ_k(t) = e^{κ_k t} of the Koopman operator. If R(κ_k) < 0, the Koopman eigenvalue thus decays exponentially with the time lag t. The characteristic decay time scale associated to λ_k(t), also called the implied time scale, is

    t_k = −t / log|λ_k(t)| = −1 / R(κ_k),

where |·| is the modulus of a complex number. Implied time scales have frequently been used as a metric in order to compare different approximate models for the Koopman operator or generator [11,12]. Our software implementation of the tgEDMD algorithm closely follows the discussion in Section 4.1. It is publicly available as part of the scikit-tt package, including the numerical examples presented below. The data required to reproduce these examples is also available online.
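Using the generator eigenvalues reported for the model system in Section 5.1 below, the implied time scales follow directly from t_k = −1/R(κ_k), independently of the lag time used in the spectral mapping:

```python
import numpy as np

kappa = np.array([0.0, -0.42, -1.19, -1.54])   # generator eigenvalues (Sec. 5.1)
t_impl = -1.0 / kappa[1:]                       # t_k = -1 / Re(kappa_k)
print(np.round(t_impl, 2))                      # matches t1=2.38, t2=0.84, t3=0.65

# consistency with the spectral mapping theorem: lambda_k(t) = exp(kappa_k t)
t = 0.5
lam = np.exp(kappa * t)
assert np.allclose(-t / np.log(np.abs(lam[1:])), t_impl)
```

The lag time t cancels exactly, which is why implied time scales are a convenient, lag-independent metric for comparing models.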

Lemon Slice Potential
Description The first system we study is defined by the reversible SDE (2) on X = R⁴, with drift given by the negative gradient b = −∇V of a scalar potential V and the diffusion equal to a constant multiple of the identity, σ ≡ √2 Id. The potential V is a sum of the two-dimensional Lemon Slice potential V_LS along the first two state variables and harmonic potentials V_h along the last two variables, that is

    V(x_1, ..., x_4) = V_LS(x_1, x_2) + V_h(x_3) + V_h(x_4),

where V_LS is most conveniently expressed in the two-dimensional polar coordinates r, ϕ derived from x_1, x_2. A contour plot of the Lemon Slice potential, which has also been used in a number of previous publications [50,51], is shown in Figure 1 A. The dynamics along each of the variables x_3, x_4 is independent of that along the remaining ones and equilibrates quickly. The dynamics along x_1, x_2, on the other hand, is metastable, with four long-lived states corresponding to the minima of the potential V_LS. Consequently, a Markov state model analysis using a box discretization in (x_1, x_2)-space yields estimates for the dominant eigenvalues, κ_0 = 0, κ_1 = −0.42, κ_2 = −1.19, κ_3 = −1.54, which translate into implied time scales t_1 = 2.38, t_2 = 0.84, t_3 = 0.65. This simple example mainly serves to illustrate the importance of choosing TT ranks appropriately, and also to illustrate the use of the re-weighting technique described in Section 4.2.
Methods For tgEDMD, we generate ten independent long simulations of the SDE (2) by the Euler-Maruyama scheme at integration time step ∆t = 10^-3, each spanning 3·10^5 time steps. These simulations are downsampled to m = 3000 data points each. Moreover, we generate ten independent data sets of the same size, drawn from a three-component Gaussian mixture distribution (GMM) shown in Figure 1 B, in order to test the re-weighting method. The parameters of the GMM were chosen such that samples correspond to physically relevant states with high probability, while drastically shifting the relative weights of the four metastable states. We define elementary subspaces V_k, 1 ≤ k ≤ 4, each spanned by Gaussian basis functions depending on x_k alone. For the first two variables, we use seven Gaussians centered equidistantly in [-1.2, 1.2], each with bandwidth √0.4, while for the remaining variables we use only five Gaussians centered in [-1.0, 1.0] with bandwidth √0.5. For each data set, we calculate the reduced matrix M^rev_rp by Algorithm 2. The truncation threshold for the global SVD is set to ε√m, where ε varies between ε = 10^-8 and ε = 10^-2. We calculate the first four dominant eigenvalues and implied time scales of the reduced matrix, along with their eigenvectors and the evaluations of the associated eigenfunctions at all data sites. Finally, these eigenfunction trajectories are passed to the PCCA method [52] to generate a clustering of the data into four metastable states. As a comparison, we also apply the standard AMUSE Algorithm 1 to the full product basis, using the same truncation parameters. We record the resulting ranks and implied time scales.
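The data-generation step can be sketched as follows: Euler-Maruyama sampling of a reversible SDE with constant diffusion σ ≡ √2 Id, downsampling, and evaluation of a one-dimensional Gaussian basis. The quadratic stand-in potential below is ours, used only to keep the sketch self-contained; it is not the Lemon Slice potential of the paper:

```python
import numpy as np

def euler_maruyama(grad_V, x0, dt, n_steps, rng):
    """Integrate dX = -grad V(X) dt + sqrt(2) dW with the Euler-Maruyama scheme."""
    X = np.empty((n_steps + 1, len(x0)))
    X[0] = x0
    for n in range(n_steps):
        X[n + 1] = X[n] - grad_V(X[n]) * dt \
                   + np.sqrt(2.0 * dt) * rng.standard_normal(len(x0))
    return X

# Stand-in potential V(x) = |x|^2 (harmonic), so grad V(x) = 2x; the paper
# instead combines the Lemon Slice potential in (x1, x2) with harmonic
# wells in x3, x4.
grad_V = lambda x: 2.0 * x

rng = np.random.default_rng(42)
traj = euler_maruyama(grad_V, np.zeros(4), dt=1e-3, n_steps=10_000, rng=rng)
data = traj[::100]          # downsampling, as described in the text

def gaussian_basis(x, centers, bandwidth):
    """Evaluate 1D Gaussian basis functions at all points of x."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * bandwidth ** 2))

centers = np.linspace(-1.2, 1.2, 7)            # seven equidistant centers
psi1 = gaussian_basis(data[:, 0], centers, np.sqrt(0.4))
print(psi1.shape)   # (101, 7): downsampled points x 7 basis functions
```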

Results
In Figure 2, we show estimates for the leading three implied time scales t_i, i = 1, 2, 3, obtained from the eigenvalues of the reduced matrix, averaged over ten independent data sets, using both simulation data of the SDE (2) (panel A) and data sampled from the GMM (panel B). Time scales obtained by the standard AMUSE algorithm are shown in panel C as a comparison. We observe that, using both the simulation data and the GMM data, the tgEDMD method robustly identifies all three time scales within statistical uncertainty, for a range of truncation parameters between ε = 10^-6 and ε = 10^-3. If a larger threshold is used, the approximation quality starts to deteriorate, especially for the slowest time scale. We also verify in panel E that a PCCA analysis of the eigenfunction trajectories computed by tgEDMD yields the correct decomposition of the data set into four metastable states. As shown in panel D of Figure 2, the maximal rank employed by tgEDMD varies between roughly 200 for ε = 10^-3 and about 800 for ε = 10^-6. These ranks are approximately equal to the truncation ranks achieved by standard AMUSE if the same thresholds are used. In summary, these results confirm that tgEDMD can robustly identify suitable reduced subspaces for spectral analysis of the Koopman generator. Due to the limited size of this example, however, there is no significant gain in terms of computational effort. In fact, we see in panel F that the estimators Eqs. (27)-(28) yield about the same computational cost for tgEDMD and standard gEDMD for all truncation ranks. One can observe, however, that the cost for tgEDMD depends more strongly on the truncation rank, whereas the cost of gEDMD is dominated by the basis set size N = ∏_{k=1}^{p} n_k, as expected.

[Figure 2 caption, partial: Circles indicate results from the SDE trajectories, crosses those obtained using the GMM data. Solid horizontal lines indicate the rank obtained by applying standard AMUSE to the SDE data, with the truncation threshold indicated by the color. E: Metastable decomposition projected onto (x_1, x_2), obtained by applying PCCA to the eigenfunction trajectories for one of the ten SDE trajectories and ε = 10^-3. F: Computational cost for tgEDMD and standard gEDMD, as estimated by Eqs. (27)-(28).]

Deca-Alanine
Description The second example is a data set resulting from equilibrium molecular dynamics simulations of the ten-residue peptide deca-alanine. The simulation setup can be found in [30]; we only note that the data include both positions and momenta of all atoms, and that the velocity re-scaling thermostat was used to control the temperature. The (downsampled) data set consists of m = 3·10^5 frames at a time spacing of 10 ps. The system has been analyzed in several previous publications, e.g. [14,53,30,33]. It is well known that the system possesses two major metastable conformations, namely the folded (helical) and the unfolded state. These can be distinguished by a collective transition in the space of backbone dihedral angles ϕ, ψ. Representing each (ϕ, ψ)-pair in its Ramachandran plane, the folded state is (approximately) characterized by all angle pairs lying in a region of the Ramachandran plane we will refer to as the α-region. In the unfolded state, we expect to see a mixture of angle pairs in the α-region and in the β-region, which is defined by ϕ ≤ 0° ∨ ϕ ≥ 130° and ψ ≤ -115° ∨ ψ ≥ 80° (see the upper left plot of Figure 3 C). The slowest implied time scale, corresponding to the folding process, was determined as t_1 ≈ 7.5 ns by an MSM analysis in [33]. Two additional time scales were estimated as t_2 ≈ 3.7 ns and t_3 ≈ 3.4 ns.
Methods As the physically relevant dynamics of deca-alanine can be described by dihedral angle coordinates, we use elementary subspaces of functions defined on ten internal dihedral angles for tgEDMD (the outermost dihedrals are more flexible). The elementary subspaces V_k are the same as in [33]; they are spanned by periodic Gaussian functions ψ_{k,u_k}(x) = exp( -(1 / (2 s_{k,u_k})) sin^2(0.5 (x_k - c_{k,u_k})) ).
Note that ψ_{k,u_k}(x) depends only on the k-th coordinate of the state x. If the k-th coordinate refers to a ϕ angle, we choose the basis set comprised of the constant function and two periodic Gaussians with s_{k,u_k} = 0.8, 0.5 and c_{k,u_k} = -2.0, 1.0. If the k-th coordinate refers to a ψ angle, we choose the basis set as the constant function and three periodic Gaussians with s_{k,u_k} = 0.8, 4.0, 0.8 and c_{k,u_k} = -0.5, 0.0, 2.0. Thus, the full tensor space V consists of 3^5 · 4^5 ≈ 2.5·10^5 basis functions. As the basis set is defined in terms of reduced coordinates, the effect of this projection needs to be taken into account as described in Section 4.2. The full-state generator is, in principle, given as the generator of the molecular dynamics engine on the full position and momentum space. For several variants of thermostatted MD, including Langevin dynamics and velocity re-scaling, it is well known that the dynamics on the atomic position space can be approximated on long time scales by a reversible SDE with constant diffusion (see [51] for a detailed discussion). However, this approximation induces a re-scaling of time by a typically unknown factor. Still, we make use of this approximation here by employing the reversible tgEDMD algorithm. Besides the basis functions and their derivatives, the only additional quantity we require is the Jacobian of all dihedral angles with respect to the atomic positions, which can be calculated analytically. We then compute the reduced matrix M^rev_rp by Algorithm 2 using downsampled trajectories of either m = 3000 or m = 30000 snapshots. In this example, the dominant singular values of the TT cores are rather large and differ significantly in size between the cores. Therefore, in contrast to the absolute cut-off used before, we employ a relative SVD truncation threshold: the threshold is set to ε√m s_1, where s_1 denotes the largest singular value and ε varies between ε = 10^-7 and ε = 10^-4.
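The periodic Gaussian basis can be written down directly. The following sketch (our code, using the parameter values quoted in the text) also checks the 2π-periodicity and the resulting product-basis size:

```python
import numpy as np

def periodic_gaussian(x, s, c):
    # psi_{k,u_k}(x) = exp(-(1 / (2 s)) * sin^2(0.5 (x - c))), 2*pi-periodic.
    return np.exp(-np.sin(0.5 * (x - c)) ** 2 / (2.0 * s))

def angle_basis(x, params):
    """Constant function plus one periodic Gaussian per (s, c) pair."""
    return np.column_stack([np.ones_like(x)]
                           + [periodic_gaussian(x, s, c) for s, c in params])

phi_params = [(0.8, -2.0), (0.5, 1.0)]              # phi angles: 3 functions
psi_params = [(0.8, -0.5), (4.0, 0.0), (0.8, 2.0)]  # psi angles: 4 functions

x = np.linspace(-np.pi, np.pi, 200)
B_phi = angle_basis(x, phi_params)   # shape (200, 3)
B_psi = angle_basis(x, psi_params)   # shape (200, 4)

# Five phi and five psi angles give the full tensor-space dimension:
print(3 ** 5 * 4 ** 5)   # 248832, i.e. ~ 2.5e5
```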
In addition, we cap the maximum allowed SVD rank at 500. As before, we calculate the dominant eigenvalues and implied time scales of the reduced matrix. In order to account for the change of time units, we calculate the ratio of the reference slowest time scale t_1 to our estimate and re-scale all time scales by that ratio. Of course, this forces all tgEDMD estimates of t_1 to match the reference exactly. However, by also comparing the next few estimated time scales to their reference values, we can verify whether our estimates really match the reference up to a re-scaling of time. Finally, we also compute a clustering of the data into metastable states using the PCCA method [52].
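Both the relative truncation rule and the time-unit rescaling are short computations. The sketch below (our helper names, not from scikit-tt) illustrates them under the stated relative threshold ε√m s_1 and the rank cap of 500:

```python
import numpy as np

def truncated_rank(s, eps, m, r_max=500):
    """Number of singular values s_1 >= s_2 >= ... kept under the relative
    rule: keep s_i > eps * sqrt(m) * s_1, capped at r_max."""
    s = np.asarray(s, dtype=float)
    r = int(np.sum(s > eps * np.sqrt(m) * s[0]))
    return min(r, r_max)

def rescale_timescales(ts_est, t1_ref):
    """Rescale all estimated time scales so that the slowest one matches the
    reference t1_ref; the first entry becomes t1_ref by construction."""
    ts_est = np.asarray(ts_est, dtype=float)
    return ts_est * (t1_ref / ts_est[0])

s = np.array([1.0, 1e-2, 1e-4, 1e-6])
print(truncated_rank(s, eps=1e-4, m=30000))      # threshold ~1.7e-2 -> rank 1
print(rescale_timescales([5.0, 2.5, 2.0], 7.5))  # [7.5, 3.75, 3.0]
```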
Results In panel A of Figure 3, we show estimates for the leading three implied time scales t_i, i = 1, 2, 3, obtained from the eigenvalues of the reduced matrix. We find that these time scales can be reliably approximated across the range of parameters and data sizes we tested. A relative error margin of ten to fifteen per cent is exceeded only for the largest SVD threshold ε = 10^-4. The ranks of the associated TT representations of the data tensor are shown in panel B of Figure 3. The maximal TT rank of 500 is attained only for the smallest threshold ε = 10^-7. The displayed ranks are similar to the results presented in [33], and we also confirm their finding that TT ranks on the order of 50 to 100 are sufficient to satisfactorily approximate the slow dynamics of the peptide. In panel C, we verify that the folded and unfolded states were correctly identified by tgEDMD. To this end, we separately compute histogram densities of all data points assigned to each metastable state, projected onto each of the five Ramachandran planes. The results obtained for the parameters m = 30000, ε = 10^-7 are shown in the first two rows. These results are consistently reproduced for all thresholds and data sizes we tested. As a reference, we show the same densities based on the clustering obtained from the eigenfunctions computed by tensor-based EDMD as in [33] (last two rows of panel C). We observe that both the folded and the unfolded state were identified correctly, up to a slight discrepancy in the relative weights of the α- and β-regions of the first two planes for the unfolded state. In this ten-dimensional example, the computational benefit of the presented tgEDMD method compared to the standard AMUSE algorithm (gEDMD) is already substantial, as Figure 4 illustrates. For the parameters we tested, the computational effort of tgEDMD is orders of magnitude smaller than for gEDMD.
For instance, we see that by setting ε = 10^-5 and m = 30000, accurate results can be achieved (cf. Figure 3) at a computational cost reduced by more than three orders of magnitude. Moreover, we confirm once again that the computational effort for tgEDMD depends primarily on the TT ranks, thus offering an efficient means of controlling the computational cost. On the other hand, the cost of gEDMD is dominated by the basis set size N and is essentially independent of the truncation rank.

[Figure 3: Results for the deca-alanine peptide. A: Estimates of the leading three implied time scales, obtained by applying the tgEDMD Algorithm 2 with Gaussian basis functions, using different values of the truncation threshold (ε = 10^-4, 10^-5, 10^-7) and different data sizes (m = 3000, 30000). Error bars were generated by selecting ten different subsets of the total simulation data set, which comprises 3·10^5 points. MSM-based reference values are shown in black. B: TT ranks for different values of the truncation threshold and different data sizes. C: Metastable decomposition of the state space, obtained by applying PCCA to the tgEDMD eigenfunction trajectories (m = 30000 and ε = 10^-7). First row: density of data points assigned to the first PCCA state (folded, blue) along the five Ramachandran planes. Second row: the same for the second PCCA state (unfolded, red). Third and fourth rows: densities of points assigned to the folded and unfolded states by applying PCCA to the eigenfunctions obtained by tensor-based EDMD as in [33]. Figure 4: Computational cost, as estimated by Eqs. (27)-(28), for the deca-alanine example.]

Conclusions
We have presented tgEDMD, a data-driven method to approximate the Kolmogorov backward operator, or generator, of a dynamical system driven by a stochastic differential equation, using the tensor train format. The centerpiece of the method is a TT representation of the generator data tensor, which we have derived for both reversible and non-reversible SDEs. We have shown that tgEDMD consistently approximates the generator on a multi-linear subspace of the full tensor space, and we have analyzed its computational complexity. Furthermore, we have shown how importance sampling methods can be incorporated and how the method can be used to compute an effective generator on a reduced state space. Finally, we have presented successful applications of the method to a low-dimensional test system and to a benchmark molecular dynamics data set.