Dimensionality Reduction of Complex Metastable Systems via Kernel Embeddings of Transition Manifolds

We present a novel kernel-based machine learning algorithm for identifying the low-dimensional geometry of the effective dynamics of high-dimensional multiscale stochastic systems. Recently, the authors developed a mathematical framework for the computation of optimal reaction coordinates of such systems that is based on learning a parametrization of a low-dimensional transition manifold in a certain function space. In this article, we enhance this approach by embedding and learning this transition manifold in a reproducing kernel Hilbert space, exploiting the favorable properties of kernel embeddings. Under mild assumptions on the kernel, the manifold structure is shown to be preserved under the embedding, and distortion bounds can be derived. This leads to a more robust and more efficient algorithm compared to previous parametrization approaches.


Introduction
Many of the dynamical processes investigated in the sciences today are characterized by the existence of phenomena on multiple, interconnected time scales that determine the long-term behavior of the process. Examples include the inherently multiscale dynamics of atmospheric vortex and current formation, which needs to be considered for effective weather prediction [24,29], or the vast difference in time scales on which bonded atomic interactions, side-chain interactions, and the resulting formation of structural motifs occur in biomolecules [19,12,10]. An effective approach to analyzing these systems is often the identification of a low-dimensional observable of the system that captures the interesting behavior on the longest time scale. However, the computerized identification of such observables from simulation data poses a significant computational challenge, especially for high-dimensional systems.
Recently, the authors have developed a novel mathematical framework for identifying such essential observables for the slowest time scale of a system [6]. The method, called the transition manifold approach, was primarily motivated by molecular dynamics, where the dynamics is typically described by a thermostated Hamiltonian system or by diffusive motion in a molecular energy landscape. In these systems, local minima of the potential energy landscape induce metastable behavior, i.e., the phenomenon that on long time scales, the dynamics is characterized by rare transitions between certain sets that happen roughly along interconnecting transition pathways [36,42,17]. The sought-after essential observables should thus resolve these transition events; they are called reaction coordinates in this context [46,4], a notion that we will adopt here. Despite its origins, the transition manifold approach is also applicable to other classes of reducible systems (which will also be demonstrated in this article).
At the heart of this approach is the insight that good reaction coordinates can be found by parametrizing a certain transition manifold M in the function space $L^1$. This manifold has strong connections to the aforementioned transition pathway [5], but is not equivalent to it. Its defining property is that, for times $\tau$ that fall between the fastest and slowest time scales, the transition density functions with relaxation time $\tau$ concentrate around M. The algorithmic strategy to parametrize M can then be summarized as follows:
1. Choose test points in the dynamically relevant regions of the state space.
2. Sample the transition densities for each test point by Monte Carlo simulation.
3. Embed the transition densities into a Euclidean space by a generic embedding function.
4. Parametrize the embedded transition densities with the help of established manifold learning techniques. The result is a reaction coordinate evaluated at the test points.

This reaction coordinate has been shown to be as expressive as the dominant eigenfunctions of the transfer operator associated with the system [6], which can be considered an "optimal" reaction coordinate [20,11,30]. One decisive advantage of our method, however, is the ability to compute the reaction coordinate locally (by choosing the test points), whereas with conventional methods, the inherently global computation of transfer operator eigenfunctions quickly becomes infeasible due to the curse of dimensionality. Kernel-based methods for the computation of eigenfunctions alleviate this problem to some extent [43,25]. Nevertheless, the number of dominant eigenfunctions critically depends on the number of metastable states, which can be significantly larger than the natural dimension of the reaction coordinate [6].
However, the algorithm proposed originally had several shortcomings related to the choice of the embedding function. First, in order to ensure the preservation of the manifold's topology under the embedding, the dimension of M had to be known in advance. Second, the particular way of choosing the embedding functions allowed no control over the distortion of M, which may render the parametrization problem numerically ill-conditioned.
The goal of this article is to overcome the aforementioned problems by kernelizing the transition manifold embedding. That is, we present a method to implicitly embed the transition manifold into a reproducing kernel Hilbert space (RKHS) with a proper kernel. The RKHS is-depending on the kernel-a high-or even infinite-dimensional function space with the crucial property that inner products between points embedded into it can be computed by cheap kernel evaluations, without ever explicitly computing the embedding [48,41]. In machine learning, this so-called kernel trick is often used to derive nonlinear versions of originally linear algorithms, by interpreting the RKHS-embedding of a data set as a high-dimensional, nonlinear transformation, and (implicitly) applying the linear algorithm to the transformed data. This approach has been successfully applied to methods such as principal component analysis (PCA) [40], canonical correlation analysis (CCA) [31], and time-lagged independent component analysis (TICA) [43], to name but a few.
Due to their popularity, the metric properties of the kernel embedding are well-studied [45,21,47,22,33]. In particular, for characteristic kernels, the RKHS is "large" in an appropriate sense, and geometrical information is well-preserved under the embedding. For our application, this means that distances between points on the transition manifold M are approximately preserved, and thus the distortion of M under the embedding can be bounded. This will guarantee that the final manifold learning problem is well-posed. Moreover, if the transformation induced by the kernel embedding is able to approximately linearize the transition manifold, there is hope that efficient linear manifold learning methods can be used to parametrize the embedded transition manifold.
The main contributions of this work are as follows:
1. We develop a kernel-based algorithm to approximate transition manifolds and compare it with its Euclidean embedding counterpart.
2. We derive measures for the distortion of the embedding and associated error bounds.
3. We illustrate the efficiency of the proposed approach using academic and molecular dynamics examples.

In Section 2, we formalize the definition of transition manifolds and derive conditions under which systems possess such manifolds. Section 3 introduces kernels and the induced RKHSs. Furthermore, we show that the algorithm to compute transition manifolds numerically can be written purely in terms of kernel evaluations, and we derive measures for the distortion of the manifold caused by the embedding into the RKHS. Numerical results illustrating the benefits of the proposed kernel-based methods are presented in Section 4; a conclusion and future work follow in Section 5.

Reaction coordinates based on transition manifolds
In what follows, let $\{X_t\}_{t \ge 0}$ (abbreviated as $X_t$) be a reversible, thus ergodic, stochastic process on a compact state space $X \subset \mathbb{R}^n$, and let $\rho$ be its unique invariant density. That is, if $X_0 \sim \rho$, then $X_t \sim \rho$ for all $t \ge 0$. For $x \in X$ and $t \ge 0$, let $p_x^t \colon X \to \mathbb{R}$ denote the transition density function of the system, i.e., $p_x^t$ describes the probability density of the process at time $t$, after having started in the point $x$ at time $0$. For fixed $t$ and $x$, we will often consider $p_x^t$ as a point in the space $L^1(X)$ (the space of absolutely integrable functions over $X$ with respect to the Lebesgue measure), and later also in other related function spaces. For the sake of clarity, we will from now on omit the argument of $L^p$ when referring to function spaces over $X$.

Reducibility of dynamical systems
We assume the state space dimension $n$ to be large. The main objective of this work is the identification of good low-dimensional reaction coordinates (RCs) or order parameters of the system. An $r$-dimensional RC is a smooth map $\xi \colon X \to Y$ from the full state space $X$ to a lower-dimensional space $Y \subset \mathbb{R}^r$, $r \ll n$. Loosely speaking, we call such an RC good if, on long enough time scales, the projected process $\xi(X_t)$ is approximately Markovian and the dominant spectral properties of the operator describing the density evolution of $\xi(X_t)$ resemble those of $X_t$. This ensures that important long-time statistical properties such as equilibration times are preserved under projection onto the RC. We will now introduce a method to find such RCs. The explanation that the RCs found in this way are indeed amenable to a rigorous quantitative measure of quality is deferred to Section 2.2.
The method is based on a novel framework, introduced by some of the authors in [6], that ties the existence of good RCs to certain geometrical properties of the family of transition densities. This framework is called the transition manifold framework. To motivate the main idea, we first introduce the fuzzy transition manifold:

Definition 2.1 (Fuzzy transition manifold). For a fixed lag time $\tau > 0$, the set
$$\widetilde{M} := \big\{ p_x^\tau \mid x \in X \big\} \subset L^1$$
is called the fuzzy transition manifold of the system.
The main idea behind the framework is based on the following observation: if the system is in a certain way separable into a slowly equilibrating $r$-dimensional part and a quickly equilibrating $(n-r)$-dimensional part, and the lag time $\tau$ is chosen large enough that the latter is essentially equilibrated while the former is not, then $\widetilde{M}$ concentrates around an $r$-dimensional manifold in $L^1$. Furthermore, any parametrization of this manifold is a good RC. Let us illustrate this with the aid of a simple example.
Example 2.2. Consider the process $X_t$ to be described by the overdamped Langevin dynamics
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t,$$
with the energy potential $V$, the inverse temperature $\beta$, and Brownian motion $W_t$. The potential depicted in Figure 1 possesses two metastable states, located around the local energy minima. Any realization of this system that started in one of the metastable sets will likely remain in that set for a long time. Moreover, a realization that started outside of the wells will with high probability quickly leave the transition region and move into one of the two metastable sets. Whether the trajectory will be trapped in the left or the right well depends almost exclusively on the horizontal coordinate of the starting point $x$, or, in other words, on the progress of $x$ along the transition path. Thus, for times $\tau$ that allow typical trajectories to move into one of the wells, the transition densities $p_x^\tau$ also depend only on the horizontal, but not the vertical, coordinate of $x$. This means that the fuzzy transition manifold $\widetilde{M}$ tightly concentrates around a one-dimensional manifold in $L^1$, and any parametrization of this manifold corresponds to a parametrization of the horizontal coordinate, and thus to a good RC.
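The behavior described in Example 2.2 can be reproduced with a few lines of code. The following is a minimal sketch, assuming a concrete double-well potential $V(x) = (x_1^2 - 1)^2 + x_2^2$ (an illustrative choice, not the potential of Figure 1): bursts of Euler–Maruyama realizations started in the transition region quickly commit to one of the two wells, and the endpoint distribution depends essentially only on the starting point's horizontal coordinate.

```python
import numpy as np

# Illustrative double-well potential: metastable wells near (-1, 0) and (1, 0);
# the horizontal coordinate plays the role of the reaction coordinate.
def grad_V(y):
    return np.column_stack((4.0 * y[:, 0] * (y[:, 0] ** 2 - 1.0), 2.0 * y[:, 1]))

def sample_burst(x0, tau, M, beta=2.0, dt=1e-3, seed=None):
    """Simulate M independent realizations of the overdamped Langevin dynamics
    dX = -grad V(X) dt + sqrt(2/beta) dW, all started in x0, and return the
    end points -- an empirical sample of the transition density p_x0^tau."""
    rng = np.random.default_rng(seed)
    y = np.tile(np.asarray(x0, dtype=float), (M, 1))
    noise_scale = np.sqrt(2.0 * dt / beta)
    for _ in range(int(tau / dt)):
        y += -grad_V(y) * dt + noise_scale * rng.standard_normal(y.shape)
    return y

# A burst started near the saddle drifts into one of the two wells.
ends = sample_burst([0.1, 0.0], tau=2.0, M=200, seed=0)
frac = np.mean(ends[:, 0] > 0)
print(frac)  # fraction of trajectories captured by the right well
```

Repeating this for starting points that differ only in the vertical coordinate yields statistically indistinguishable endpoint clouds, which is exactly the concentration of $\widetilde{M}$ around a one-dimensional manifold.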
The concept of (fuzzy) transition manifolds can be made rigorous by the following definition:

Definition 2.3 ((ε, r)-reducibility). The process $X_t$ is called $(\varepsilon, r)$-reducible if there exists an $r$-dimensional manifold of functions $M \subset \widetilde{M}$ such that for a fixed lag time $\tau$, it holds that
$$\min_{f \in M} \big\| p_x^\tau - f \big\|_{L^2_{1/\rho}} \le \varepsilon \quad \text{for all } x \in X. \qquad (2)$$
Any such $M$ is called a transition manifold of the system.
Figure 1: Illustration of the transition manifold concept for metastable systems.

Two remarks are in order:
1. Note that in the above definition, the $L^2_{1/\rho}$ norm is used to measure distances, where $L^2_{1/\rho}$ is the space of (equivalence classes of) functions that are square-integrable with respect to the measure induced by the function $1/\rho$, i.e., for $f \in L^2_{1/\rho}$,
$$\|f\|_{L^2_{1/\rho}} = \Big( \int_X f(x)^2\, \frac{1}{\rho(x)}\, dx \Big)^{1/2}.$$
Closeness with respect to the $L^2_{1/\rho}$-norm instead of the $L^1$-norm is indeed a strict requirement here, as measuring the quality of a given RC will require a Hilbert space structure, see Section 2.2. Note that under appropriate assumptions on the system, it holds that $p_x^t \in L^2_{1/\rho}$ for all $x \in X$. This will be shown in Lemma 3.13 and implies $\widetilde{M} \subset L^2_{1/\rho}$, which together with the requirement $M \subset \widetilde{M}$ makes (2) well-defined.
2. The original definition of $(\varepsilon, r)$-reducibility (see [6, Definition 4.4]) is marginally different from the definition above: instead of $M \subset L^2_{1/\rho}$, we here require $M \subset \widetilde{M} \subset L^2_{1/\rho}$. This slightly stronger technical requirement allows us to later control a certain embedding error, see Proposition 2.8. Note that the proofs in [6] regarding the optimality of the final reaction coordinate are not affected by this change.
In what follows, we always assume that the process $X_t$ is $(\varepsilon, r)$-reducible with small $\varepsilon$ and $r \ll n$.
Remark 2.4. In addition to metastable systems, many other relevant problems possess such transition manifolds. For instance, one can show that, under mild conditions on $f$ and $g$, systems with explicit time scale separation, given by
$$dx = f(x, y)\,dt + dW_t^1, \qquad dy = \frac{1}{\epsilon}\, g(x, y)\,dt + \frac{1}{\sqrt{\epsilon}}\, dW_t^2,$$
also possess a transition manifold, since $p_{(x_0, y_0)}^t$ essentially depends only on $x_0$ for $0 < \epsilon \ll 1$ and $t \gg \epsilon$.

A measure for the quality of reaction coordinates
We will now present a measure for evaluating the quality of reaction coordinates that is based on transfer operators; it was first derived in [6]. The Perron–Frobenius operator $P^t \colon L^1 \to L^1$ associated with the process $X_t$ is defined by
$$ (P^t u)(y) = \int_X p_x^t(y)\, u(x)\, dx. $$
This operator can be seen as the push-forward of arbitrary starting densities, i.e., if $X_0 \sim u$, then $X_t \sim P^t u$.
As $L^2_{1/\rho} \subset L^1$ (see [6, Remark 4.6]), we can consider $P^t$ as an operator on the inner product space $L^2_{1/\rho}$, where it has particularly advantageous properties (see [2,38,26]). Here, it is self-adjoint due to the reversibility of $X_t$. Moreover, under relatively mild conditions, it does not exhibit any essential spectrum [42]. Hence, its eigenfunctions form an orthonormal basis of $L^2_{1/\rho}$ and the associated eigenvalues are real. The significance of the dominant eigenpairs for the system's time scales is well-known [42]; this is the primary reason for the choice of the $L^2_{1/\rho}$-norm in Definition 2.3. Let $\theta_i^t$ be the eigenvalues of $P^t$, sorted by decreasing absolute value, and $\psi_i$ the corresponding eigenfunctions, where $i = 0, 1, \ldots$. It holds that $\theta_0 = 1$ is independent of $t$, isolated, and the sole eigenvalue with absolute value $1$. Furthermore, $\psi_0 = \rho$. The subsequent eigenvalues decrease monotonically to zero both for increasing index and increasing time, that is,
$$ 1 = \theta_0 > |\theta_1^t| \ge |\theta_2^t| \ge \ldots \quad \text{and} \quad |\theta_i^t| \xrightarrow{t \to \infty} 0 \text{ for } i \ge 1. $$
The associated eigenfunctions $\psi_1, \psi_2, \ldots$ can be interpreted as sub-processes of decreasing longevity in the following sense: let
$$ P^t u = \sum_{i=0}^{\infty} \theta_i^t\, \langle u, \psi_i \rangle_{L^2_{1/\rho}}\, \psi_i $$
be the spectral decomposition of $P^t u$. Since for the lag time $\tau > 0$ as defined above there exists an index $d \in \mathbb{N}$ such that $|\theta_i^t| \approx 0$ for all $t \ge \tau$ and all $i > d$, the major part of the information about the long-term density propagation of $X_t$ is encoded in the $d$ dominant eigenpairs.
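The dominance of a few eigenpairs is easy to see in a discrete toy model. The following sketch (an illustrative four-state Markov chain, not taken from the paper) shows the characteristic spectral structure: one eigenvalue equal to $1$, one slow eigenvalue close to $1$ corresponding to the rare inter-well transitions, and the rest near zero.

```python
import numpy as np

# Toy 4-state Markov chain with two metastable pairs {0,1} and {2,3}:
# transitions within a pair are fast, transitions between pairs are rare.
eps = 0.01  # rare-transition probability
P = np.array([
    [0.50 - eps, 0.50,       eps,        0.0       ],
    [0.50,       0.50 - eps, 0.0,        eps       ],
    [eps,        0.0,        0.50 - eps, 0.50      ],
    [0.0,        eps,        0.50,       0.50 - eps],
])

# Eigenvalues of the transfer operator (here simply the transition matrix),
# sorted by decreasing absolute value as in the text.
theta = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
print(theta)
# theta[0] = 1 always; theta[1] is close to 1 (slow inter-pair process),
# while theta[2], theta[3] are small (fast intra-pair equilibration),
# so d = 1 dominant nontrivial eigenpair suffices here.
```

Raising `P` to a power `t` damps the $i$-th mode by $\theta_i^t$, so after moderate $t$ only the $d$ dominant eigenpairs contribute, mirroring the truncation of the spectral decomposition above.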
The operator $P^t$ describes the evolution of densities of the full process $X_t$. In order to monitor the dependence of densities on the reduced coordinate $\xi$ only, we first introduce the projection operator $\Pi_\xi \colon L^2_{1/\rho} \to L^2_{1/\rho}$,
$$ (\Pi_\xi u)(x) = \mathbb{E}_\rho\big[ u(x') \mid \xi(x') = \xi(x) \big], $$
where the $\rho$-weighted expectation value is taken with respect to the random variable $x'$. This operator is also known as the Zwanzig projection operator in statistical physics [50]. Intuitively, $\Pi_\xi$ averages a function over the individual level sets of $\xi$.
The effective transfer operator $P_\xi^t \colon L^1(X) \to L^1(X)$ associated with $\xi$ is then given by
$$ P_\xi^t = \Pi_\xi P^t \Pi_\xi, $$
see [6]. We now want to preserve the statistics of the dominant long-term dynamics of $X_t$ under the projection onto $\xi$, i.e.,
$$ P^t u \approx P_\xi^t u \quad \text{for } t \ge \tau, \qquad (4) $$
where $\tau$ is some lag time that is long enough for the fast processes, associated with the non-dominant eigenpairs, to have equilibrated. A sufficient condition for (4) is
$$ \Pi_\xi \psi_i \approx \psi_i \quad \text{for } i = 0, \ldots, d, $$
that is, the dominant eigenfunctions $\psi_i$ must be almost constant along the level sets of $\xi$. This motivates the following definition of a good reaction coordinate:

Definition 2.5 (Good reaction coordinate). Let $(\theta_i^t, \psi_i)$ be the eigenpairs of the Perron–Frobenius operator. Let $\tau > 0$ and $d \in \mathbb{N}$ be such that $\theta_i^t \approx 0$ for all $i > d$ and $t \ge \tau$. We call a function $\xi \colon X \to \mathbb{R}^r$ a good reaction coordinate if for all $i = 0, \ldots, d$ there exist functions $\tilde\psi_i \colon \mathbb{R}^r \to \mathbb{R}$ such that
$$ \psi_i \approx \tilde\psi_i \circ \xi. \qquad (5) $$
If condition (5) is fulfilled, we say that $\xi$ (approximately) parametrizes the dominant eigenfunctions.

Optimal reaction coordinates
We now justify why reaction coordinates that are based on parametrizations of the transition manifold M indeed fulfill condition (5). Let $Q \colon L^2_{1/\rho} \to L^2_{1/\rho}$ be the nearest-point projection onto M, i.e.,
$$ Q(p) := \arg\min_{f \in M} \| p - f \|_{L^2_{1/\rho}}. $$
Assume further that some parametrization $\gamma \colon M \to \mathbb{R}^r$ of M is known, i.e., $\gamma$ is one-to-one between M and its image in $\mathbb{R}^r$. Then the reaction coordinate $\xi \colon \mathbb{R}^n \to \mathbb{R}^r$ defined by
$$ \xi(x) := \gamma\big( Q(p_x^\tau) \big) \qquad (6) $$
is good in the sense of Definition 2.5, due to the following theorem:

Theorem 2.6 ([6, Corollary 3.8]). Let the system be $(\varepsilon, r)$-reducible and $\xi$ defined as in (6). Then for all $i = 0, \ldots, d$, there exist functions $\tilde\psi_i \colon \mathbb{R}^r \to \mathbb{R}$ such that
$$ \big\| \psi_i - \tilde\psi_i \circ \xi \big\|_{L^2_{1/\rho}} \le \frac{\varepsilon}{|\theta_i^\tau|}. \qquad (7) $$

Let us add two remarks:
1. The choice of the $L^2_{1/\rho}$-norm in Definition 2.3 is crucial for Theorem 2.6 to hold.

2. Metastable systems typically exhibit a time scale gap after the $d$-th dominant eigenvalue, i.e., $|\theta_{d+1}^t| \ll |\theta_d^t| \le 1$ for suitably large $t > 0$. Therefore, $\tau$ can always be chosen such that $\theta_{d+1}^\tau$ is close to zero while $|\theta_i^\tau|$, $i = 0, \ldots, d$, is still relatively large. Consequently, the denominator in (7) is not too small, and thus the RC (6) is indeed good according to Definition 2.5.
The main task for the rest of the paper is now the numerical computation of an (approximate) parametrization γ of M.

Whitney embedding of the transition manifold
One approach to find a parametrization of M, proposed by the authors in [6], is to first embed M into a more accessible Euclidean space and to parametrize the embedded manifold. In order to later compare it with our new method, we will briefly describe this approach here.
To construct an embedding E that preserves the topological structure of M, without prior knowledge about M, a variant of the Whitney embedding theorem can be used. It extends the classic Whitney theorem to arbitrary Banach spaces and was proven by Hunt and Kaloshin in [23].
Theorem 2.7 (Whitney embedding theorem in Banach spaces, [23]). Let $V$ be a Banach space and let $K \subset V$ be a manifold of dimension $r$. Let $k > 2r$ and let $\alpha_0 = \frac{k - 2r}{k(r+1)}$. Then for all $\alpha \in (0, \alpha_0)$, for almost every (in the sense of prevalence) bounded linear map $F \colon V \to \mathbb{R}^k$ there exists a $C > 0$ such that for all $x, y \in K$,
$$ \| x - y \|_V \le C\, \| F(x) - F(y) \|_2^{\alpha}, $$
where $\|\cdot\|_2$ denotes the Euclidean norm in $\mathbb{R}^k$. In particular, almost every $F$ is one-to-one between $K$ and its image, and $F^{-1}|_{F(K)}$ is Hölder continuous with exponent $\alpha$.
In particular, almost every such map $F$ is a homeomorphism between $K$ and its image in $\mathbb{R}^k$, which in short is called an embedding of $K$ (see, e.g., [35, §18]). This means that the image $F(M)$ will again be an $r$-dimensional manifold in $\mathbb{R}^k$, provided that $k > 2r$. We will apply this result to the transition manifold, i.e., $V = L^2_{1/\rho}$ and $K = M$, and for simplicity restrict ourselves to the lowest admissible embedding dimension, i.e., $k = 2r + 1$. Any "randomly selected" continuous map $F \colon L^2_{1/\rho} \to \mathbb{R}^{2r+1}$ is then an embedding of M. Unfortunately, there is no practical way to draw randomly from the space of continuous maps on $L^2_{1/\rho}$ directly. Instead of arbitrary continuous maps, we therefore restrict our considerations to maps $F$ of the form
$$ F(f) = \int_X \eta(x)\, f(x)\, dx, \qquad \eta(x) = A x, \quad A \sim \sigma, \qquad (8) $$
where $\sigma$ is some distribution on the (finite-dimensional) space of $(2r+1) \times n$ matrices (e.g., Gaussian matrices). The linear map $\eta \colon X \to \mathbb{R}^{2r+1}$, called the feature map, is bounded due to the boundedness of $X$. Maps of the form (8) are therefore continuous on $L^1$, and thus in particular on the subspace $L^2_{1/\rho}$. By drawing from the distribution $\sigma$ of the matrices $A$, we can effectively sample maps of the form (8). We assume from now on that the embedding property from Theorem 2.7 holds not only for a prevalent subset of general continuous maps, but already for a prevalent subset of the maps of the form (8). In other words, we assume that a randomly drawn function of the form (8) with linear $\eta$ is almost surely an embedding of M. While the validity of this assumption is far from obvious for general manifolds, there is empirical evidence that for the typically "smooth" transition manifolds M, it is indeed fulfilled. Still, this necessary restriction to the class of linear feature maps represents a weak point of the transition manifold method that will later be resolved by using kernel embeddings instead.
The dynamical embedding of a point $x \in X$ is then defined by
$$ E(x) := F(p_x^\tau) = \int_X \eta(y)\, p_x^\tau(y)\, dy = \mathbb{E}\big[ \eta(X_\tau) \mid X_0 = x \big]. \qquad (9) $$
This is the Euclidean representation of the density $p_x^\tau$, and the set $\{ E(x) \mid x \in X \} \subset \mathbb{R}^{2r+1}$ is the Euclidean representation of the fuzzy transition manifold. It again clusters around an $r$-dimensional manifold in $\mathbb{R}^{2r+1}$, namely the image $F(M)$ of the transition manifold under $F$:

Proposition 2.8. Let the process $X_t$ be $(\varepsilon, r)$-reducible with transition manifold M, and let $F \colon L^2_{1/\rho} \to \mathbb{R}^{2r+1}$ and $E \colon \mathbb{R}^n \to \mathbb{R}^{2r+1}$ be defined as in (8) and (9). Then
$$ \min_{v \in F(M)} \| E(x) - v \|_2 \le \|\eta\|_\infty\, \varepsilon \quad \text{for all } x \in X. $$

Proof. Let $x \in X$. By the $(\varepsilon, r)$-reducibility of $X_t$ (Definition 2.3) and the fact that $M \subset \widetilde{M}$, i.e., M itself consists of transition densities, there exists an $x^* \in X$ such that $p_{x^*}^\tau \in M$ and $\| p_x^\tau - p_{x^*}^\tau \|_{L^2_{1/\rho}} \le \varepsilon$. Hence,
$$ \| E(x) - F(p_{x^*}^\tau) \|_2 = \| F(p_x^\tau - p_{x^*}^\tau) \|_2 \le \|\eta\|_\infty\, \| p_x^\tau - p_{x^*}^\tau \|_{L^1} \le \|\eta\|_\infty\, \| p_x^\tau - p_{x^*}^\tau \|_{L^2_{1/\rho}} \le \|\eta\|_\infty\, \varepsilon. $$

Remark 2.9. Together, Theorem 2.7 and Proposition 2.8 guarantee at least a minimal degree of well-posedness of the embedding problem: the embedded manifold $F(M)$ has the same topological structure as M, and $F(\widetilde{M})$ clusters closely around it (if $\|\eta\|_\infty\, \varepsilon$ is small). However, guarantees on the conditioning of the problem cannot be made. The manifold M will in general be distorted by $F$, to a degree that might pose problems for numerical manifold learning algorithms. This problem is illustrated in Figure 2. Such a situation typically occurs if some of the components of the embedding $F$ are strongly correlated. Additionally, the Whitney embedding theorem cannot guarantee that the fuzzy transition manifold $\widetilde{M}$ will be preserved under the embedding, as analytically $\widetilde{M}$ is not a manifold. Thus, $F$ is in general not injective on $\widetilde{M}$.
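The dynamical embedding (9) is straightforward to estimate by Monte Carlo. The following is a minimal sketch in which the SDE bursts are replaced by a hypothetical Gaussian surrogate for $p_x^\tau$ (constructed so that the density depends only on one slow coordinate), purely to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

n, r = 10, 1            # state dimension and assumed manifold dimension
k = 2 * r + 1           # lowest admissible embedding dimension (Whitney)
A = rng.standard_normal((k, n))  # random linear feature map eta(x) = A x

def sample_burst(x, M):
    """Surrogate for M samples of p_x^tau (stands in for the SDE bursts);
    by construction the density depends only on the slow coordinate x[0]."""
    mean = np.zeros(n)
    mean[0] = np.tanh(x[0])
    return mean + 0.1 * rng.standard_normal((M, n))

def dynamical_embedding(x, M=500):
    """Monte Carlo estimate of E(x) = E[eta(X_tau) | X_0 = x] = A E[X_tau]."""
    y = sample_burst(x, M)
    return A @ y.mean(axis=0)

# Test points differing only in fast coordinates embed (nearly) identically;
# test points differing in the slow coordinate embed to distinct points.
e1 = dynamical_embedding(np.zeros(n))
x2 = np.zeros(n); x2[1] = 5.0       # differs only in a fast coordinate
e2 = dynamical_embedding(x2)
x3 = np.zeros(n); x3[0] = 2.0       # differs in the slow coordinate
e3 = dynamical_embedding(x3)
print(np.linalg.norm(e1 - e2), np.linalg.norm(e1 - e3))
```

The first distance is pure Monte Carlo noise, while the second reflects the genuine separation along the slow coordinate, illustrating why the embedded point cloud concentrates near the $r$-dimensional image $F(M)$.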

Data-driven algorithm for parametrizing the transition manifold
Due to the implicit definition of M, the embedded transition manifold $F(M)$ is hard to analyze directly. However, as $M \subset \widetilde{M}$ and $F(\widetilde{M})$ concentrates $\|\eta\|_\infty\,\varepsilon$-closely around $F(M)$, one can expect that any parametrization of the dominant directions of $F(\widetilde{M})$ is also a good parametrization of $F(M)$. We now explain how $F(\widetilde{M})$ can be sampled numerically and how this sample can be parametrized.
Let $X_N = \{x_1, \ldots, x_N\}$ be a finite sample of state space points that covers the "dynamically relevant" part of the state space, i.e., the regions of $X$ of substantial measure $\rho$. The exact distribution of the sampling points is not important here. If $X$ is bounded or periodic, $X_N$ could be drawn from the uniform distribution or chosen to form a regular grid. In practice, it often consists of a sample from the system's equilibrium measure $\rho$.
The set $F\big(\{ p_x^\tau \mid x \in X_N \}\big)$ will serve as our sample of $F(\widetilde{M})$. Its elements can be computed numerically in the following way: let $X_\tau(x_0, \omega)$ denote the end point of the time-$\tau$ realization of $X_t$ starting in $x_0 \in X$ with outcome $\omega \in \Omega$, where $\Omega$ is the sample space underlying the process $X_t$. For $x \in X$, $\tau > 0$ fixed as in Definition 2.3, and arbitrarily chosen $\{\omega_1, \ldots, \omega_M\} \subset \Omega$, let $y^{(k)}(x) := X_\tau(x, \omega_k)$. In short, the $y^{(k)}(x)$, $k = 1, \ldots, M$, sample the density $p_x^\tau$. In practice, the $y^{(k)}(x)$ will be generated by multiple runs of a numerical SDE solver starting in $x$ with $M$ different random seeds ("bursts" of simulations).
With the samples $y^{(k)}(x)$, we approximate $F(p_x^\tau)$ by its Monte Carlo estimator
$$ \widehat{E}(x) := \frac{1}{M} \sum_{k=1}^{M} \eta\big( y^{(k)}(x) \big) \approx F(p_x^\tau). $$
Due to Proposition 2.8, the point cloud $F\big(\{ p_x^\tau \mid x \in X_N \}\big)$, and for a large enough burst size $M$ also its empirical estimator $\widehat{E}(X_N)$, clusters around the $r$-dimensional manifold $F(M)$. Parametrizing $\widehat{E}(X_N)$, i.e., finding the dominant nonlinear directions in this point cloud in $\mathbb{R}^{2r+1}$, can now be accomplished by a variety of classical manifold learning methods. We assume that we have a method at our disposal that is able to discover the underlying $r$-dimensional manifold within the point cloud $\widehat{E}(X_N)$ and to assign each of the points $\widehat{E}(x)$, $x \in X_N$, a value $\widehat{\gamma}\big(\widehat{E}(x)\big) \in \mathbb{R}^r$ according to its position on that manifold. For examples of such algorithms, see Section 3.1. Hence, $\widehat{\gamma} \colon \widehat{E}(X_N) \to \mathbb{R}^r$ can be seen as an approximate parametrization of $F(M)$, defined however only at the points $\widehat{E}(X_N)$. Any parametrization of $F(M)$ in turn corresponds to a parametrization of M, due to $F$ being an embedding. Finally, any parametrization of M corresponds to a good reaction coordinate due to Theorem 2.6. Thus, the map
$$ \xi(x) := \widehat{\gamma}\big( \widehat{E}(x) \big) $$
forms a good reaction coordinate, which however is only defined at the sample points $X_N$. The strategy of computing reaction coordinates by embedding densities sampled from $\widetilde{M}$ into $\mathbb{R}^{2r+1}$ by a random linear map and learning a parametrization of the embedded manifold was first presented in [6]. The following algorithm summarizes the overall procedure:
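The procedure just described can be sketched end to end in a few lines. The following illustrative implementation uses a one-dimensional double-well potential $V(x) = (x^2-1)^2$ (an assumption for the sake of the example, not a system from the paper) and plain PCA as the manifold learner, which suffices here because the embedded manifold is close to a line segment:

```python
import numpy as np

rng = np.random.default_rng(1)

def burst_endpoints(x0, tau=1.0, M=300, beta=2.0, dt=1e-3):
    """End points of M Langevin realizations for the illustrative 1D
    double well V(x) = (x^2 - 1)^2 (stand-in for the MD simulator)."""
    y = np.full(M, float(x0))
    for _ in range(int(tau / dt)):
        y += -4.0 * y * (y**2 - 1.0) * dt \
             + np.sqrt(2.0 * dt / beta) * rng.standard_normal(M)
    return y

# Step 1: test points in the dynamically relevant region.
X_N = np.linspace(-1.2, 1.2, 9)
# Steps 2-3: sample bursts and embed them with a random linear feature map
# eta(y) = A y, A Gaussian, embedding dimension k = 2r + 1 = 3 for r = 1.
A = rng.standard_normal((3, 1))
E_hat = np.array([(A @ burst_endpoints(x)[None, :]).mean(axis=1) for x in X_N])
# Step 4: parametrize the embedded point cloud; here PCA via the SVD.
E_c = E_hat - E_hat.mean(axis=0)
_, _, Vt = np.linalg.svd(E_c, full_matrices=False)
xi = E_c @ Vt[0]          # 1D reaction coordinate at the test points
print(np.round(xi, 2))    # monotone (up to sign) in the slow coordinate
```

The resulting values `xi` order the test points along the slow coordinate, i.e., they recover (up to sign and scale) the progress along the transition path.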

Kernel-based parametrization of the transition manifold
The approach described above for learning a parametrization of the transition manifold M by embedding it into Euclidean spaces requires a priori knowledge of the dimension of M. Also, more importantly, M might be strongly distorted by the embedding F, as described in Section 2.4. The kernel-based parametrization, which is the main novelty of this work, will address both of these shortcomings by embedding M into reproducing kernel Hilbert spaces.

Kernel reformulation of the embedding algorithm
Manifold learning algorithms that can be used in Algorithm 2.1 include Diffusion Maps [14], Multidimensional Scaling [49,28], and Locally Linear Embedding [37]. These, and many others, require only a notion of distance between pairs of data points. In our case, this amounts to the Euclidean distances between embedded points, i.e., $\| E(x_i) - E(x_j) \|_2$, which can be computed from the Euclidean inner products $\langle E(x_i), E(x_j) \rangle$, as
$$ \| E(x_i) - E(x_j) \|_2^2 = \langle E(x_i), E(x_i) \rangle - 2\, \langle E(x_i), E(x_j) \rangle + \langle E(x_j), E(x_j) \rangle. $$
Other compatible algorithms, such as Principal Component Analysis, are based directly on the inner products. The inner products can be written as
$$ \langle E(x_i), E(x_j) \rangle = \Big\langle \int_X \eta(y)\, p_{x_i}^\tau(y)\, dy,\; \int_X \eta(y')\, p_{x_j}^\tau(y')\, dy' \Big\rangle, $$
and the empirical counterpart is
$$ \langle \widehat{E}(x_i), \widehat{E}(x_j) \rangle = \frac{1}{M^2} \sum_{k,l=1}^{M} \big\langle \eta\big( y^{(k)}(x_i) \big),\, \eta\big( y^{(l)}(x_j) \big) \big\rangle. $$
However, rather than explicitly computing the inner product between the features on the right-hand side, we now assume that it can be computed implicitly by using a kernel function $k \colon X \times X \to \mathbb{R}$ with
$$ k(x, x') = \langle \eta(x), \eta(x') \rangle. \qquad (10) $$
This assumption, called the kernel trick, is commonly used to avoid the costly computation of inner products between high-dimensional features. However, instead of defining the kernel $k$ based on previously chosen features, one typically considers kernels that implicitly define high- and possibly infinite-dimensional feature spaces. In this way, we are able to avoid the choice of the feature map $\eta$ altogether. Kernels with this property span a so-called reproducing kernel Hilbert space:

Definition 3.1 (Reproducing kernel Hilbert space). Let $k \colon X \times X \to \mathbb{R}$ be a positive definite kernel. The reproducing kernel Hilbert space (RKHS) $H$ associated with $k$ is the Hilbert space of functions $f \colon X \to \mathbb{R}$ with inner product $\langle \cdot, \cdot \rangle_H$ such that
(i) $k(x, \cdot) \in H$ for all $x \in X$,
(ii) $f(x) = \langle f, k(x, \cdot) \rangle_H$ for all $f \in H$ (the reproducing property),
and $H = \overline{\operatorname{span}\{ k(x, \cdot) \mid x \in X \}}$. Here, $\overline{A}$ denotes the completion of a set $A$ with respect to $\|\cdot\|_H$. Requirement (ii) implies that
$$ k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_H. \qquad (11) $$
The inner product between general functions $f, g \in \operatorname{span}\{ k(x, \cdot) \mid x \in X \}$ can therefore be expressed as a weighted sum of kernel evaluations: let
$$ f = \sum_{i=1}^{m} a_i\, k(x_i, \cdot), \qquad g = \sum_{j=1}^{m'} b_j\, k(x_j', \cdot), $$
where the selection of the points $x_i$ and $x_j'$ depends on $f$ and $g$, respectively. Then
$$ \langle f, g \rangle_H = \sum_{i=1}^{m} \sum_{j=1}^{m'} a_i b_j\, k(x_i, x_j'). $$
For functions on the boundary of $\operatorname{span}\{ k(x, \cdot) \mid x \in X \}$, the inner product is constructed by the usual limit procedure. The map $\eta \colon x \mapsto k(x, \cdot)$ can be regarded as a function-valued feature map (the so-called canonical feature map). However, each positive definite kernel is guaranteed to also possess a feature map of at most countable dimension:

Theorem 3.2 (Mercer's theorem [32]).
Let $k$ be a positive definite kernel and let $\nu$ be a finite Borel measure with support $X$. Define the integral operator $T_k \colon L^2_\nu \to L^2_\nu$,
$$ (T_k f)(x) = \int_X k(x, x')\, f(x')\, d\nu(x'). \qquad (12) $$
Then there is an orthonormal basis $\{\sqrt{\lambda_i}\, \varphi_i\}$ of $H$ consisting of eigenfunctions $\varphi_i$ of $T_k$, rescaled by the square root of the corresponding nonnegative eigenvalues $\lambda_i$, such that
$$ k(x, x') = \sum_i \lambda_i\, \varphi_i(x)\, \varphi_i(x'). \qquad (13) $$

The above formulation of Mercer's theorem has been taken from [33]. The Mercer features $\eta_i := \sqrt{\lambda_i}\, \varphi_i$ thus fulfill (10) for their corresponding kernel. The use of the same symbol $\eta$ as for the linear feature map from Section 2.4 is no coincidence, as the Mercer features will again serve the purpose of observing certain features of the full system. In what follows, $\eta(x)$ will always refer to the vector (or $\ell^2$ sequence) defined by the Mercer features. If not stated otherwise, $\nu$ will be the standard Lebesgue measure.

Example 3.3. Commonly used kernels include:
1. Linear kernel: $k(x, x') = x^\top x'$. The Mercer features are the coordinate functions, i.e., $\eta(x) = x$.
2. Polynomial kernel of degree $p$: $k(x, x') = (x^\top x' + 1)^p$. It can be shown that the Mercer feature space is spanned by the monomials in $x$ up to degree $p$.

3. Gaussian kernel: $k(x, x') = \exp\!\big( -\frac{\| x - x' \|_2^2}{2\sigma^2} \big)$, where $\sigma > 0$ is called the bandwidth of the kernel. Let $p \in \mathbb{N}$ and let $\mathbf{p} = (p_1, \ldots, p_n)$ with $p_1 + \ldots + p_n = p$ be a multi-index. The Mercer features of $k$ then take the form $\eta_{\mathbf{p}}(x) = e_{p_1}(x_1) \cdots e_{p_n}(x_n)$, where the univariate factors $e_{p_i}$ are built from Hermite polynomials, see [48].

Let $F_k$ denote the density embedding based on the Mercer features of the kernel $k$, i.e.,
$$ F_k(f) = \int_X \eta(x)\, f(x)\, dx, $$
and let $E_k(x) := F_k(p_x^\tau)$. The amount of information about $p_x^\tau$ preserved by the embedding $F_k$ depends on the choice of the kernel $k$. For the first two kernels in Example 3.3, the preserved information has a familiar stochastic interpretation (see, e.g., [33,39,47]):
1. Let $k$ be the linear kernel. Then $F_k(p_{x_1}^t) = F_k(p_{x_2}^t)$ holds if and only if the means of $p_{x_1}^t$ and $p_{x_2}^t$ coincide.
2. Let $k$ be the polynomial kernel of degree $p > 1$. Then $F_k(p_{x_1}^t) = F_k(p_{x_2}^t)$ holds if and only if the first $p$ moments of $p_{x_1}^t$ and $p_{x_2}^t$ coincide.
Remark 3.4. In practice, comparing the first $p$ moments is often enough to sufficiently distinguish the transition densities that constitute the transition manifold. However, densities that differ only in higher moments cannot be distinguished by $F_k$, which means that for the above two kernels, $F_k$ is not injective on M. Therefore, $F_k$ does not belong to the prevalent class of maps at the heart of the Whitney embedding theorem (Theorem 2.7), and we cannot use that theorem to argue that the topology of M is preserved under $F_k$. Instead, in Section 3.2, we will use a different argument to show that the embedding is indeed injective for the Gaussian kernel (and others).
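The kernel trick (10) is easy to verify numerically for a kernel with known finite-dimensional features. For the degree-2 polynomial kernel in one dimension, $(x x' + 1)^2 = x^2 x'^2 + 2 x x' + 1$, so the explicit feature map is $\eta(x) = (x^2, \sqrt{2}\, x, 1)$:

```python
import numpy as np

# Kernel trick check for the 1D polynomial kernel of degree 2:
# k(x, x') = (x x' + 1)^2 = <eta(x), eta(x')> with eta(x) = (x^2, sqrt(2) x, 1).
def k_poly(x, y):
    return (x * y + 1.0) ** 2

def eta(x):
    return np.array([x ** 2, np.sqrt(2.0) * x, 1.0])

xs = np.linspace(-2.0, 2.0, 7)
for x in xs:
    for y in xs:
        assert np.isclose(k_poly(x, y), eta(x) @ eta(y))
print("kernel evaluations match explicit feature inner products")
```

For kernels such as the Gaussian kernel, the feature space is infinite-dimensional, so only the left-hand side of this identity is ever computed in practice.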
Still, by formally using the Mercer dynamical embedding $E_k$ in (9) (abusing notation if there are countably infinitely many such features), and using the kernel trick, we can now reformulate Algorithm 2.1 as a kernel-based method that does not require the explicit computation of any feature vector. This is summarized in Algorithm 3.1.

Algorithm 3.1 (Kernel-based computation of the reaction coordinate).
Input: Test points $\{x_1, \ldots, x_N\}$, lag time $\tau$, burst size $M$, kernel $\mathrm{k}$.
1: for $i = 1, \ldots, N$ do
2:   for $l = 1, \ldots, M$ do
3:     Simulate a trajectory of length $\tau$ starting in $x_i$ with a new random seed. Let the end point be denoted by $y_i^{(l)}$.
4:   end for
5: end for
6: Compute the kernel matrix $K \in \mathbb{R}^{N \times N}$,
$$ K_{ij} = \frac{1}{M^2} \sum_{k,l=1}^{M} \mathrm{k}\big( y_i^{(k)}, y_j^{(l)} \big). $$
7: Compute the distance matrix $D \in \mathbb{R}^{N \times N}$,
$$ D_{ij} = \big( K_{ii} - 2 K_{ij} + K_{jj} \big)^{1/2}. $$
8: Apply a distance-based manifold learning algorithm to the distance matrix $D$. Denote the resulting parametrization of the $i$-th element by $\gamma_i \in \mathbb{R}^r$.
Output: An $r$-dimensional reaction coordinate evaluated at the test points: $\xi(x_i) := \gamma_i$, $i = 1, \ldots, N$.
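A compact sketch of Algorithm 3.1 follows, again using an illustrative one-dimensional double-well system as the simulator (an assumption for the example), a Gaussian kernel, and classical multidimensional scaling as the distance-based manifold learner:

```python
import numpy as np

rng = np.random.default_rng(2)

def burst_endpoints(x0, tau=1.0, M=100, beta=2.0, dt=1e-3):
    """Illustrative 1D double-well Langevin bursts (algorithm steps 1-5)."""
    y = np.full(M, float(x0))
    for _ in range(int(tau / dt)):
        y += -4.0 * y * (y**2 - 1.0) * dt \
             + np.sqrt(2.0 * dt / beta) * rng.standard_normal(M)
    return y

def gauss(a, b, sigma=0.5):
    """Gaussian kernel matrix between two 1D point sets."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma**2))

X_N = np.linspace(-1.2, 1.2, 9)
Y = [burst_endpoints(x) for x in X_N]
N = len(X_N)
# Step 6: K_ij = (1/M^2) sum_{k,l} k(y_i^(k), y_j^(l)).
K = np.array([[gauss(Y[i], Y[j]).mean() for j in range(N)] for i in range(N)])
# Step 7: RKHS distance matrix from the kernel matrix.
D = np.sqrt(np.maximum(np.add.outer(np.diag(K), np.diag(K)) - 2.0 * K, 0.0))
# Step 8: classical multidimensional scaling on D (distance-based learner).
J = np.eye(N) - np.ones((N, N)) / N
B = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(B)
xi = V[:, -1] * np.sqrt(max(w[-1], 0.0))   # 1D reaction coordinate
print(np.round(xi, 2))
```

Note that no feature vector is ever formed: all geometric information enters through the kernel matrix `K`, exactly as the kernel trick promises.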

Condition number of the kernel embedding
We will now investigate to what extent the kernel embedding preserves the topology and geometry of the transition manifold.

Kernel mean embedding
We derived the kernel-based algorithm by considering the embedding $F_k$ of the transition manifold into the image space of the Mercer features in order to highlight the similarity to the Whitney embedding based on randomly drawn features. Of course, the Mercer features never had to be computed explicitly. However, in order to investigate the quality of this embedding procedure, it is advantageous to consider a different, yet equivalent, embedding map: the transition manifold can be directly embedded into the RKHS by means of the kernel mean embedding operator $\mu \colon L^2 \to H$,
$$ \mu(p) := \int_X k(x, \cdot)\, p(x)\, dx, $$
with the empirical estimator
$$ \hat\mu(p) := \frac{1}{M} \sum_{k=1}^{M} k\big( y^{(k)}, \cdot \big), \qquad y^{(k)} \sim p. $$
Note that $\mu(p)$ and $\hat\mu(p)$ are again elements of $H$, and that for $\nu$ in (12) being the Lebesgue measure we obtain $\mu(p) = T_k p$. Further, one sees that
$$ \langle \mu(p), \mu(q) \rangle_H = \langle F_k(p), F_k(q) \rangle, $$
where the inner product $\langle \cdot, \cdot \rangle$ refers to the Euclidean inner product or the inner product in $\ell^2(\mathbb{N}_0)$, depending on whether $F_k(p)$ is finite- or countably infinite-dimensional. Thus, for investigating whether the embedding $F_k$ preserves distances or inner products between densities, we can equivalently investigate the embedding $\mu$. This is advantageous, as injectivity and isometry properties of the kernel mean embedding are well-studied.

Injectivity of the kernel mean embedding
A first important result is that k can be chosen such that µ is injective. Such kernels are called characteristic [21]. In [47], several conditions for characteristic kernels are listed, including condition (15), which is known as the Mercer condition and is, for example, fulfilled by the Gaussian kernel from Example 3.3. The Mercer features of such a kernel are particularly rich; for more details, see, e.g., [41,48]. It is easy to see that for kernels fulfilling (15), µ as a map from L² to H is Lipschitz continuous:

Lemma 3.8. Let k fulfill (15) with largest Mercer eigenvalue λ_0. Then µ : L² → H is Lipschitz continuous with constant √λ_0.

Proof. As µ is linear, it suffices to show that ‖µ(f)‖_H ≤ √λ_0 ‖f‖_2 for all f ∈ L². We obtain
‖µ(f)‖²_H = ⟨µ(f), µ(f)⟩_H = ∫∫ k(y, z) f(y) f(z) dy dz,
where (11) was used in the last line. By expanding k into its Mercer features via (13), this becomes
‖µ(f)‖²_H = Σ_i λ_i ⟨f, φ_i⟩²_2 ≤ λ_0 Σ_i ⟨f, φ_i⟩²_2.
By Theorem 3.7, the φ_i form an orthonormal basis of L², and thus Σ_i ⟨f, φ_i⟩²_2 = ‖f‖²_2.

Thus, if the kernel is characteristic, the structure of the TM and the fuzzy TM are qualitatively preserved under the embedding.

Corollary 3.9. Let k be a characteristic kernel and let X_t be (ε, r)-reducible. Then µ(M) is an r-dimensional manifold in H, and for all x ∈ X it holds that
min_{f ∈ µ(M)} ‖µ(p̃^τ_x) − f‖_H ≤ √λ_0 √ρ_∞ ε.

Proof. By Lemma 3.8, the map µ : L² → H is continuous (and furthermore injective), and thus µ(M) is again an r-dimensional manifold. For x ∈ X, consider now any p ∈ M with ‖p̃^τ_x − p‖_{L²_{1/ρ}} ≤ ε. Then ‖µ(p̃^τ_x) − µ(p)‖_H ≤ √λ_0 ‖p̃^τ_x − p‖_2 ≤ √λ_0 √ρ_∞ ‖p̃^τ_x − p‖_{L²_{1/ρ}} ≤ √λ_0 √ρ_∞ ε.

Remark 3.10. This result should be seen as an analogue to Proposition 2.8 for the Whitney-based TM embedding. In short, for characteristic kernels, the injectivity and continuity of µ guarantee that the image of M under µ is again an r-dimensional manifold in H, and Corollary 3.9 guarantees that the embedded fuzzy transition manifold µ(M̃) still clusters closely around µ(M) (if √λ_0 and √ρ_∞ in Corollary 3.9 are small). This again guarantees a minimal degree of well-posedness of the problem.

Distortion under the kernel mean embedding
Unlike the Whitney embedding, the kernel embedding now allows us to derive conditions under which the distortion of M is bounded. We have to show that the L²_{1/ρ}-distance between points on M is not overly decreased or increased by the kernel mean embedding. To formalize this, we consider measures for the manifold's internal distortion, following the notions of metric embedding theory [1]. We call the embedding well-conditioned if both the contraction
sup_{p,q ∈ M, p ≠ q} ‖p − q‖_{L²_{1/ρ}} / ‖µ(p) − µ(q)‖_H
and the expansion
sup_{p,q ∈ M, p ≠ q} ‖µ(p) − µ(q)‖_H / ‖p − q‖_{L²_{1/ρ}}
are small (close to one). Here, µ denotes the embedding corresponding to a characteristic kernel.
Expansion bound. Due to the Lipschitz continuity of µ (see Lemma 3.8), we immediately obtain
‖µ(p) − µ(q)‖_H ≤ √λ_0 ‖p − q‖_2 ≤ √λ_0 √ρ_∞ ‖p − q‖_{L²_{1/ρ}},     (18)
thus bounding the expansion.
Contraction bound: regularity requirement. Unfortunately, even for characteristic kernels it is not possible to derive a bound for the contraction that holds uniformly for all p, q ∈ L²_{1/ρ}, as the following proposition shows. Nevertheless, we will be able to give reasonable bounds under certain regularity and dynamical assumptions, (20) and (23), respectively.

Proposition 3.11 (Unbounded inverse embedding). Assume the kernel embedding operator µ has absolutely bounded orthonormal eigenfunctions φ_i with corresponding nonnegative eigenvalues λ_i (arranged in nonincreasing order), and assume lim_{i→∞} λ_i = 0. Then for any arbitrarily small ε > 0 there exist functions p, q ∈ L²_{1/ρ} such that ‖µ(p) − µ(q)‖_H ≤ ε ‖p − q‖_{L²_{1/ρ}}.
Proof. See Appendix A.
The assumptions of Proposition 3.11 are fulfilled, for example, for the Gaussian kernel. A similar, but non-quantitative result has been derived in [47, Theorem 19]. The idea behind its proof and the proof of Proposition 3.11 is that, if p and q vary only in higher eigenfunctions φ_i of the embedding operator µ (see also Theorem 3.2), the H-distance can become arbitrarily small. If, however, we can reasonably restrict our considerations to the subclass of functions whose variation in the higher φ_i is small compared to the variation in the lower φ_i, a favorable bound can be derived. Let the expansion of h = p − q in the Mercer eigenfunctions be given by
h = Σ_i h̃_i φ_i
with the coefficient sequence (h̃_0, h̃_1, ...) ∈ ℓ². Now, for any i_max ∈ N such that there exists an index i ≤ i_max with h̃_i ≠ 0, define the factor
c(h, i_max) := (Σ_{i > i_max} h̃_i²) / (Σ_{i ≤ i_max} h̃_i²).     (19)
This factor bounds the contribution of the higher Mercer eigenfunctions to h by the contribution of the lower ones, hence it is a regularity bound. Thus, for an individual h, we can bound the distortion of the L²-norm under µ with the help of c(h, i_max).
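A toy spectral model makes the role of the regularity factor concrete. The eigenvalue decay and the coefficient sequence below are arbitrary illustrative choices; the final inequality is the Lemma 3.12-type contraction bound checked numerically:

```python
import numpy as np

# Toy Mercer model: kernel eigenvalues lam_i and coefficients h_i of h = sum_i h_i phi_i,
# with phi_i orthonormal in L^2 (all values here are arbitrary illustrative choices).
lam = 0.5 ** np.arange(20)             # decaying kernel eigenvalues, lam_0 largest
h = np.full(20, 1e-3)                  # small high-mode content ...
h[:5] = [1.0, -0.5, 0.3, 0.1, 0.05]    # ... dominated by the low modes

i_max = 4
# regularity factor: high-mode mass relative to low-mode mass, cf. (19)
c = (h[i_max + 1:] ** 2).sum() / (h[:i_max + 1] ** 2).sum()

norm_L2_sq = (h ** 2).sum()            # ||h||_2^2
norm_H_sq = (lam * h ** 2).sum()       # ||mu(h)||_H^2 = sum_i lam_i h_i^2

# contraction bound in the spirit of Lemma 3.12:
# ||h||_2^2 <= (1 + c)/lam_{i_max} * ||mu(h)||_H^2
assert norm_L2_sq <= (1.0 + c) / lam[i_max] * norm_H_sq
```

The bound degrades either when c is large (much high-mode content) or when λ_{i_max} is small, which mirrors the discussion above.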
Lemma 3.12. Let h ∈ L², i_max ∈ N, and c(h, i_max) be defined as in (19). Then
‖h‖²_2 ≤ (1 + c(h, i_max)) / λ_{i_max} · ‖µ(h)‖²_H.
Proof. See Appendix A.
From now on we make the assumption that for every index i_max there exists a constant c*_{i_max} > 0 such that
c(p^τ_{x_1} − p^τ_{x_2}, i_max) ≤ c*_{i_max}     (20)
for all p^τ_{x_1}, p^τ_{x_2} ∈ M. The existence and form of this constant strongly depends on the shape of the Mercer eigenfunctions, hence on the kernel. However, we motivate the existence of such a global constant by the observation that higher Mercer eigenfunctions typically consist of high Fourier modes, and that these modes decay quickly under the dynamics. Therefore, high Mercer eigenfunctions should have a negligible share of the p^τ_x and of the differences p^τ_{x_1} − p^τ_{x_2}.

Contraction bound: dynamical requirement. It remains to relate the L²-norm in Lemma 3.12 to the L²_{1/ρ}-norm. A trivial estimate would use ‖h‖_{L²_{1/ρ}} ≤ ‖1/√ρ‖_∞ ‖h‖_2. While due to ergodicity 1/√ρ is indeed defined on all of X, it becomes large in regions of small invariant measure ρ, i.e., "dynamically irrelevant" regions. This would lead to a very large upper bound for the contraction. For general h, a more favorable estimate is indeed difficult to obtain. For us, however, h = p^τ_{x_1} − p^τ_{x_2}, and we can utilize the fact that these "dynamically irrelevant" regions are almost never visited by the system.
To formalize this, we require one additional assumption that can be justified by the metastability of the system. One defining characteristic of metastable systems is the phenomenon that essentially any trajectory moves nearly instantaneously into one of the metastable sets before continuing. With A_1, ..., A_d ⊂ X denoting these sets, we can thus assume that the probability density p^τ_x(y) to move from x to y in time τ depends almost only on the probabilities to (instantaneously) move from x to the sets A_i (denoted by c_i(x), with Σ_{i=1}^d c_i(x) ≤ 1) and the probabilities to then move from A_i to y in time τ (denoted by p^τ_{A_i}(y)), i.e.,
p^τ_x(y) ≈ Σ_{i=1}^d c_i(x) p^τ_{A_i}(y).
To be more precise, we require that this relation holds for all x, y ∈ X up to a small relative error, which is equivalent to
p^τ_x(y) = (1 + δ(x, y)) Σ_{i=1}^d c_i(x) p^τ_{A_i}(y)     (23)
for some positive function δ : X × X → R with ‖δ‖_∞ ≤ δ* ≪ 1. The positivity of δ comes from the fact that there is a minuscule, but positive probability (density) to move from x to y without first equilibrating inside a metastable set, thus p^τ_x(y) ≥ Σ_i c_i(x) p^τ_{A_i}(y). With this, we can bound the invariant density ρ from below as follows:
ρ(y) ≥ Σ_{i=1}^d b_i p^τ_{A_i}(y),  with b_i := ∫_X c_i(x) ρ(x) dx.     (24)
The b_i can be seen as the equilibrium probability mass almost instantaneously attracted to A_i; for every important metastable set this will not be too small. As a first step, (24) allows us to bound the L²_{1/ρ}-norm of the p^τ_x by their L¹-norm:

Lemma 3.13. Let assumption (23) hold and let b_i, i = 1, ..., d, be defined as in (24). Then for any x ∈ X it holds that
‖p^τ_x‖²_{L²_{1/ρ}} ≤ (1 + δ*) / (min_i b_i) · ‖p^τ_x‖_{L¹}.
Proof. See Appendix A.
This shows that indeed p^τ_x ∈ L²_{1/ρ}, as required by Definition 2.3. Further, Hölder's inequality gives
‖p^τ_x‖_{L¹} ≤ |X|^{1/2} ‖p^τ_x‖_{L²},
where |X| denotes the Lebesgue measure of the state space. This now also allows us to bound the L²_{1/ρ}-norm of the p^τ_x by their L²-norm:

Lemma 3.14. Let assumption (23) hold and let b_i, i = 1, ..., d, be defined as in (24). Then for any x_1, x_2 ∈ X it holds that
‖p^τ_{x_1} − p^τ_{x_2}‖²_{L²_{1/ρ}} ≤ 2(1 + δ*) |X|^{1/2} / (min_i b_i) · ‖p^τ_{x_1} − p^τ_{x_2}‖_{L²}.
Proof. See Appendix A.
Of course, due to the squared norm on the left-hand side, this is not a Lipschitz bound. However, recall that our main motivation for deriving a bound for the contraction is to show that large distances in L²_{1/ρ} are not overly compressed under the embedding into H, as illustrated in Figure 2. We therefore abstain from deriving such a bound for very small distances in L²_{1/ρ} and only estimate the contraction of pairs of densities p^τ_{x_1}, p^τ_{x_2} with
‖p^τ_{x_1} − p^τ_{x_2}‖_{L²_{1/ρ}} ≥ C
for some constant C > 0. That C is reasonably large is discussed in Remark 3.15 below. For such differences, we can then relate the L²- to the L²_{1/ρ}-norm, i.e.,
‖p^τ_{x_1} − p^τ_{x_2}‖_{L²} ≥ C (min_i b_i) / (2(1 + δ*)|X|^{1/2}) · ‖p^τ_{x_1} − p^τ_{x_2}‖_{L²_{1/ρ}}.
Together with Lemma 3.12 and assumption (20), this gives
‖p^τ_{x_1} − p^τ_{x_2}‖_{L²_{1/ρ}} ≤ 2(1 + δ*)|X|^{1/2} / (C min_i b_i) · ((1 + c*_{i_max}) / λ_{i_max})^{1/2} ‖µ(p^τ_{x_1}) − µ(p^τ_{x_2})‖_H,     (28)
which is our contraction bound.
Remark 3.15. For the distortion of the transition manifold under the embedding, the essential property is that the global "spanning structure" of the manifold is well preserved. In other words, the embedded p^τ_x, p^τ_y should be well separated for x, y ∈ X from different metastable sets A_i. Since the embedding is continuous, the transition paths connecting them will be preserved as well.
As p^τ_{A_i} → ρ as τ → ∞ for every i, it is important that τ is not too large, such that the transition manifold remains a meaningful object. In other words, τ should be such that p^τ_{A_i} and p^τ_{A_j} are sufficiently distinct for i ≠ j. Thus, we require that ‖p^τ_{A_i} − p^τ_{A_j}‖_{L²_{1/ρ}} is sufficiently large for i ≠ j, hence C can be chosen as a constant such that 1/C is reasonably small.

Remark 3.16. The bounds (18) and (28) guarantee the well-posedness of the overall embedding and parametrization problem, as the relevant expansion and contraction of the transition manifold cannot become arbitrarily large. We will support this statement with numerical evidence (see Section 4.2) showing that the distortion is, in practice, indeed small. It should be noted, however, that we do not expect the analytically derived bounds to perform well as quantitative error estimates, as many of the estimates that led to them are rather rough.

Reaction coordinate of the Müller-Brown potential
As a first illustrative example, we compute the reaction coordinate of the two-dimensional Müller-Brown potential [34] via the new kernel-based Algorithm 3.1. The potential energy surface (see Figure 3 (a)) possesses three local minima, where the two bottom minima are separated only by a rather shallow energy barrier. Correspondingly, the system's characteristic long-term behavior is determined by the rare transitions between the minima. These transitions happen predominantly along the potential's minimum energy pathway (MEP), which is shown as a white dashed line and was computed using the zero-temperature string method [15,16]. For two sets A, B ⊂ X and a starting point x ∈ X, the committor function q_AB(x) is defined as the probability that the process hits A before hitting B, provided it started in x at time zero; for a precise definition see [42]. For the Müller-Brown potential, the committor function associated with the top left and bottom right energy minima, shown in Figure 3 (b), can be considered an optimal reaction coordinate [18]. Therefore, we use the (qualitative) comparison with the committor function as a benchmark for our reaction coordinate. Note that the computation of the committor function requires global knowledge of the metastable sets and is often not a practical option for the identification of reaction coordinates.
The governing dynamics is given by an overdamped Langevin equation (1), which we solve numerically using the Euler-Maruyama scheme. At inverse temperature β = 0.05, the lag time τ = 0.03 falls between the slow and fast time scales and is thus chosen as the intermediate lag time. For the embedding, the Gaussian kernel (29) with bandwidth σ = 0.1 is used, and for the manifold learning task in Algorithm 3.1, the diffusion maps algorithm with bandwidth parameter 0.1. The reaction coordinate ξ at the test points is shown in Figure 3 (c). We observe a remarkable resemblance to the committor function.
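The sampling step for this experiment can be sketched as follows, using the standard Müller-Brown parameters and an Euler-Maruyama discretization of the overdamped Langevin equation. The step size dt and all function names are our own choices, not settings reported in the paper:

```python
import numpy as np

# Standard Mueller-Brown parameters
A = np.array([-200.0, -100.0, -170.0, 15.0])
a = np.array([-1.0, -1.0, -6.5, 0.7])
b = np.array([0.0, 0.0, 11.0, 0.6])
c = np.array([-10.0, -10.0, -6.5, 0.7])
X0 = np.array([1.0, 0.0, -0.5, -1.0])
Y0 = np.array([0.0, 0.5, 1.5, 1.0])

def grad_V(p):
    # gradient of the Mueller-Brown potential at p = (x, y)
    dx, dy = p[0] - X0, p[1] - Y0
    E = A * np.exp(a * dx ** 2 + b * dx * dy + c * dy ** 2)
    return np.array([(E * (2 * a * dx + b * dy)).sum(),
                     (E * (b * dx + 2 * c * dy)).sum()])

def endpoints(x, tau, M, beta=0.05, dt=1e-5, seed=0):
    """Euler-Maruyama for dX_t = -grad V(X_t) dt + sqrt(2/beta) dW_t:
    simulate M trajectories of length tau from x and return the M endpoints,
    i.e., a sampling of the transition density p^tau_x."""
    rng = np.random.default_rng(seed)
    Y = np.tile(np.asarray(x, dtype=float), (M, 1))
    for _ in range(int(round(tau / dt))):
        Y -= np.array([grad_V(y) for y in Y]) * dt
        Y += np.sqrt(2.0 * dt / beta) * rng.normal(size=Y.shape)
    return Y
```

Repeating this for every test point x_i yields the point clouds that enter the kernel distance matrix of Algorithm 3.1.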
Remark 4.1. The kernel evaluations used for the RKHS embedding of densities and the kernel evaluations used in the diffusion maps algorithm should not be mixed up, as they serve different purposes. The former embed the state space densities into H, while the latter approximate the Laplace-Beltrami operator on the manifold in H that is to be learned (this is the principle on which the diffusion maps algorithm is based). Even though the Gaussian kernel is a popular choice due to its favorable characteristics, one has great freedom in choosing a kernel for the RKHS embedding, whereas in the classical diffusion maps algorithm, predominantly the Gaussian kernel is used; the repeated use of the Gaussian kernel therefore does not constitute a connection. Moreover, the fact that identical bandwidth parameters were used in this example is mere coincidence. We do not see a way to unify these kernel evaluations, neither on a conceptual nor on an algorithmic level.

Distortion under the Whitney and kernel embeddings
We now demonstrate the distortion of the transition manifold under the embedding via the conventional Algorithm 2.1 (Whitney embedding) and our new Algorithm 3.1 (kernel-based embedding). To this end, we consider the two-dimensional potential depicted in Figure 4 (a). Good reaction coordinates of a diffusion process in this potential should parametrize the "horseshoe-like" MEP shown as a white dashed line. Such a reaction coordinate is depicted in Figure 4.

Whitney embedding
For the Whitney embedding, the expected manifold dimension r = 1 is assumed to be known in advance. To demonstrate the different effects of "good" and "bad" embedding functions, two (2r + 1)-dimensional linear observables η : R² → R³ were chosen, and the corresponding embedding functions F_g, F_b constructed via (8). The coefficients A_g of the "good" observable η_g were chosen randomly via the Matlab command rng(1); A_g=rand(3,2)-0.5. Under the embedding F_g, all parts of the transition manifold can indeed be distinguished very well, see Figure 5 (a). This is due to the fact that two distinct points of the transition pathway of the potential, for example on the two opposite "branches", are never mapped to the same point under η_g.
On the other hand, the matrix A_b of the "bad" observable function η_b was intentionally constructed to consist of three row vectors that are pairwise almost linearly dependent and that essentially ignore the x_1-component of state space points. This way, points on the two opposite branches of the transition pathway with the same x_2-coordinate are mapped to almost the same point in R³. The result is an embedding of the transition manifold in which the two branches can hardly be distinguished, see Figure 5 (b). This makes the numerical identification of the manifold structure extremely difficult. Note that the quality of an embedding function can only be judged manually after the embedding, as it is impossible to reliably choose good embedding functions without detailed a priori knowledge of the global structure of the transition manifold or transition pathway. While in our experience randomly chosen coefficients typically result in "good-enough" embedding functions, this uncertainty in the numerical algorithm should be seen as one of the main reasons to use the more consistent kernel embeddings instead.

Kernel embedding
For the kernel embedding, we again utilize the Gaussian kernel (29), now with bandwidth σ = 10⁻³. The result of the kernel embedding of the test points is an approximation of the kernel distance matrix D (see Algorithm 3.1), which cannot be visualized directly. We thus apply the multidimensional scaling (MDS) algorithm to D in order to visualize the level of similarity between the embedded densities. Given a distance matrix D, MDS generates points z_i ∈ R^k in a Euclidean space of a chosen dimension k ∈ N such that the pairwise distances between the z_i correspond to the distances in D. For an overview of different MDS methods, see for example [49]. We here use the implementation of classical MDS given by the cmdscale method in Matlab.
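Classical MDS (the cmdscale functionality) amounts to double centering of the squared distance matrix followed by an eigendecomposition; a minimal NumPy version:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: find points z_i in R^k whose pairwise Euclidean distances
    approximate the distances in the given matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the centered configuration
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # take the k largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

If D stems from an exact Euclidean configuration of dimension at most k, the configuration is recovered exactly up to rigid motion; otherwise the output is the best rank-k approximation in the classical MDS sense.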
The MDS representation of the kernel distance matrix for k = 2 is shown in Figure 6 (a). The curved structure of the MEP is immediately visible. Moreover, it is also possible to visualize the corresponding L²_{1/ρ}- and L²-distance matrices via MDS, i.e., the matrices with entries ‖p^τ_{x_i} − p^τ_{x_j}‖_{L²_{1/ρ}} and ‖p^τ_{x_i} − p^τ_{x_j}‖_{L²}, respectively. The results are shown in Figure 6. For large enough N, we expect the empirical distortion D_N(µ), i.e., the product of the empirical contraction and expansion over the N test points, to be a good estimator for the true distortion D(µ). The blue graph in Figure 7 shows the dependence of the empirical distortion on the kernel parameter σ. Here the minimum is D_N(µ) = 36.7 at σ ≈ 10⁻³. Interpreting D_N(µ) as the condition number of the kernel-based embedding problem, the problem can be described as reasonably well-conditioned.
Analogously, we can define the empirical maximum distortion of the Whitney embedding as D_N(F) := C_N(F)E_N(F), where C_N(F) and E_N(F) denote the empirical contraction and expansion over the N test points. For a given embedding F, this distortion can again be computed numerically. For the "good" embedding F_g, we obtain D_N(F_g) ≈ 6·10², while for the "bad" embedding F_b, we obtain D_N(F_b) ≈ 7·10³. The kernel embedding is therefore much better conditioned than both Whitney embeddings.
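The empirical contraction, expansion, and distortion can be computed directly from two pairwise distance matrices of the test points, one before and one after the embedding; a sketch (function name ours):

```python
import numpy as np

def empirical_distortion(d_orig, d_emb):
    """d_orig, d_emb: pairwise distance matrices of the test points before and
    after an embedding. Returns the empirical contraction C_N, expansion E_N,
    and distortion D_N = C_N * E_N, with suprema over all pairs of distinct points."""
    iu = np.triu_indices_from(d_orig, k=1)
    r = d_emb[iu] / d_orig[iu]     # per-pair distance ratios
    C = (1.0 / r).max()            # worst-case shrinkage of a distance
    E = r.max()                    # worst-case stretch of a distance
    return C, E, C * E
```

Note that D_N is invariant under a uniform rescaling of the embedded distances, so it measures genuine geometric distortion rather than overall scale.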
Remark 4.2. Analogously, we can also define and compute the maximum distortion D_N(µ) with respect to the L²-metric (red graph in Figure 7). Here, for σ ≈ 10⁻¹ we obtain D_N(µ) = 2.7, i.e., the embedding becomes nearly isometric. This is not surprising, as it has been shown in [47] that for radial kernels k_σ(x, y) = σ^{-d} g(σ^{-1}‖x − y‖), where g is bounded, continuous, and positive definite, the induced RKHS distance converges to the L²-distance as σ → 0. The Gaussian kernel belongs to this class of kernels. We thus expect that by increasing the sample number M of the transition densities and further decreasing σ, the distortion can be reduced further. However, recall that for our application, only the distortion of the L²_{1/ρ}-distance is relevant.

Alanine dipeptide
We now demonstrate the applicability of Algorithm 3.1 to realistic, high-dimensional molecular systems by computing reaction coordinates of the Alanine dipeptide. The peptide, depicted in Figure 8 (a), consists of 22 atoms, the state space X thus has the dimension n = 66.
It is well-known that the essential long-term behavior of this system is governed by the metastable transitions between four local minima of the potential energy surface (PES) [13,44]. These minima are clearly visible when projecting the PES onto two specific backbone dihedral angles (ϕ, ψ) that we call essential from now on (see Figure 8 (b)). The transition between the metastable states happens along minimal energy pathways that we aim to reveal with our reaction coordinate. Note however that no information about the existence of the two essential dihedral angles was used in our experiments, and we perform all of the analysis on the full 66-dimensional data.
The simulations were performed using the Gromacs molecular dynamics software [3]. We consider the molecule in explicit aqueous solution at temperature 400 K (the water molecules are discarded prior to further analysis). To generate the test points x i , N = 1000 snapshots from a long, equilibrated trajectory were subsampled. This guarantees that the x i cover the dynamically relevant regions of X, i.e., the metastable sets and transition pathways. The values of the dihedral angles ϕ and ψ of the test points are shown in Figure 10 (the x-and y-coordinates of the points). We see that the metastable sets and transition pathways from Figure 8 (b) are adequately covered. Note however that the projection onto the (ϕ, ψ)-space here serves only illustrative purposes; we continue to work with the test points in the full 66-dimensional space.
The intermediate lag time τ = 20 ps falls between the slow and fast time scales; for strategies to estimate τ prior to simulation, see [5]. For each test point x_i, M = 100 simulations of length τ were performed, which took 40 h on a 96-core cluster. The resulting point clouds {y_i^(l), l = 1, ..., M} are samplings of the densities p^τ_{x_i}. To compute the kernel distance matrix D from the simulation data, the Gaussian kernel (29) with bandwidth σ = 0.1 was chosen. For the plug-in manifold learning algorithm that is applied to D, the diffusion maps algorithm with bandwidth parameter 0.01 was used. The analysis of the simulation data was performed in Matlab and took 4 minutes on a 4-core laptop. Figure 9 (a) shows the leading spectrum of the diffusion map matrix that was computed based on D. The first diffusion map eigenvalue is always one, and the associated eigenvector carries no structural information. Therefore, a spectral gap after the third sub-dominant eigenvalue indicates that the underlying transition manifold is intrinsically three-dimensional. The associated three subdominant eigenvectors now form the final reaction coordinate. For each of the 1000 test points, the values of the three eigenvectors are shown in Figure 9 (b). This can be seen as the embedding of the test points into the reaction coordinate space. Here we observe four clusters of points and three connecting paths. In Figure 10, the values of the dihedral angles ϕ and ψ at the test points are compared to the values of the three components of the computed reaction coordinate, shown in color. We see that areas of almost constant color correspond to the four metastable sets from Figure 8. Thus, the computed reaction coordinate is able to identify the metastable sets and resolve the transitions between them.
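The diffusion maps post-processing applied to D can be sketched as follows. This is a basic textbook variant with the usual density normalization, not necessarily the exact implementation used for Figure 9; a spectral gap in the returned eigenvalues indicates the intrinsic dimension:

```python
import numpy as np

def diffusion_maps(D, eps, r):
    """Basic diffusion maps on a precomputed distance matrix D: returns the sorted
    eigenvalues and the r subdominant eigenvectors, used as coordinates."""
    W = np.exp(-D ** 2 / eps)                # Gaussian affinities
    q = W.sum(axis=1)
    W = W / np.outer(q, q)                   # density normalization (alpha = 1)
    P = W / W.sum(axis=1, keepdims=True)     # row-stochastic diffusion matrix
    w, V = np.linalg.eig(P)
    idx = np.argsort(-w.real)
    w, V = w.real[idx], V.real[:, idx]
    return w, V[:, 1:r + 1]                  # drop the constant leading eigenvector
```

On data with d well-separated clusters, the leading d eigenvalues are close to one, followed by a gap; the subdominant eigenvectors then separate the clusters.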

Conclusion and future work
In this work, we have analyzed the embedding of manifolds that lie in certain function spaces into reproducing kernel Hilbert spaces. Moreover, we have proposed efficient numerical algorithms for learning parametrizations of these embedded manifolds from data. The question is motivated by the recent insight that parametrizations of the so-called transition manifold, a manifold consisting of the transition density functions of a stochastic system, are strongly linked to reduced coordinates for that system. The method can thus be used for coarse graining a given system.
Compared to previous approaches based on random embeddings into a Euclidean space, the new kernel-based approach eliminates the need to know the transition manifold dimension a priori. Furthermore, if a universal kernel is used, the topological structure of the transition manifold is guaranteed to be preserved under the embedding. We have derived bounds for the geometric distortion of the transition manifold under the RKHS embedding, which can be interpreted as the condition of the overall coarse graining procedure. Correspondingly, the numerical algorithm was demonstrated to be very robust, especially when compared to random embeddings, and, in realistic applications, we obtained very favorable results regarding algorithmic distortion bounds.
There are several new avenues to use the broader theory of kernel embeddings to characterize the kernel embedding of transition manifolds. First, we plan to improve the theoretic distortion bounds derived in Section 3.2 by considering different established interpretations of the metric defined by d(p, q) = µ(p) − µ(q) H . For an overview, see [47].
Recently, the spectral theory of transfer operators was extended to reproducing kernel Hilbert spaces in [27]. The usefulness of this new theory for the data-driven conformation analysis of molecular systems was demonstrated in [25]. As the transition manifold can be defined via the transfer operator, it seems natural to attempt to relate the embedded transition manifold to the kernel transfer operators and corresponding embedded transfer operators defined in [27].
Finally, as illustrated in [7,8,9], RKHSs can act as linearizing spaces in the sense that performing linear analysis in the RKHS can capture strong nonlinearities in the original system. A typical example is the problem of linear separability in data classification: A data set which is not linearly separable might be easily separated when mapped into a nonlinear feature space. In our current context, this means that efficient linear manifold learning methods might be suitable to parametrize the embedded manifold, if the kernel is chosen appropriately. We will investigate whether a corresponding theory can be developed.