Optimal state space reconstruction via Monte Carlo decision tree search

A novel idea for optimal time delay state space reconstruction from uni- and multivariate time series is presented. The entire embedding process is considered as a game, in which each move corresponds to an embedding cycle and is subject to evaluation through an objective function. This way, the embedding procedure can be modeled as a tree, in which each leaf holds a specific value of the objective function. Using a Monte Carlo ansatz, the proposed algorithm populates the tree with many leaves by computing different possible embedding paths, and the final embedding is chosen as the path that ends at the leaf with the lowest achieved value of the objective function. The method aims to prevent getting stuck in a local minimum of the objective function and can be used in a modular way, enabling practitioners to choose a statistic for possible delays in each embedding cycle as well as a suitable objective function themselves. The proposed method guarantees the optimization of the chosen objective function over the parameter space of the delay embedding, as long as the tree is sampled sufficiently. As a proof of concept, we demonstrate the superiority of the proposed method over classical time delay embedding methods using a variety of application examples. We compare recurrence plot-based statistics inferred from reconstructions of a Lorenz-96 system and highlight an improved forecast accuracy for map-like model data as well as for palaeoclimate isotope time series. Finally, we utilize state space reconstruction for the detection of causality and its strength between observables of a gas-turbine-type thermoacoustic combustor.

Keywords State space reconstruction · Embedding · Optimization · Time series analysis · Causality · Prediction · Recurrence analysis

Introduction
The famous embedding theorems of Whitney [1], Mañé [2], and Takens [3], together with their enhancement by Sauer et al. [4], allow a high-dimensional state space reconstruction from (observed) uni- or multivariate time series. Computing dynamical invariants [5–9] of the observed system, making meaningful predictions even for chaotic or stochastic systems [10–16], detecting causal interactions [17–19], and nonlinear noise reduction algorithms [20,21] all rely explicitly or implicitly on (time delay) embedding [22] the data into a reconstructed state space. Approaches other than time delay embedding (TDE) are also possible [22–27], but due to its simple use and its proficient outcomes in a range of situations, TDE is by far the most common reconstruction technique. Suppose there is a multivariate dataset consisting of M time series s_i(t), i = 1, …, M. The basic idea is to use lagged values of the available time series as components of the reconstruction vector

v(t) = ( s_{i_1}(t − τ_1), s_{i_2}(t − τ_2), …, s_{i_m}(t − τ_m) ). (1)

Here, the delays τ_j are multiples of the sampling time Δt, and the indices i_1, i_2, …, i_m each denote the time series index i ∈ [1, …, M] which has been chosen in the 1st, 2nd, …, mth embedding cycle. The total number of delays τ_j, j = [1, …, m], i.e., the embedding dimension m, their values, and the corresponding time series s_{i_j}, i_j ∈ [1, …, M], need to fulfill certain criteria to guarantee equivalence to the unknown true attractor; e.g., the embedding dimension must satisfy m ≥ 2D_B + 1, with D_B being the unknown box-counting dimension (see Casdagli et al. [28], Gibson et al. [24], Uzal et al. [29], or Nichkawde [30] for a profound overview of the problem).
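To make Eq. (1) concrete, the construction of the reconstruction vectors can be sketched in a few lines of Python (the helper name `delay_embed` and the example data are our own illustration, not part of the original method; we use forward lags, which is equivalent up to a time shift):

```python
import numpy as np

def delay_embed(series, delays):
    """Build reconstruction vectors v(t) from lagged copies of one or more
    time series.  `series` is a list of 1-D arrays s_i(t); `delays` is a
    list of (series_index, tau) pairs, one pair per embedding cycle."""
    max_tau = max(tau for _, tau in delays)
    n = min(len(s) for s in series) - max_tau
    # column j holds the lagged values s_{i_j}(t + tau_j)
    return np.column_stack([series[i][tau:tau + n] for i, tau in delays])

# univariate example: classical embedding with m = 3 and uniform delay 2
s = np.sin(0.1 * np.arange(100))
v = delay_embed([s], [(0, 0), (0, 2), (0, 4)])
```

Each row of `v` is one reconstruction vector; the multivariate case is covered by passing several series and mixed `(i, tau)` pairs.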
Picking optimal embedding parameters τ_j and m comes down to making the resulting components of the reconstruction vectors v(t) as independent as possible [4,22], but at the same time not too independent, in order to keep sufficient information about the correlation structure of the data [28,29,31–33]. Besides some unified approaches [34–44], which tackle the estimation of the delays τ_j and the embedding dimension m simultaneously, most researchers use two different methods to perform the reconstruction.
(1) A statistic determines the delays τ_j; we call it Λ_τ throughout this paper. Usually, τ_1 = 0, i.e., the first component of v(t) is the unlagged time series s_{i_1} in Eq. (1). For embedding a univariate time series, s_{i_1} = … = s_{i_m} = s(t), the most common approach is to choose τ_2 as the first minimum of the auto-mutual information [45,46]. All consecutive delays are then simply integer multiples of τ_2. Other ideas based on different statistics, like the auto-correlation function of the time series, have been suggested [23,32,33,40,47–50]. However, by setting τ_j, j > 2, to multiples of τ_2, one ignores the fact that this "measure" of independence strictly holds only for the first two components of the reconstruction vectors (m = 2) [51,52], even though in practice it works fine in most cases. More sophisticated ideas, like high-dimensional conditional mutual information [53,54] and other statistics [54–59], some of which include nonuniform delays and the extension to multivariate input data [30,38,39,53,54,60–65], have been presented. (2) A statistic, which we call Γ throughout this paper, serves as an objective function and quantifies the goodness of a reconstruction, given that delays τ_j have been estimated. The embedding process is thought of as an iterative process, starting with an unlagged (given) time series s_{i_1}, i.e., τ_1 = 0. In each embedding cycle D_d, d = 1, …, m, a time series s_{i_d} lagged by τ_d gets appended to obtain the actual reconstruction vectors v_d(t) ∈ R^{d+1}, and these are compared to the reconstruction vectors v_{d−1}(t) of the former embedding cycle (if d = 1, v_{d−1}(t) is simply the time series s_{i_1}). This "comparison" is usually achieved by the amount of false nearest neighbors (FNN) [66–70], some other neighborhood-preserving idea [71,72], or more ambitious ideas [29,30].
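As an illustration of the common delay choice in item (1), a minimal histogram-based estimate of the auto-mutual information and its first local minimum might look as follows (the function names, bin count, and noisy test signal are our own assumptions, not the estimators used in the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)

def auto_mutual_information(s, tau, bins=16):
    """Histogram estimate of the mutual information between s(t) and s(t+tau)."""
    x, y = s[:-tau], s[tau:]
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint probabilities
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals
    nz = pxy > 0                               # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))

def first_minimum_delay(s, max_tau=50):
    """Return the delay at the first local minimum of the auto-mutual
    information, falling back to the global minimum if none is found."""
    mi = [auto_mutual_information(s, tau) for tau in range(1, max_tau + 1)]
    for k in range(1, len(mi) - 1):
        if mi[k] < mi[k - 1] and mi[k] <= mi[k + 1]:
            return k + 1   # mi[0] corresponds to tau = 1
    return int(np.argmin(mi)) + 1

# noisy sine: the first minimum is expected near a quarter period
t = np.arange(2000)
s = np.sin(2 * np.pi * t / 40) + 0.1 * rng.normal(size=t.size)
```

Finite-sample bias of the histogram estimator can produce spurious early minima, which is one reason more robust statistics are preferred in practice.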
We have recently proposed an algorithm [41] which minimizes the L-statistic [29] (the objective function) in each embedding cycle D_d over possible delay values determined by a continuity statistic [65]. Nichkawde [30] minimizes the FNN-statistic in each embedding cycle over time delays given by a statistic which maximizes directional derivatives of the actual reconstruction vectors. However, it cannot be ruled out that these approaches end up in a local minimum of the corresponding objective function, rather than attaining the global minimum.
Here, we propose a Monte Carlo Decision Tree Search (MCDTS) idea to ensure that a global minimum of a freely selectable objective function Γ is reached, e.g., the L- or FNN-statistic or any other suitable statistic which evaluates the goodness of the reconstruction with respect to the task. A statistic Λ_τ, which guides the pre-selection of potential delay values in each embedding cycle (such as the continuity statistic or conditional mutual information), is also freely selectable and can be tailored to the research task. This modular construction might be useful for practitioners, since it has been pointed out that optimal embedding parameters (and thus also the statistics used to approximate them) depend on the research question, e.g., computing dynamical invariants or prediction [63,64,73–75]. Thus, the proposed method is neither restricted to the auto-mutual information for measuring the independence of consecutive reconstruction vector components, nor does it necessarily rely on the ubiquitous false nearest neighbor statistic. Independently of the chosen statistic for potential time delays and of the chosen objective function, the proposed method computes different embedding pathways in a randomized manner and structures these paths as a tree. Consequently, it is able to reveal paths through that tree, if there are any, which lead to a lower value of the objective function than paths that strictly minimize the costs in each embedding cycle. Given a sufficiently high number of samplings, MCDTS guarantees to optimize the chosen objective function Γ over the delay embedding parameter space. In Sect. 2, we describe this method before we apply it to paradigmatic examples in Sect. 3, which include recurrence analysis, nearest-neighbor-based time series prediction, and causal analysis based on convergent cross mapping.

Method
When embedding a time series, in each embedding cycle a suitable delay, and for multivariate data a suitable time series, has to be chosen. While the final embedding vector is invariant to the order of the chosen components, the embedding process, and the statistics and methods used to suggest suitable delays, generally depend on all previous embedding cycles. It therefore seems natural to visualize all possible embedding cycles in a tree-like hierarchical data structure, as shown in Fig. 1. The initial time series s_{i_1} with delay τ_1 = 0 forms the root of the tree, and each possible embedding cycle D_d is a leaf or node of the tree. With the large number of possible delays and time series to choose from, this decision tree becomes too large to be computed fully. At the same time, the aforementioned statistics, like the continuity statistic or conditional mutual information, can guide us in pre-selecting potentially suitable delay values, and an objective function, like the L- or FNN-statistic, can pick the most suitable delay value of the pre-selection by quantifying the quality of the reconstruction in each embedding cycle. Throughout this paper, we denote a statistic which pre-selects potential delay values as Λ_τ and the objective function as Γ. The task of embedding a time series can then be interpreted as minimizing Γ(i_1, i_2, …, i_m, τ_1, τ_2, …, τ_m). Visualizing this with a tree as in Fig. 1, we actually perform a tree search to minimize Γ. However, always choosing the leaf of the tree that decreases Γ the most might lead only to a local minimum.

Fig. 1 All possible embeddings of a time series visualized by a tree. Each leaf of the tree symbolizes one embedding cycle D_d, using one selected time series s_{i_d} from the multivariate data set and delay τ_d. Marked in orange is one chosen full embedding.
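A minimal sketch of such a tree data structure in Python (the class and field names are our own illustration; the actual implementation accompanying the paper may differ):

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddingNode:
    """One node of the embedding tree: the path of embedding cycles taken
    so far and the objective value Gamma reached at this node."""
    series_ids: tuple   # (i_1, ..., i_d): which time series per cycle
    delays: tuple       # (tau_1, ..., tau_d)
    gamma: float        # current value of the objective function
    children: dict = field(default_factory=dict)  # (i, tau) -> EmbeddingNode

    def child(self, i, tau, gamma):
        """Return the child node for embedding cycle (i, tau), creating it
        only if it is new, so previously computed cycles are looked up
        instead of recomputed."""
        key = (i, tau)
        if key not in self.children:
            self.children[key] = EmbeddingNode(self.series_ids + (i,),
                                               self.delays + (tau,), gamma)
        return self.children[key]

# root: unlagged first time series, tau_1 = 0, no objective value yet
root = EmbeddingNode((0,), (0,), float("inf"))
leaf = root.child(0, 6, -0.4)   # one embedding cycle: series 0, delay 6
```

Requesting the same `(i, tau)` pair again returns the stored node, which is the lookup behavior exploited by the sampling procedure described below.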
As we strive to find a global minimum and cannot compute the full embedding tree, we proceed by sampling the tree. This approach is inspired by the Monte Carlo Tree Search algorithms that were originally envisioned to master the game of Go [76]. Ultimately, computer programs based on these algorithms were able to beat a reigning world champion, a feat that was long thought to be impossible for computer programs [77]. Adapting this idea to the embedding problem, we proceed as follows. We randomly sample the full tree; for each embedding cycle, we compute the change in the objective function Γ and pick for the next embedding cycle preferably those delays that decrease Γ further. Each node N_d of the tree encodes one possible embedding cycle and holds the time series used [s_{i_1}, …, s_{i_d}], the delays used until this node [τ_1, …, τ_d], i.e., the current path through the tree up to node N_d, and a value of the objective function Γ_d. We sample the tree N_trials times in a two-step procedure:

– Expand: Starting from the root, for each embedding cycle D_d, possible next steps (s_{i_j}, τ_j, Γ_j) are either computed using suitable statistics Λ_τ and Γ or, if they have already been computed in a previous trial, looked up from the tree. Consider the first embedding cycle D_2 and use the continuity statistic ⟨ε*⟩(τ) for Λ_τ. Then, for each time series s_i, the local maxima of all ⟨ε*⟩(τ) (for a univariate time series there is only one ⟨ε*⟩(τ)) determine the set of possible delay values τ_2 (see the rows in Figs. 1, 2 corresponding to D_2). One of the possible τ_2's is then randomly chosen, with probabilities computed via a softmax of the corresponding values of Γ_j: due to its normalization, the softmax function converts all possible values of Γ_j to probabilities p_j = exp(−βΓ_j)/Σ_k exp(−βΓ_k). This procedure is repeated (consecutive rows for D_3, etc., in Figs. 1, 2) until the very last computed embedding cycle D_{m+1}, i.e., when the objective function Γ_{m+1} cannot be decreased further for any of the τ_{m+1} candidates. Figure 2 visualizes this procedure.

– Backpropagation: After the tree is expanded, the final value Γ_m is backpropagated through the path taken in this trial, i.e., to all leaves (previous embedding cycles d) that were visited during this expand, updating their Γ_d values to that of the final embedding cycle.
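The softmax selection rule p_j = exp(−βΓ_j)/Σ_k exp(−βΓ_k) used in the expand step can be sketched as follows (a minimal illustration with our own function names; the shift by the minimum is only for numerical stability and does not change the probabilities):

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax_probabilities(gammas, beta=2.0):
    """Convert objective values Gamma_j into selection probabilities
    p_j = exp(-beta * Gamma_j) / sum_k exp(-beta * Gamma_k).
    Lower Gamma (a better reconstruction) gets a higher probability."""
    g = np.asarray(gammas, dtype=float)
    w = np.exp(-beta * (g - g.min()))   # shift for numerical stability
    return w / w.sum()

def choose_next_cycle(candidates, gammas, beta=2.0):
    """Pick one candidate delay for the next embedding cycle at random,
    weighted by the softmax of its objective value."""
    p = softmax_probabilities(gammas, beta)
    return candidates[rng.choice(len(candidates), p=p)]
```

With `beta = 0` all candidates are equally likely; for large `beta` the candidate with the smallest Γ dominates, which is exactly the exploration/exploitation trade-off discussed below.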
With this two-step procedure, we iteratively build up the part of the tree that leads to embeddings with the smallest values of the objective function. The following two refinements improve this general strategy. First, in the case of multivariate time series input, the probabilities are chosen uniformly at random in the zeroth embedding cycle D_1. This ensures an even sampling over the given time series, which can all serve as a valid first component of the final reconstruction vectors. Second, as soon as a Γ_j is found that is smaller than the previous global minimum, this embedding cycle is chosen directly and not randomized via the softmax function. This also means that for the very first trial always the smallest value of Γ_j is chosen, resulting in a good starting point for the further Monte Carlo search of the tree. If the continuity statistic ⟨ε*⟩(τ) is used as the delay pre-selection statistic Λ_τ and the ΔL-statistic [29] as the objective function Γ, the first sample is thus identical to the PECUZAL algorithm [41], and every further sample improves upon this embedding by further minimizing ΔL. Aside from the choice of Λ_τ and Γ, the two hyperparameters of the method are the number of trials N_trials and the parameter β of the probability distribution for choosing the next delay value. The parameter β governs how likely it is that the minimum of all Γ_i is chosen; in the extreme cases, for β = 0 the possible delay times are chosen uniformly at random, and for β → ∞ always the smallest Γ_i is chosen. For the tree search algorithm, this means that β governs how "wide" the tree search is: larger β values search the tree more along the previously found minima, whereas for smaller values the tree search stresses previously unvisited paths through the tree more strongly. The default value of β used in all shown results is β = 2.
The computational complexity of this algorithm obviously scales with the number of trials N_trials, even though already computed embedding cycles are not computed again in later trials: when sampling the tree many times, the path through the first few embedding cycles will often be the same as in previous trials, and in these cases computing the delay pre-selection and objective function would yield identical results. All values of possible delays and of the objective function that are computed in previous trials are therefore saved during the tree search and reused whenever the same embedding cycle is encountered again.
Otherwise, the complexity depends on the chosen delay pre-selection function Λ_τ and the objective function Γ. It has to be clear that the algorithm is computationally much more demanding than a classical TDE. However, once an embedding is computed for a specified system, it can be reused in later applications. Here, as an example, we use the continuity statistic ⟨ε*⟩(τ) as the delay pre-selection statistic Λ_τ and the ΔL-statistic [29] as the objective function Γ, as utilized in the recently proposed PECUZAL algorithm [41].

Applications
In this section, we present the potential of the proposed MCDTS method in various applications. We aim to provide suggestions and show that there are a number of state-space-based applications that directly benefit from our method and yield better results than with state-of-the-art embedding techniques. A variety of applications are presented to support the fact that different research questions elicit different embedding behavior and that our proposed method is able to optimize the embedding with respect to different study objectives. In particular, we investigate the influence of the state space reconstruction parameters on a recurrence analysis of the chaotic Lorenz-96 system (Sect. 3.1), a nearest-neighbor time series prediction for the chaotic Hénon map and for a palaeoclimate dataset (Sects. 3.2, 3.3), and, last but not least, a causal analysis of two physical observables of a combustion process (Sect. 3.4). The selected applications cover many areas of nonlinear time series analysis, and it is not our intention here to propose new techniques for prediction or causal analysis which are necessarily superior to other, alternative approaches. We rather chose well-established state-space-based methods and use them to show how our proposed method optimizes results with respect to the chosen embedding.

Recurrence properties of the Lorenz-96 system
At first, we consider a potentially higher-dimensional nonlinear dynamical system and compare the recurrence properties of its dynamics as derived from the original set of system variables with those obtained by applying the different embedding approaches. We utilize the Lorenz-96 system [79], a set of N ordinary first-order differential equations

dx_i/dt = (x_{i+1} − x_{i−2}) x_{i−1} − x_i + F,

with x_i being the state of the system at node i = 1, …, N, cyclic boundary conditions (x_{i−N} = x_{i+N} = x_i), and the assumption that the total number of nodes is N ≥ 4. We can think of this system as a ring-like structure of N coupled oscillators, each representing some atmospheric quantity, all connected to the same forcing. The forcing constant F serves as the control parameter. Here, we vary F from F = 3.7 to 4.0 in steps of 0.002, covering limit cycle dynamics as well as chaos. We set N = 8, randomly choose the initial condition u_0 = [0.590; 0.766; 0.566; 0.460; 0.794; 0.854; 0.200; 0.298], and use a sampling time of Δt = 0.1. By discarding the first 2500 points of the integration as transients, we get time series consisting of 5000 samples for each of the considered values of F. We focus on two scenarios, mimicking a uni- and a multivariate embedding case: (1) only the time series of the 2nd node is used (univariate embedding), and (2) the three time series of nodes 2, 4, and 7 are used (multivariate embedding). For each of these time series, we perform an embedding using three classic time delay approaches as proposed by Kennel et al. [69] (5% threshold), Cao [66] (slope threshold of 0.2), and Hegger and Kantz [67] (5% threshold), with a uniform delay value estimated as the first minimum of the auto-mutual information (only applicable in the univariate case), as well as the recently proposed PECUZAL algorithm [41]. For our proposed MCDTS approach, we embed the data using the continuity statistic ⟨ε*⟩(τ) as the delay pre-selection statistic Λ_τ. For the objective function Γ, we try two different approaches, namely the ΔL-statistic [29] (MCDTS-C-L) as well as the FNN-statistic [67] (MCDTS-C-FNN).
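For readers who want to reproduce such time series, a minimal fixed-step RK4 integration of the Lorenz-96 ring might look as follows (the step sizes, sample counts, and helper names are our own choices for illustration, not necessarily those used for the results in this paper):

```python
import numpy as np

def lorenz96(x, F):
    """Right-hand side dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F,
    with cyclic indices realized via np.roll (ring of N nodes)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def integrate(x0, F, dt=0.1, n_samples=500, transient=250, substeps=5):
    """Sample the system every dt time units using `substeps` RK4 steps
    of size dt/substeps per sample; an initial transient is discarded."""
    h = dt / substeps
    x = np.array(x0, dtype=float)
    out = []
    for sample in range(n_samples + transient):
        for _ in range(substeps):
            k1 = lorenz96(x, F)
            k2 = lorenz96(x + 0.5 * h * k1, F)
            k3 = lorenz96(x + 0.5 * h * k2, F)
            k4 = lorenz96(x + h * k3, F)
            x = x + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        if sample >= transient:
            out.append(x.copy())
    return np.array(out)

x0 = [0.590, 0.766, 0.566, 0.460, 0.794, 0.854, 0.200, 0.298]
traj = integrate(x0, F=3.8, n_samples=300, transient=100)
```

Column 1 of `traj` (node 2, zero-based index 1) would then serve as the univariate input, and columns for nodes 2, 4, and 7 as the multivariate input.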
In all approaches, we discard serially correlated points from the nearest neighbor search by setting a Theiler window [80] to the first minimum of the mutual information. An overview of all MCDTS implementations and abbreviations is given in Table 1. By varying the control parameter F, the system changes its dynamics, which is well represented by a change in the recurrence behavior [81]. In previous work, we have demonstrated that recurrence quantification analysis (RQA) can be used to qualitatively characterize the typical dynamical properties of the Lorenz-96 system, such as chaotic or periodic dynamics [82]. We therefore compare the recurrence properties of all reconstructed trajectories to the recurrence properties of the true trajectory (obtained from the numerical integration) by using RQA. The neighborhood relations of a trajectory can be visualized in a recurrence plot (RP), a binary, square matrix R representing the recurrences of states x_i (i = 1, …, N, with N the number of points forming the trajectory) in a d-dimensional (optionally reconstructed) state space [83,84],

R_{ij} = Θ(ε − ||x_i − x_j||), i, j = 1, …, N,

with ||·|| a norm, ε a recurrence threshold, and Θ the Heaviside function. There are numerous ideas of how to quantify an RP [84,85]. Some statistics are based on the distribution of recurrence points, some on the diagonal line structures, some on the vertical structures, and it is also possible to use complex-network measures, when interpreting R (subtracting the main diagonal) as the adjacency matrix A = R − Id of a recurrence network (RN) [86], with Id the identity matrix. Some of these quantifiers are related to dynamical invariants [87,88].
For our purpose of comparing different aspects of recurrence properties of original and reconstructed trajectories, we use the transitivity (TRANS) of the ε-RN, the determinism (DET), the mean diagonal line length (L_mean), the maximum diagonal line length (L_max) and its reciprocal (DIV), the entropy of diagonal line lengths (ENTR), the trend (TREND), the mean recurrence time (MRT), the recurrence time entropy (RTE), and, finally, the joint recurrence rate fraction (JRRF). JRRF measures the accordance of the recurrence plot of the (true) reference system, R_ref, with the RP of the reconstruction, R_rec.
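A bare-bones sketch of how an RP with a prescribed global recurrence rate and one diagonal-line measure (DET) can be computed (a brute-force illustration with our own function names, not the optimized RQA software used for the results in this paper):

```python
import numpy as np

def recurrence_plot(traj, recurrence_rate=0.05):
    """Binary recurrence matrix R_ij = Theta(eps - ||x_i - x_j||), with the
    threshold eps fixed as the quantile of all pairwise distances that
    yields the target global recurrence rate."""
    d = np.linalg.norm(traj[:, None, :] - traj[None, :, :], axis=-1)
    eps = np.quantile(d, recurrence_rate)
    return (d <= eps).astype(int)

def determinism(R, l_min=2):
    """DET: fraction of recurrence points forming diagonal lines of
    length >= l_min (main diagonal excluded)."""
    n = R.shape[0]
    hist = np.zeros(n + 1, dtype=int)     # histogram of diagonal line lengths
    for k in range(-(n - 1), n):
        if k == 0:
            continue                      # skip the main diagonal
        length = 0
        for v in list(np.diagonal(R, offset=k)) + [0]:  # trailing 0 flushes
            if v:
                length += 1
            else:
                if length:
                    hist[length] += 1
                length = 0
    lengths = np.arange(n + 1)
    total = (hist * lengths).sum()
    return (hist[l_min:] * lengths[l_min:]).sum() / total if total else 0.0

# periodic toy trajectory: two revolutions on a circle -> DET close to 1
theta = np.linspace(0, 4 * np.pi, 200)
traj = np.column_stack([np.cos(theta), np.sin(theta)])
R = recurrence_plot(traj)
```

The quantile-based threshold is one simple way to realize the fixed 5% recurrence rate used throughout this section; a bisection on ε would work as well.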
We compute both R_ref and R_rec by fixing the recurrence threshold such that it corresponds to a global recurrence rate (RR) of 5%, in order to ensure comparability [89]. Although the quantification measures depend crucially on the chosen recurrence threshold, the particular choice we make here is not so important, since we apply it to all RPs we compare. RR = 5% ensures a proper resolution of the inherent structures to be quantified by the ten aforementioned measures. The described procedure is schematically illustrated in Fig. 3. For each reconstruction method and for each of the ten RQA statistics, the mean squared error (MSE) with respect to the RQA statistics of the true reference trajectory is computed (normalized to the reference RQA values). The pairwise comparison of the MSEs is evaluated as the percentage of the ten RQA-MSEs which take a lower value (Fig. 4). For instance, a value of 70% in the table indicates that for seven out of the ten considered RQA quantifiers the normalized mean squared error for the reconstruction method displayed on the y-axis is lower than for the reconstruction method displayed on the x-axis. The m-notation indicates the multivariate embedding approach, where three instead of one time series have been passed to the reconstruction methods (x_2(t), x_4(t), and x_7(t), see Fig. 3). Since the classic TDE algorithms of Cao, Kennel et al., and Hegger & Kantz are not able to handle multivariate input data, only PECUZAL and the proposed MCDTS idea, combined with the L-statistic and with the FNN-statistic, are considered in the multivariate scenario. The superiority over the three classic TDE methods is discernible in values > 50% for PECUZAL and MCDTS in the first three columns. While we would expect a better reconstruction for the multivariate cases, because we simply provide more information, our proposed method also performs better in the univariate case when the FNN-statistic is used as the objective function. When using MCDTS with the L-statistic, there is hardly any improvement discernible, while the computational costs are orders of magnitude higher. Here, PECUZAL reveals better results, even though it uses the same statistics. However, combined with the FNN-statistic, our proposed idea performs very well in the univariate case and reveals excellent results for the multivariate case.

Fig. 3 Schematic illustration of the two scenarios (1) and (2) (see text for details). In the case of the univariate approach, the x_2(t) time series gets embedded by all considered reconstruction methods; for the multivariate approach, three time series (x_2(t), x_4(t), and x_7(t)) are passed to the reconstruction algorithms. From the reconstructed attractors, we obtain a recurrence plot and quantify it (RQA) using ten different quantifiers. The same is done for the reference trajectory obtained from all 8 time series of the numerical integration. Repeating the analysis for time series corresponding to varying values of the control parameter F of the system, we finally obtain time series of the RQA quantifiers for each reconstruction method as well as for the true trajectory.

Comparison of RQA results with ground truth (Lorenz-96 system, 8 nodes)

Fig. 4 Results of the analysis of the Lorenz-96 system with varying control parameter and for all considered reconstruction approaches (see Table 1 for notations). Shown is the pairwise comparison of the normalized mean squared error of all ten considered RQA quantifiers with respect to the true RQA time series (see text for details). For instance, a value of 70% in the table indicates that for seven out of the ten considered RQA quantifiers the normalized mean squared error for the reconstruction method displayed on the y-axis is lower than for the reconstruction method displayed on the x-axis.

Short time prediction of the Hénon map time series
In the following, a state space reconstruction v(t) of a single time series s(t) is used to predict its further course. Besides a very recent idea [90] to train neural ordinary differential equations on a reconstructed trajectory, which then allows prediction, several approaches have been published [10–16] which more or less rely on the same basic idea. For the last vector of the reconstructed trajectory, denoted with a time index l, v(t_l), a nearest neighbor search is performed. These neighbors are then used to predict the future value of this point T time steps ahead, v(t_{l+T}). Knowledge of the embedding that led to the reconstruction vectors v(t) then allows reading the prediction of the time series, s(t_{l+T}), from the predicted reconstruction vector v(t_{l+T}). Usually, T = 1, i.e., the forecast is iteratively built by appending v(t_{l+T}) to the trajectory v(t_i), i = 1, …, l, and this procedure is repeated N times in order to obtain an N-step prediction. The aforementioned approaches differ in the way they construct a local model of the dynamics based on the nearest neighbors. For instance, Farmer and Sidorowich [11] proposed a linear approximation, i.e., a linear polynomial is fitted to the pairs (v(t_{nn_i}), v(t_{nn_i+T})), where nn_i denotes the ith nearest neighbor time index. Sugihara and May [16] used a simplex with minimum diameter to select the nearest neighbor indices nn_i and projected this simplex T steps into the future. The prediction is then made by computing the location of the original predictee v(t_l) within the range of the projected simplex, "giving exponential weight to its original distances from the relevant neighbors." Here, a much simpler idea is considered: a zeroth-order approximation of the local dynamics. The prediction is simply the projection of the nearest neighbor of v(t_l), denoted by the index nn_1: v(t_{l+T}) = v(t_{nn_1+T}).
It is clear that the performance of all prediction approaches based on an approximation of the local dynamics by nearest neighbors will crucially depend on the length of the training set. By training set, we mean the time series s(t) which has been used to construct the trajectory v(t). We hypothesize that the accuracy of such a prediction will also depend on the reconstruction method, especially when the training set is rather short (Small and Tse [43] and also Bradley and Kantz [73]). In particular, Garland and Bradley [74] have shown that accurate predictions can be achieved with the aforementioned zeroth-order approximation when using an incomplete embedding of the data, i.e., reconstructions that do not satisfy the theoretical requirements on the embedding dimension in Takens' sense. As a proof of concept, we now use the described nearest-neighbor prediction method to predict the x-time series of the Hénon map [91], even though other simple models, like low-order polynomial models, might be superior for such noise-free and purely deterministic dynamics (we provide a more challenging example in Sect. 3.3). By "in-sample," we mean the training set, which is used for the reconstruction. For all MCDTS implementations and abbreviations, see again Table 1. The accuracy of the prediction is evaluated by the normalized root-mean-square forecast error

e_rms(T) = sqrt( ⟨(s_pred(t+T) − s_true(t+T))²⟩ / ⟨(s_true(t) − ⟨s_true⟩)²⟩ ),

with the index "true" denoting the test set values. This way, e_rms(T) = 0 indicates a perfect prediction, whereas e_rms(T) ≈ 1 means that the prediction is not better than a constant mean-predictor of the test set. Figure 5 shows the mean forecast accuracy for the traditional TDE methods (Cao, Kennel et al., Hegger & Kantz) and two selected MCDTS approaches as a function of the prediction time. The largest Lyapunov exponent is estimated as λ_1 ≈ 0.419, and we display Lyapunov times on the x-axis, i.e., units of 1/λ_1. As in Sect. 3.1, m indicates the multivariate case, in which both the x- and y-time series are fed into the reconstruction algorithms. The results for all discussed reconstruction methods can be found in Appendix A (Fig. 8). As expected, the forecast accuracy is worse in the case of added white noise (Fig. 5B), and the predictions based on multivariate reconstructions perform slightly better. The MCDTS-based forecasts perform significantly better than the forecasts based on the traditional TDE methods. Even though the continuity statistic constitutes a reasonable delay pre-selection statistic with a clear physical meaning, when utilized in our MCDTS approach (MCDTS-C-), it does not perform as well as when we do not pre-select delays on the basis of some statistic but instead try delays over a whole range of values (τ ∈ [0, 50], MCDTS-R-). At least, this statement holds for this example of the Hénon map time series. A Wilcoxon rank sum test is applied to underpin the better performance of the MCDTS approaches in comparison with the classical time delay methods. To this end, we define a threshold ζ = 0.1 and compute the prediction times for which e_rms(T) first exceeds ζ, for all trials and for all considered reconstruction methods. These distributions of prediction times for each method are used for the statistical test with the null hypothesis that two considered distributions have equal medians.
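The iterated zeroth-order forecast and the normalized RMS error described above can be sketched as follows (a simplified univariate illustration with our own function names and with fixed embedding parameters m and τ chosen by hand, rather than by any of the discussed reconstruction methods):

```python
import numpy as np

def henon(n, a=1.4, b=0.3, x0=0.1, y0=0.1, transient=100):
    """Generate n samples of the Henon x-time series after a transient."""
    x, y = x0, y0
    out = []
    for k in range(n + transient):
        x, y = 1 - a * x * x + b * y, x
        if k >= transient:
            out.append(x)
    return np.array(out)

def zeroth_order_forecast(train, m, tau, horizon):
    """Iterated one-step zeroth-order prediction: embed the series, find
    the nearest neighbour of the last reconstruction vector, and use the
    neighbour's successor as the next predicted value."""
    s = list(train)
    for _ in range(horizon):
        v = np.array([[s[t - j * tau] for j in range(m)]
                      for t in range((m - 1) * tau, len(s) - 1)])
        query = np.array([s[len(s) - 1 - j * tau] for j in range(m)])
        nn = np.argmin(np.linalg.norm(v - query, axis=1))
        s.append(s[(m - 1) * tau + nn + 1])  # successor of nearest neighbour
    return np.array(s[len(train):])

def normalized_rmse(pred, truth):
    """e_rms = sqrt( <(pred - truth)^2> / <(truth - <truth>)^2> )."""
    return np.sqrt(np.mean((pred - truth) ** 2) /
                   np.mean((truth - truth.mean()) ** 2))

s = henon(1005)
train, truth = s[:1000], s[1000:1005]
pred = zeroth_order_forecast(train, m=2, tau=1, horizon=5)
err = normalized_rmse(pred, truth)
```

Rebuilding the embedding in every step lets predicted points serve as neighbour candidates in later steps, which is the iterated scheme described in the text; it is deliberately brute-force and not tuned for speed.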
The tests complement the visual analysis of Figs. 5 and 8. In the noise-free case, a significantly better forecast performance (α = 0.01) than the classic time delay embedding methods can be verified for PECUZAL and all considered MCDTS-based approaches, except those combined with the FNN-statistic (MCDTS-FNN). In the case of the noise-corrupted time series, PECUZAL (m), all MCDTS-MSE approaches, and MCDTS-C-L (m) achieve a significantly better prediction performance than the classical time delay methods.
Some remarks: together with PECUZAL (m) and MCDTS-R-MSE (m), MCDTS-C-L (m) achieves the overall best results (Fig. 8). The choice of the threshold ζ is obviously subjective, but a range of thresholds gave similar results, and the "grouping" of the results according to the different techniques is already clearly discernible when looking at the mean values (Figs. 5, 8). We have to mention that we could not achieve results as shown here for continuous systems like the Lorenz-63 or the Rössler model. In those cases, the difference in the prediction accuracy was not as clear as in the Hénon example and not significant, for both noise-free and noise-corrupted time series. We also investigated the influence of the time series length of the training sets, but the results did not change much; all reconstruction methods gave similar prediction results. We could, however, observe that simple and incomplete embeddings, i.e., a too low embedding dimension, often, but not always, led to similarly good prediction results compared to "full" embeddings. This was true for the continuous examples (not shown in this work), but it also holds for the Hénon example shown here, where the MCDTS-C-L approach does not yield the best results in the univariate case, although it targets the total minimum of the L-objective function, which the authors consider to be a suitable cost function for a good/full embedding. These observations are in line with the findings of Garland and Bradley [74], and the fact that our reconstruction methods tend to suggest higher-dimensional embeddings with smaller delays in the presence of noise supports the findings of Small and Tse [43].

Fig. 5 A: Normalized root-mean-square prediction errors (RMS) for the Hénon x-time series and for selected reconstruction methods (see Fig. 8 for all mentioned approaches and Table 1) as a function of the prediction time. Shown are mean values of a distribution of 100 trials with different initial conditions. For the prediction, we use a one-step-ahead zeroth-order approximation on the nearest neighbor of the last point of the reconstructed trajectory and iteratively repeat that procedure 30 times in order to obtain a prediction of 31 samples in total for each trial. B: Same as in A, but with 3% additive white noise.
The FNN statistic does not seem to be useful in the prediction application shown here, since all approaches that make use of it (including classic TDE) perform clearly worse than the other methods considered.
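For illustration, the iterated one-step zeroth-order prediction and the normalized RMS error used throughout this section can be sketched as follows. This is a minimal sketch, not the paper's code: the reconstruction here is a plain delay embedding with assumed parameters m = 2, τ = 1 (which, for the Hénon map, already resolves the deterministic two-step recursion), and all function names are ours.

```python
import numpy as np

def henon_x(n, a=1.4, b=0.3, x0=0.1, y0=0.1, discard=100):
    """x-component of the Hénon map, transients discarded."""
    x, y = x0, y0
    out = np.empty(n + discard)
    for i in range(n + discard):
        x, y = 1.0 - a * x * x + y, b * x
        out[i] = x
    return out[discard:]

def delay_embed(s, m, tau):
    """Rows are [s(t), s(t - tau), ..., s(t - (m-1) tau)]."""
    n = len(s) - (m - 1) * tau
    return np.column_stack([s[(m - 1 - j) * tau : (m - 1 - j) * tau + n]
                            for j in range(m)])

def predict_zeroth_order(s, m=2, tau=1, horizon=30):
    """Iterated one-step zeroth-order prediction: each step appends the
    successor of the nearest neighbor of the current reconstructed state."""
    work = list(s)
    for _ in range(horizon):
        traj = delay_embed(np.asarray(work), m, tau)
        query = traj[-1]
        d = np.linalg.norm(traj[:-1] - query, axis=1)  # exclude the query itself
        nn = int(np.argmin(d))
        # traj row nn corresponds to time nn + (m-1)*tau; append its successor
        work.append(work[nn + (m - 1) * tau + 1])
    return np.array(work[len(s):])

def e_rms(pred, true):
    """Normalized root-mean-square forecast error."""
    return np.sqrt(np.mean((pred - true) ** 2) /
                   np.mean((true - np.mean(true)) ** 2))

s = henon_x(1030)
train, test = s[:1000], s[1000:]
pred = predict_zeroth_order(train, m=2, tau=1, horizon=30)
print(f"e_rms over 30 steps: {e_rms(pred, test):.3f}")
```

Note that the predicted values are fed back into the working series, so later forecast steps query a trajectory that already contains earlier predictions, exactly the iterated scheme described above.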

Improved short-time predictions for CENOGRID
To demonstrate that the prediction procedure from the preceding section works for real, noisy data, we apply it to the recently published CENOzoic Global Reference benthic foraminifer carbon and oxygen Isotope Dataset (CENOGRID) [92]. The temperature-dependent fractionation of carbon and oxygen isotopes in benthic foraminifera is an important means to reconstruct past global temperatures and environmental conditions. Moreover, the Cenozoic is interesting because it provides an analogue of future greenhouse climate and indicates how and which regime shifts in large-scale atmospheric and ocean circulation can be expected in the future warming climate. Predicting these data may be unrealistic and not motivated by an actual research question; however, this task shall serve as a proof of concept. The non-stationarity and noise level of CENOGRID make prediction particularly difficult.
The dataset consists of a detrended δ18O and a detrended δ13C isotope record with a total length of N = 13,421 samples and a sampling period of Δt = 5000 yrs (Fig. 9 in Appendix B). Here, we make predictions on the δ13C isotope record. The first 13,311 samples are used as a training set, from which state space reconstructions are obtained. The remaining 110 samples of the δ13C record act as the test set. For 100 different starting points in the test set, we make 10-step-ahead predictions for each reconstruction method by using the embedding parameters gained from the training and the iterative zeroth-order approximation prediction procedure described in Sect. 3.2. This way, we simulate different initial conditions for the prediction and obtain a distribution of forecasts for each reconstruction method. We again use a Wilcoxon rank sum test on these distributions in order to see whether predictions based on some reconstruction method are significantly better than the predictions obtained from classic TDE (Cao, Kennel et al., Hegger & Kantz). Only one of the applied reconstruction methods (listed in Table 1), MCDTS-R-MSE (m), scores significantly better predictions (highly significant for prediction horizons up to 4Δt and significant for prediction horizons up to 5Δt). Figure 6A shows the mean normalized root-mean-square prediction error gained from the 100 predictions for the classic TDE and the mentioned MCDTS-R-MSE (m). The distribution of all prediction trials for the best performing classic TDE method (Hegger & Kantz) and for MCDTS-R-MSE (m) is shown in panels B, C. Even though the multivariate approach MCDTS-R-MSE (m) could have used both the δ18O and the δ13C time series for the reconstruction, it only uses δ13C lagged by 1 and 2 samples in a three-dimensional reconstruction. The classic TDE methods and all other reconstruction methods (listed in Table 1, not shown in Fig. 9) revealed higher-dimensional embeddings (Table 2). Yet, all these higher-dimensional reconstructions give poor prediction results, except for MCDTS-C-MSE-KL (m), which gives significantly better predictions (α = 0.05) than the classic TDE methods, at least for the one-step-ahead prediction.
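The evaluation protocol used here, i.e., forecasts issued from many starting points within a test set and collected into an error distribution, can be sketched generically. The data and predictor below are stand-ins, since CENOGRID itself is not reproduced here: an AR(1) surrogate series and a simple persistence forecast, both purely illustrative.

```python
import numpy as np

def forecast_error_distribution(series, n_train, predictor, horizon=10,
                                n_starts=100, seed=0):
    """From n_starts random positions in the test set, issue horizon-step
    forecasts and collect the normalized RMS errors as a distribution."""
    rng = np.random.default_rng(seed)
    train, test = series[:n_train], series[n_train:]
    test_var = np.var(test)
    errors = np.empty(n_starts)
    for k in range(n_starts):
        # start index chosen so that `horizon` true values remain for comparison
        s0 = int(rng.integers(0, len(test) - horizon))
        history = np.concatenate([train, test[:s0 + 1]])
        pred = predictor(history, horizon)
        truth = test[s0 + 1 : s0 + 1 + horizon]
        errors[k] = np.sqrt(np.mean((pred - truth) ** 2) / test_var)
    return errors

# stand-in data and predictor: an AR(1) surrogate record and a
# persistence forecast that simply repeats the last observed value
rng = np.random.default_rng(7)
noise = rng.normal(0.0, 0.1, 2000)
series = np.empty(2000)
series[0] = 0.0
for t in range(1999):
    series[t + 1] = 0.9 * series[t] + noise[t]

persistence = lambda hist, h: np.full(h, hist[-1])
errs = forecast_error_distribution(series, n_train=1800, predictor=persistence)
print(f"median normalized error: {np.median(errs):.2f}")
```

The resulting error distributions of two competing predictors are exactly what the Wilcoxon rank sum test above is applied to.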

Estimating causal relationship of observables of a thermoacoustic system
As a final proof of concept, we utilize state space reconstruction for detecting causality between observables X and Y in a turbulent combustion flow in a gas turbine. It is possible to infer a causal relationship between two (or more) time series x(t) and y(t) via convergent cross mapping (CCM) [18,19,93], which, in contrast to Granger causality [94], also works for time series stemming from non-separable systems, i.e., deterministic dynamical systems. The CCM method "tests for causation by measuring the extent to which the historical record of Y values can reliably estimate states of X. This happens only if X is causally influencing Y." [18] This also incorporates the embedding theorems [1-3] in the sense that a state space reconstruction based on x(t) is diffeomorphic to a reconstruction based on y(t), if x(t) and y(t) describe the same dynamical system and the embedding parameters have been chosen correctly. To check for a causal relationship X → Y, a state space reconstruction of y(t) yields a trajectory v_y(t) ∈ R^m, with m denoting the embedding dimension, which is then used for estimating values of x(t), namely x̂(t). It is said that v_y(t) cross-maps x(t) in order to obtain the estimates x̂(t). Technically, this is done by first searching for the m + 1 nearest neighbors of a point corresponding to a time index t′ ∈ t, i.e., finding the m + 1 time indices t_i^NN, i = 1, ..., m + 1, of the nearest neighbors of v_y(t′). These time indices t_i^NN are then used to "identify points (neighbors) in X (a putative neighborhood) to estimate x(t′) from a locally weighted mean of the m + 1 x(t_i^NN) values" [18]:

x̂(t′) = Σ_{i=1}^{m+1} w_i x(t_i^NN),

with the weighting w_i based on the nearest-neighbor distances to v_y(t′),

w_i = u_i / Σ_{j=1}^{m+1} u_j,   u_i = exp( −‖v_y(t′) − v_y(t_i^NN)‖ / ‖v_y(t′) − v_y(t_1^NN)‖ ),

with ‖·‖ a norm (we used Euclidean distances). Finally, the agreement of the cross-mapped estimates x̂(t′) with the true values x(t′) is quantified for all considered t′ ∈ t, e.g., by computing a linear Pearson correlation ρ_CCM, which has been done in this study. The key point is that the estimation skill, here represented by ρ_CCM, increases with the amount of data used, if X indeed causally influences Y. This is because the attractor, represented by the reconstruction vectors v_y(t), gets resolved better with increasing time series length, resulting in closer nearest neighbors and therefore a better concordance of x̂(t) and x(t), i.e., an increase in ρ_CCM with increasing time series length. This convergence of the cross-mapping estimation skill is a necessary condition for causation, not only a high value of ρ_CCM itself (Fig. 7A). Although the embedding process is key to a successful application of CCM to data, its influence has not been discussed by Sugihara et al. [18]. However, Schiecke et al. [95] briefly discussed the impact of the embedding parameters on CCM, and we hypothesize that the embedding method can play a crucial role when analyzing real-world data. Therefore, we utilize the MCDTS framework in the following way. As the delay pre-selection method Λ_τ, we use the reliable continuity statistic ⟨ε⋆⟩(τ) [65,78]. As the objective function Γ, we use the negative of the corresponding ρ_CCM, i.e., MCDTS optimizes the embedding with respect to maximizing ρ_CCM of two given time series. According to the abbreviation scheme given in Table 1, we refer to this approach as MCDTS-C-CCM. We apply the CCM method to time series data that span the different dynamical regimes of a thermoacoustic system. Here, we investigate the mutual causal influence of two recorded variables of the thermoacoustic system, namely the pressure and the heat release rate fluctuations (Fig. 10).
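A minimal sketch of the cross-mapping skill ρ_CCM may look as follows, assuming the exponential nearest-neighbor weighting of Sugihara et al. [18]. The coupled logistic maps serve as stand-in data (X unidirectionally drives Y), and the embedding parameters m = 2, τ = 1 as well as all function names are illustrative assumptions.

```python
import numpy as np

def delay_embed(s, m, tau):
    n = len(s) - (m - 1) * tau
    return np.column_stack([s[(m - 1 - j) * tau : (m - 1 - j) * tau + n]
                            for j in range(m)])

def ccm_rho(x, y, m=2, tau=1, lib=None):
    """Cross-map skill for X -> Y: reconstruct from y, estimate x from the
    m + 1 nearest neighbors with exponential weights, correlate with truth."""
    vy = delay_embed(y, m, tau)[:lib]
    xs = x[(m - 1) * tau:][:len(vy)]
    est = np.empty(len(vy))
    for i, q in enumerate(vy):
        d = np.linalg.norm(vy - q, axis=1)
        d[i] = np.inf                          # exclude the point itself
        nn = np.argsort(d)[:m + 1]             # m + 1 nearest neighbors
        u = np.exp(-d[nn] / max(d[nn[0]], 1e-12))
        w = u / u.sum()
        est[i] = w @ xs[nn]                    # locally weighted mean
    return np.corrcoef(est, xs)[0, 1]

# unidirectionally coupled logistic maps: X drives Y (illustrative stand-in)
N = 1200
x = np.empty(N); y = np.empty(N)
x[0], y[0] = 0.4, 0.2
for t in range(N - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t])               # autonomous driver
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])  # driven by x

# convergence: cross-map skill grows with the library size
rhos = [ccm_rho(x, y, lib=L) for L in (100, 400, 1100)]
print([round(r, 2) for r in rhos])
```

With increasing library size, the skill of cross-mapping x from the reconstruction of y grows, which illustrates the convergence criterion used in this section.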
The original experiments were performed on a turbulent combustor with a rectangular combustion chamber (length 700 mm, cross-section 90 mm × 90 mm, Fig. 11). In such a combustion experiment, a fixed vane swirler is used to stabilize the flame, and a central shaft that supports the swirler injects the fuel through four radial injection holes. The fuel used is liquefied petroleum gas (60% butane and 40% propane). The airflow enters through the inlet to the combustion chamber, and the partially premixed reactant mixture is ignited using a spark plug. Once the flame is established in the combustor, we continuously varied the control parameter (the mass flow rate of air, which, in turn, varies the Reynolds number Re = ρUD/μ, with ρ the density, U a characteristic velocity, D a characteristic dimension (the diameter), and μ the viscosity, as well as the equivalence ratio, i.e., the ratio of the actual fuel-air ratio to the stoichiometric fuel-air ratio) to observe the dynamical transitions in the system. Acoustic pressure fluctuations were measured using a piezoelectric transducer (PCB103B02) and the heat release rate using a photomultiplier tube (Hamamatsu H10722-01), both at a sampling rate of 4 kHz. The interactions between the turbulent flow, the unsteady fluctuations of the flame, and the acoustic field of the chamber lead to different dynamical states. As the airflow rate increases, the system transitions from a state of stable operation (which comprises high-dimensional chaos of low amplitude [96]) to intermittency, a state that comprises bursts of periodic oscillations amid epochs of aperiodicity [97], and then to a limit cycle [98]. The self-sustained limit cycle oscillations represent a state of oscillatory instability, known as thermoacoustic instability [99]. When the flow rate of air is increased further, the flame loses its stability inside the combustor and blows out.
The pressure and heat release rate data capture the transition through all these dynamical states in sequence. Across the many different dynamical regimes recorded in the time series, we expect the strength of the causal interaction between the heat release and the pressure to vary, but in all regimes we expect a mutual causal interaction between heat release and pressure. Moreover, since a possible asymmetric bi-directional coupling between heat release and pressure has been discovered in a stationary setup of a very similar experiment [100], we would also expect that the heat release rate has a slightly stronger causal influence on pressure than vice versa. In short, the goal here is twofold: (1) to demonstrate the expected mutual causal relationship between heat release rate and pressure, and (2) to verify the hypothesized asymmetry in its strength, by applying MCDTS-C-CCM on a range of time series sampled from the entire record (Fig. 10).
We compare our approach with results obtained from using the CCM method with the classical embedding approach of Cao [66]. Specifically, we set up the following workflow for this analysis:

1. 50 time indices t′ ∈ t are drawn randomly, where t covers the entire record.
2. For each of these indices t′, time series of length N = 5000 for pressure and heat release are obtained and standardized to zero mean and unit variance (Fig. 10).
3. Both time series samples (of full length N = 5000) are embedded using Cao's method as a classical reference and using our proposed framework MCDTS-C-CCM with 100 trials (Table 1). Based on the obtained reconstructions, ρ_CCM-Cao and ρ_CCM-MCDTS are computed for both directions as a function of increasing time series length, as exemplarily shown in Fig. 7A.
4. To ensure convergence in the CCM sense, we fit a linear model to ρ_CCM (dashed black lines in Fig. 7A), and whenever that model yields a positive slope and the last value of ρ_CCM (i.e., for the longest considered time series of length N = 5000) exceeds a value of 0.2, we infer a true causal relationship.
5. When we detect a causal relation simultaneously in both directions, we compute the average of the pointwise difference ρ_CCM^{heat→pressure} − ρ_CCM^{pressure→heat}.

The minimum required value of 0.2 for ρ_CCM is an arbitrary and subjective choice, and we could have made other choices. But since this procedure is applied to ρ_CCM-Cao and ρ_CCM-MCDTS alike, we think it is reasonable, and it prevents samples from being classified as "truly causal" when ρ_CCM is near 0 but shows a positive linear trend. Results change only slightly when this value is varied within the interval [0.2, 0.3]. Figure 7B summarizes the results obtained for both considered embedding methods. Shown are the classification results for correctly deducing a causal influence of pressure on heat release (left panel) and of heat release on pressure (middle panel) based on our definition (item 4 in the list above).
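Workflow item 4 can be sketched in a few lines; the function name `is_causal` and the example values are hypothetical, but the rule (positive slope of a linear fit plus a final ρ_CCM above 0.2) follows the text.

```python
import numpy as np

def is_causal(lib_sizes, rho_ccm, rho_min=0.2):
    """Positive linear trend of rho_CCM over the library size and a final
    skill above rho_min together indicate a true causal link."""
    slope = np.polyfit(lib_sizes, rho_ccm, 1)[0]
    return bool(slope > 0 and rho_ccm[-1] > rho_min)

libs = np.array([100, 500, 1000, 2500, 5000])
print(is_causal(libs, np.array([0.30, 0.45, 0.55, 0.60, 0.62])))  # converging
print(is_causal(libs, np.array([0.10, 0.09, 0.11, 0.10, 0.10])))  # flat, low
```

The second curve is rejected even though its fitted slope can be marginally positive, because its final skill stays below the 0.2 floor; this is exactly the failure mode the floor is meant to exclude.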
Thus, in this first step, we do not measure the strength of the causal relationship, but rather test whether such a relationship actually exists. While MCDTS-C-CCM yields a correct classification in 92% of all considered cases (50 samples) for pressure → heat release and 94% for heat release → pressure, Cao's method is only able to correctly classify 44% and 74%, respectively. These results by themselves already demonstrate a clear advantage of our proposed method, but recall that we expect a causal relationship between heat release and pressure simultaneously for each sample. The right panel of Fig. 7B reveals that in 88% of all considered cases, MCDTS-C-CCM is able to detect a mutual causal relationship, while Cao's method manages to do so in only 28% of the cases.
Furthermore, we try to validate the hypothesis of Godavarthi et al. [100] that heat release has a stronger effect on pressure than vice versa for most of the considered dynamics. The problem of measuring the strength of a causal relationship is twofold. First, the experiment considered here exhibits a number of different dynamics due to the continuously changing control parameter, whereas the hypothesis of an asymmetry in the strength of the interaction was made for stationary cases and the four dynamical regimes those authors investigated. Second, in describing the CCM method, Sugihara et al. [18] merely stated that in the case of a stronger causal effect of X on Y, cross-mapping X with v_y converges faster than the other way around. Thus, we would have to define what "faster" means with respect to experimental curves like the ones shown in Fig. 7A, which would mean introducing parameters on which the results would depend too strongly. Here, we pursue a simpler idea in order to quantify the strength of a causal interaction. For samples in which a causal relation has been detected in both directions, we compute the average of the pointwise difference of the CCM correlation coefficients, i.e., Δρ_CCM = ρ_CCM^{heat→pressure} − ρ_CCM^{pressure→heat}. When this difference is positive, we claim that heat release affects pressure more strongly in a causal sense than vice versa. Our analysis reveals that the proposed method is able to reflect the hypothesized stronger causal effect of the heat release on the pressure data. Figure 12 shows that for 29 of the 50 samples (58%) Δρ_CCM is indeed positive. Using Cao's method, we were able to derive such a result in only ~26% of all samples; in this case, however, only ~28% of the samples were found to be mutually causally related at all (cf. Fig. 7B). Within this group of mutually causally related samples, the assumed asymmetry is reflected very well (13 of 14 samples had a positive Δρ_CCM).
The proposed MCDTS reconstruction approach thus shows a clear advantage when used together with the CCM method. Not only is its general classification ability remarkable, but the MCDTS reconstructions also allow verification of an assumed asymmetric causal interaction, which remains largely out of reach for the classical time delay method.

Conclusions
A novel perspective on the embedding process has been proposed, in which the state space reconstruction from single time series can be treated as a game, where each move corresponds to an embedding cycle and is subject to evaluation through an objective function.
It is possible to model different embeddings, i.e., different choices of delay values and time series (if multivariate data are at hand) in the embedding cycles, in a tree-like structure. Consequently, our approach randomly samples this tree in order to find the global minimum of the chosen objective function. We leave it to practitioners which state space evaluation statistic, i.e., objective function, they use, since different research questions require different reconstruction approaches. There is also a free choice of a delay pre-selection method for each embedding cycle, e.g., using the minima of the auto-mutual information statistic. We recommend the combination of the continuity statistic of Pecora et al. [65] as a delay pre-selection method together with the L-statistic of Uzal et al. [29] as an objective function as a very good "all-rounder" for many research questions in nonlinear time series analysis, as already shown by Kraemer et al. [41] (PECUZAL algorithm). Since the sampling of the tree is a random procedure, the proposed idea only yields converging embedding parameters for a sufficient sampling size N_trial. In our numerical investigations, N_trial = 50 usually led to satisfying results for univariate cases and N_trial = 80 for multivariate embedding scenarios. Our proposed method initializes in a local minimum of the objective function, which is achieved by minimizing the objective function in each embedding cycle to the maximum extent. In practice, therefore, even setting N_trial too low would lead to similar, but never worse, results than the state-of-the-art methods. Moreover, the proposed method is not limited to delay pre-selection and objective functions that take into account certain physical constraints.
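A heavily simplified sketch of this tree sampling idea may look as follows. Everything here is an illustrative assumption: the function names, the toy objective standing in for a real cost like the L-statistic, and the candidate rule standing in for a delay pre-selection statistic; the actual implementation is the authors' Julia code base [103].

```python
import numpy as np

def mcdts_sketch(s, objective, delay_candidates, n_trial=50, max_cycles=5,
                 seed=0):
    """Sample n_trial random embedding paths (delay sequences); in each
    cycle a random candidate delay is tried and kept only if it lowers the
    objective; return the best path found over all trials."""
    rng = np.random.default_rng(seed)
    best_path, best_cost = [0], objective(s, [0])  # every path starts at delay 0
    for _ in range(n_trial):
        path, cost = [0], objective(s, [0])
        for _ in range(max_cycles):
            cands = delay_candidates(s, path)      # delay pre-selection step
            if not cands:
                break
            tau = cands[rng.integers(len(cands))]  # random move in the "game"
            new_cost = objective(s, path + [tau])
            if new_cost >= cost:                   # leaf reached: no improvement
                break
            path, cost = path + [tau], new_cost
        if cost < best_cost:
            best_path, best_cost = path, cost
    return best_path, best_cost

def embed(s, delays):
    dmax = max(delays)
    n = len(s) - dmax
    return np.column_stack([s[dmax - d : dmax - d + n] for d in delays])

def toy_objective(s, delays):
    """Illustrative stand-in for a real cost (e.g., the L-statistic):
    mean absolute correlation between components minus a size reward."""
    v = embed(s, delays)
    if v.shape[1] == 1:
        return 0.0
    c = np.corrcoef(v, rowvar=False)
    return np.abs(c[np.triu_indices_from(c, 1)]).mean() - 0.05 * v.shape[1]

def toy_candidates(s, path):
    """Illustrative stand-in for a delay pre-selection statistic."""
    return [d for d in range(1, 31) if d not in path]

s = np.sin(0.1 * np.arange(1000))
path, cost = mcdts_sketch(s, toy_objective, toy_candidates, n_trial=200)
print(path, round(cost, 3))
```

In this toy setting, the sketch recovers a delay near a quarter of the sine period, where the two components are nearly uncorrelated, mimicking how each embedding cycle greedily reduces the objective while the Monte Carlo trials explore alternative paths through the tree.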
It could also optimize the reconstruction vectors of the state space for research questions such as classification, where we could speak of a feature or latent space instead of the state or phase space notion associated with statistical physics. We exemplified the use of such a modular algorithm by combining different objective and delay pre-selection functions. Its superiority to classical time delay embedding methods has been demonstrated for a recurrence analysis of the Lorenz-96 system, for predictions of the x-time series of the chaotic Hénon map and of the δ13C CENOGRID record, as well as for studying causal interactions between variables in a combustion process.
With these applications, we showed the advantage MCDTS brings to any kind of method that utilizes an embedding, such as recurrence analysis, embedding-based predictions of time series, or causal analysis with convergent cross mapping. It thus has potential in many applications and disciplines, wherever such phase space-based approaches are used and an automatic phase space reconstruction is required. The latter is of increasing interest, e.g., for big data analysis, for analyses with high reliability requirements (e.g., in medical applications), and also for deep learning-based frameworks.
Acknowledgements R.I.S. acknowledges the funding from the Science and Engineering Research Board (SERB) of the Department of Science and Technology (grant no: CRG/2020/003051). I.P. acknowledges the research assistantship from the Ministry of Human Resource Development, India, and IIT Madras. All computations have been carried out in the Julia language and made use of the packages DynamicalSystems.jl [101] and DifferentialEquations.jl [102].
Funding Open Access funding enabled and organized by Projekt DEAL. This work has been financially supported by the German Research Foundation (DFG projects MA4759/8 and MA4759/9).

Data and Code Availability
The study that we present here is available as a fully reproducible code base [103]. The method itself will be available along with detailed documentation in the Julia language within the DynamicalSystems.jl [101] framework soon.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.

Table 2 Obtained embedding parameters for the different reconstruction methods. Time series index 1 in the third column corresponds to the detrended δ13C record and index 2 to the detrended δ18O record shown in Fig. 9. For a description of the reconstruction methods, see Table 1. The sequences of the delays (center column) and time series (right column) result from the embedding cycles passed through in the corresponding reconstruction methods, which is why they are not necessarily ordered. For a reconstruction based on these embedding parameters, it would make no difference whether delays and corresponding time series were sorted beforehand.

Fig. 12 Average pointwise difference of the CCM correlation coefficients for the direction heat release → pressure and vice versa for both underlying reconstruction approaches. For better visualization, these values are sorted separately for both methods. A positive value indicates that the heat release has a stronger causal influence on pressure than vice versa, which is in line with our expectation. Diamonds indicate cases where we could not deduce a causal relationship in both directions for one sample. As also shown in the right panel of Fig. 7B, MCDTS-C-CCM was able to correctly detect a mutual causal relationship in 88% of all considered samples (only 12% marked with blue diamonds in this figure), whereas with Cao's reconstruction approach we could only detect this in 28% of all cases (72% marked with gray diamonds).