Performance improvement via bagging in probabilistic prediction of chaotic time series using similarity of attractors and LOOCV predictable horizon

Recently, we have presented a method of probabilistic prediction of chaotic time series. The method employs learning machines involving strong learners capable of making predictions with desirably long predictable horizons; however, the usual ensemble mean for obtaining a representative prediction is not effective when some predictions have short predictable horizons. Thus, the method selects a representative prediction from the predictions generated by a number of learning machines involving strong learners as follows: first, it obtains plausible predictions whose attractors have large similarity with that of the training time series, and then it selects as representative the prediction with the largest predictable horizon estimated via LOOCV (leave-one-out cross-validation). The method is also capable of providing an average and/or safe estimate of the predictable horizon of the representative prediction. In our previous study, we used CAN2s (competitive associative nets) for learning piecewise linear approximations of nonlinear functions as strong learners; this paper employs bagging (bootstrap aggregating) to improve the performance, which enables us to analyze the validity and the effectiveness of the method.


1 Introduction
So far, a number of methods for time series prediction have been studied (cf. [1,2]), and our methods were awarded 3rd and 2nd place in the time series prediction competitions held at IJCNN'04 [3] and ESTSP'07 [4], respectively. Those methods used model selection methods evaluating the MSE (mean square prediction error) for holdout and/or cross-validation datasets. Recently, we have developed several model selection methods for chaotic time series prediction [5,6]. The method in [5] utilizes moments of predictive deviation as ensemble diversity measures for model selection in time series prediction and achieves better performance, from the point of view of the MSE, than the conventional holdout method. The method in [6] uses direct multistep-ahead (DMS) prediction to apply the out-of-bag (OOB) estimate of the MSE. Although both methods select models that generate good predictions on average, they have not always provided good predictions, especially when the horizon to be predicted is large. This is owing mainly to the fact that the MSE of a set of predictions is largely affected by a small number of predictions with short predictable horizons, even if most of the predictions have long predictable horizons, because the prediction error of a chaotic time series increases exponentially with time after the predictable horizon (see [6] for the analysis and [1] for properties of chaotic time series). Thus, instead of model selection methods employing an estimate of the MSE, we have developed a method of probabilistic prediction of chaotic time series [7].
As stated in [8], probabilistic prediction has come to dominate the science of weather and climate forecasting, mainly because the theory of chaos at the heart of meteorology shows that, for a simple set of nonlinear equations (such as Lorenz's equations shown below) with initial conditions changed by minute perturbations, there is no longer a single deterministic solution, and hence all forecasts must be treated as probabilistic. Although most of the methods shown in [8] use the ensemble mean for the representative forecast, our method in [7] (see below for details) selects an individual prediction from a set of plausible predictions as the representative. This is because our method employs learning machines involving strong learners capable of making predictions with small error for a desirably long duration, and the ensemble mean does not work when the set of predictions involves a prediction with a short predictable horizon, owing mainly to the exponential increase in prediction error of chaotic time series after the predictable horizon (see Sect. 3.2 for details). Thus, instead of using the ensemble mean, our method in [7] first selects plausible predictions by evaluating the similarity of attractors between the training and predicted time series, and then obtains the representative prediction by means of LOOCV (leave-one-out cross-validation), selecting the prediction with a longer estimated predictable horizon. Compared with our previous methods using the MSE for model selection [5,6], the method in [7] has the advantage that it can select the representative prediction from plausible predictions for each start time of prediction and provide an estimate of the predictable horizon. Furthermore, it has achieved long predictable horizons on average. However, there are several cases where the method selects a representative prediction with a short predictable horizon, although plausible predictions with longer predictable horizons exist.
To overcome this problem, this paper tries to improve the performance of the learning machines by using the bagging (bootstrap aggregating) method, and it shows an analysis of the LOOCV predictable horizon. Bagging is known to use the ensemble mean to reduce the variance of predictions by single learning machines, so we can expect the performance of time series prediction to become more stable and higher. Note that, in this paper, the bagging ensemble is employed for iterated one-step-ahead (IOS) prediction of the time series, while we deal with probabilistic prediction as an ensemble of longer-term predictions. Furthermore, we use the CAN2 (competitive associative net 2) as a learning machine (see [3] for details), where the CAN2 has been introduced for learning piecewise linear approximations of nonlinear functions, and its performance has been shown in the Evaluating Predictive Uncertainty Challenge [9], where our method was awarded first place in regression problems. The CAN2 has been used in our methods [3,4] for the time series prediction competitions mentioned above.
We show the present method of probabilistic prediction of chaotic time series in Sect. 2, experimental results and analysis in Sect. 3, and the conclusion in Sect. 4.
2 Probabilistic prediction of chaotic time series

2.1 IOS prediction of chaotic time series
Let $y_t \in \mathbb{R}$ denote a chaotic time series for discrete time $t = 0, 1, 2, \ldots$ satisfying

$y_t = r(x_t) + e(x_t),$

where $r(x_t)$ is a nonlinear target function of the vector $x_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-k})^T$ generated by $k$-dimensional delay embedding from a chaotic differential dynamical system (see [1] for the theory of chaotic time series). Here, $y_t$ is obtained not analytically but numerically, so it involves an error $e(x_t)$ owing to the finite precision of executable calculations. This indicates that there are a number of plausible target functions $r(x_t)$ with allowable error $e(x_t)$. Furthermore, in general, a time series generated with higher precision has small prediction error for a longer duration from the initial time of prediction. Thus, we regard the time series generated with high (128-bit) precision (see Sect. 3 for details) as the ground truth time series $y^{[\mathrm{gt}]}_t$, while we examine predictions generated with standard 64-bit precision.
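As a concrete illustration, the delay-embedding construction above can be sketched as follows (a minimal sketch in Python; the function name `delay_embed` is ours, not from the paper):

```python
import numpy as np

def delay_embed(y, k):
    """Build the dataset {(x_t, y_t)}: each input is the delay vector
    x_t = (y_{t-1}, y_{t-2}, ..., y_{t-k}) and the target is y_t."""
    X = np.array([y[t - k:t][::-1] for t in range(k, len(y))])
    d = np.asarray(y[k:], dtype=float)
    return X, d

y = [0.0, 1.0, 2.0, 3.0, 4.0]
X, d = delay_embed(y, 2)
# X[0] = (y_1, y_0) = (1.0, 0.0) with target d[0] = y_2 = 2.0
```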
Let $y_{t:h} = y_t y_{t+1} \cdots y_{t+h-1}$ denote a time series with initial time $t$ and horizon $h$. For a given training time series $y_{t_g:h_g}$ $(= y^{[\mathrm{train}]}_{t_g:h_g})$, we are supposed to predict a succeeding time series $y_{t_p:h_p}$ for $t_p \ge t_g + h_g$. We then make the training dataset $D^{[\mathrm{train}]} = \{(x_t, y_t) \mid t \in I^{[\mathrm{train}]}\}$ with $I^{[\mathrm{train}]} = \{t \mid t_g + k \le t < t_g + h_g\}$ to train a learning machine. After the learning, the machine executes the IOS prediction

$\hat{y}_t = f(x_t)$

for $t = t_p, t_p + 1, \ldots$ recursively, where $f(x_t)$ denotes the prediction function of $x_t = (x_{t1}, x_{t2}, \ldots, x_{tk})$ whose elements are given by $x_{tj} = y_{t-j}$ for $t - j < t_p$ and $x_{tj} = \hat{y}_{t-j}$ for $t - j \ge t_p$. Here, we suppose that $y_t$ for $t < t_p$ is known as the initial state for making the prediction $\hat{y}_{t_p:h_p}$. As explained above, we execute the prediction with standard 64-bit precision, so by using strong learning machines we may say that there are a number of plausible prediction functions $f(x_t)$ with small error for a duration of time from the initial time of prediction.
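The recursive IOS scheme can be sketched as follows (Python; `f` stands in for any trained one-step predictor, and the buffer handling is an illustrative assumption, not the paper's code):

```python
import numpy as np

def ios_predict(f, y_init, k, h):
    """Iterated one-step-ahead (IOS) prediction: known values feed the
    first inputs, then each prediction is fed back recursively."""
    buf = list(y_init)                   # y_{t_p-k}, ..., y_{t_p-1}
    pred = []
    for _ in range(h):
        x = np.array(buf[-1:-k - 1:-1])  # x_t = (y_{t-1}, ..., y_{t-k})
        y_hat = float(f(x))
        pred.append(y_hat)
        buf.append(y_hat)                # prediction becomes a future input
    return pred

# With the trivial predictor f(x) = x[0], the last known value persists:
print(ios_predict(lambda x: x[0], [1.0, 2.0], k=2, h=3))  # [2.0, 2.0, 2.0]
```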

2.2 Single CAN2 and the bagging for IOS prediction
We use the CAN2 as a learning machine. A single CAN2 has $N$ units. The $j$th unit has a weight vector $w_j \triangleq (w_{j1}, \ldots, w_{jk})^T \in \mathbb{R}^{k \times 1}$ and an associative matrix (a row vector) $M_j \triangleq (M_{j0}, M_{j1}, \ldots, M_{jk}) \in \mathbb{R}^{1 \times (k+1)}$ for $j \in I^N \triangleq \{1, 2, \ldots, N\}$. After learning the training dataset $D^{[\mathrm{train}]} = \{(x_t, y_t) \mid t \in I^{[\mathrm{train}]}\}$, the CAN2 approximates the target function $r(x_t)$ by

$\hat{y}_t = \tilde{y}_{c(t)} = M_{c(t)} \tilde{x}_t,$

where $\tilde{x}_t \triangleq (1, x_t^T)^T \in \mathbb{R}^{(k+1) \times 1}$ denotes the (extended) input vector to the CAN2, and $\tilde{y}_{c(t)} = M_{c(t)} \tilde{x}_t$ is the output value of the $c(t)$th unit. The index $c(t)$ indicates the unit whose weight vector $w_{c(t)}$ is closest to the input vector $x_t$, or

$c(t) \triangleq \operatorname{argmin}_{j \in I^N} \| x_t - w_j \|.$

Note that the above prediction performs a piecewise linear approximation of $y = r(x)$, with $N$ indicating the number of piecewise linear regions. We use the learning algorithm shown in [10], whose high performance in regression problems has been demonstrated in the Evaluating Predictive Uncertainty Challenge [9].
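The prediction step of a trained CAN2 amounts to a nearest-unit lookup followed by a local affine map; a minimal sketch (inference only, the learning algorithm of [10] is not reproduced here, and the toy weights below are assumptions for illustration):

```python
import numpy as np

def can2_predict(x, W, M):
    """CAN2 inference: pick the unit c whose weight vector is closest to x,
    then output the affine map M[c] applied to the extended input (1, x)."""
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winner unit c(t)
    x_ext = np.concatenate(([1.0], x))                  # extended input vector
    return float(M[c] @ x_ext)

# Two units covering two linear regions of a 1-D input space:
W = np.array([[0.0], [10.0]])            # weight vectors w_j
M = np.array([[0.0, 1.0], [5.0, 2.0]])   # associative matrices M_j
print(can2_predict(np.array([1.0]), W, M))  # region of unit 0: 0 + 1*1 = 1.0
print(can2_predict(np.array([9.0]), W, M))  # region of unit 1: 5 + 2*9 = 23.0
```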
We obtain the bagging prediction by means of a number of single CAN2s as follows (see [11,12] for details). Let $D^{[n\alpha],j} = \{(x_t, y_t) \mid t \in I^{[n\alpha],j}\}$ be the $j$th bag (multiset, or bootstrap sample set) involving $n\alpha$ elements, where the elements in $D^{[n\alpha],j}$ are resampled randomly with replacement from the training dataset $D^{[\mathrm{train}]}$ involving $n = |D^{[\mathrm{train}]}|$ elements. Here, $\alpha\,(>0)$ indicates the bag size ratio to the given dataset, and $j \in J^{[\mathrm{bag}]} \triangleq \{1, 2, \ldots, b\}$. Note that $\alpha = 1$ is used in many applications (see [12,13]), which we also use in the experiments shown below after tuning $\alpha$ (see [12] for the validity and effectiveness of using a variable $\alpha$). Using multiple CAN2s employing $N$ units after learning the bags, the bagging prediction is obtained as the ensemble mean

$\hat{y}^{[\mathrm{bag}]}_t = \big\langle \hat{y}^{[j]}_t \big\rangle_{j \in J^{[\mathrm{bag}]}},$

where $\hat{y}^{[j]}_t \triangleq \hat{y}^{[j]}(x_t)$ denotes the prediction by the $j$th machine $\theta^{[j]}_N$. The angle brackets $\langle \cdot \rangle$ indicate the mean, and the subscript $j \in J^{[\mathrm{bag}]}$ indicates the range of the mean. For simplicity, we sometimes use $\langle \cdot \rangle_j$ instead of $\langle \cdot \rangle_{j \in J^{[\mathrm{bag}]}}$ in the following.
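A minimal sketch of the bag construction and ensemble mean (Python; `alpha` is the bag size ratio $\alpha$, `b` the number of bags, and the trained machines are stood in for by simple functions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bags(n, b, alpha=1.0):
    """Draw b bags: each holds floor(n*alpha) indices resampled
    with replacement from the n training examples."""
    return [rng.integers(0, n, size=int(n * alpha)) for _ in range(b)]

def bagging_predict(machines, x):
    """Bagging prediction: the ensemble mean over the b machines."""
    return float(np.mean([f(x) for f in machines]))

bags = make_bags(n=100, b=20, alpha=1.0)  # 20 bootstrap index sets of size 100
# Ensemble mean of two toy 'machines':
print(bagging_predict([lambda x: 1.0, lambda x: 3.0], x=None))  # 2.0
```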

2.3 Probabilistic prediction and estimation
The method selects plausible predictions from the predictions $y^{[\theta_N]}_{t_p:h_p}$ generated by the learning machines $\theta_N$: a prediction is regarded as plausible when

$S\big(a^{[\theta_N]}, a^{[\mathrm{train}]}\big) \ge S_{\mathrm{th}},$

where $S(\cdot, \cdot)$ denotes the similarity of the two-dimensional attractor (trajectory) distributions $a^{[\theta_N]}_{ij}$ and $a^{[\mathrm{train}]}_{ij}$ of the time series $y^{[\theta_N]}_{t_p:h_p}$ and $y^{[\mathrm{train}]}_{t_g:h_g}$, respectively, and $S_{\mathrm{th}}$ is a threshold. Here, the two-dimensional attractor distribution $a_{ij}$ of a time series $y_{t:h}$ indicates whether the trajectory visits the cell $(i, j)$, namely

$a_{ij} = \max_{t \le s < t+h-1} 1\{\, i = \lfloor (y_s - v_0)/\Delta_a \rfloor \,\wedge\, j = \lfloor (y_{s+1} - v_0)/\Delta_a \rfloor \,\},$

where $v_0$ is a constant less than the minimum value of $y_t$ for all time series and $\Delta_a$ indicates the resolution of the distribution. Furthermore, $1\{z\}$ is the indicator function equal to 1 if $z$ is true and 0 if $z$ is false, and $\lfloor \cdot \rfloor$ indicates the floor function.
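The attractor comparison can be sketched as follows; note that the overlap-ratio similarity used here is one plausible instantiation of $S$, not necessarily the exact measure of the paper:

```python
def attractor_cells(y, v0, delta):
    """Set of 2-D cells (i, j) visited by the successive pairs
    (y_s, y_{s+1}) of the time series, at resolution delta."""
    return {(int((y[s] - v0) // delta), int((y[s + 1] - v0) // delta))
            for s in range(len(y) - 1)}

def similarity(a, b):
    """Overlap ratio of two visited-cell sets (an assumed, Jaccard-style
    instantiation of the attractor similarity S)."""
    return len(a & b) / len(a | b)

a1 = attractor_cells([0.0, 1.0, 0.0, 1.0], v0=-1.0, delta=1.0)
a2 = attractor_cells([0.0, 2.0, 0.0, 2.0], v0=-1.0, delta=1.0)
print(similarity(a1, a1))  # 1.0
print(similarity(a1, a2))  # 0.0 (disjoint trajectories)
```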

2.4 LOOCV measure to estimate predictable horizons
Let us define the predictable horizon between two predictions $y^{[\theta_N]}_{t_p:h_p}$ and $y^{[\theta'_N]}_{t_p:h_p}$ by

$h\big(y^{[\theta_N]}_{t_p:h_p}, y^{[\theta'_N]}_{t_p:h_p}\big) = \max\big\{ h \;\big|\; \forall s \le h : \big| y^{[\theta_N]}_{t_p+s} - y^{[\theta'_N]}_{t_p+s} \big| \le e_y \big\},$

where $e_y$ indicates the threshold of prediction error to determine the horizon. Then, we employ the LOOCV method to estimate the predictable horizon of $y^{[\theta_N]}_{t_p:h_p}$ as the mean of its predictable horizons with respect to the other plausible predictions:

$h^{[\mathrm{loocv}]}\big(y^{[\theta_N]}_{t_p:h_p}\big) = \big\langle h\big(y^{[\theta_N]}_{t_p:h_p}, y^{[\theta'_N]}_{t_p:h_p}\big) \big\rangle_{\theta'_N \ne \theta_N},$

and the representative prediction is the plausible prediction with the largest $h^{[\mathrm{loocv}]}$. Now, we derive the probability of the prediction $y_t$ for $t_p \le t < t_p + h_p$ as the distribution of the plausible predictions at each time $t$.
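The LOOCV selection can be sketched as follows (Python; the function names are ours, and the toy predictions are assumptions for illustration):

```python
import numpy as np

def predictable_horizon(y1, y2, e_y):
    """Number of steps until |y1 - y2| first exceeds the threshold e_y."""
    diff = np.abs(np.asarray(y1, float) - np.asarray(y2, float))
    over = np.nonzero(diff > e_y)[0]
    return len(diff) if len(over) == 0 else int(over[0])

def loocv_horizons(preds, e_y):
    """LOOCV estimate: for each prediction, the mean predictable horizon
    against all the other plausible predictions (leave-one-out)."""
    b = len(preds)
    return [float(np.mean([predictable_horizon(preds[i], preds[j], e_y)
                           for j in range(b) if j != i])) for i in range(b)]

preds = [[0, 0, 0, 0], [0, 0, 5, 5], [0, 0, 0, 1]]
h_loocv = loocv_horizons(preds, e_y=1.0)
print(h_loocv)                  # [3.0, 2.0, 3.0]
print(int(np.argmax(h_loocv)))  # index of the representative prediction
```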

3 Experimental results and analysis

3.1 Experimental settings
We use the Lorenz time series, shown in Fig. 1 and [6], obtained from the original differential dynamical system

$\dfrac{dx_c}{dt_c} = \sigma (y_c - x_c), \qquad \dfrac{dy_c}{dt_c} = x_c (r - z_c) - y_c, \qquad \dfrac{dz_c}{dt_c} = x_c y_c - b z_c,$

for $\sigma = 10$, $r = 28$ and $b = 8/3$. Here, we use $t_c$ for continuous time and $t\,(= 0, 1, 2, \ldots)$ for discrete time, related by $t_c = tT$ with the sampling time (or the embedding delay) $T = 25$ ms. We have generated the time series $y^{[\mathrm{gt}]}_t = x_c(tT)$ for $t = 1, 2, \ldots, 5000$ from the initial state $(x_c(0), y_c(0), z_c(0)) = (-8, 8, 27)$ via the fourth-order Runge-Kutta method with step size $\Delta t = 10^{-4}$ and 128-bit precision of GMP (GNU multiple precision arithmetic library).
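The generation of $y_t = x_c(tT)$ can be sketched as follows (standard 64-bit floats here, not the 128-bit GMP precision of the paper, so this sketch reproduces the procedure rather than the ground truth itself):

```python
import numpy as np

SIGMA, R, B = 10.0, 28.0, 8.0 / 3.0   # sigma = 10, r = 28, b = 8/3

def lorenz(s):
    """Right-hand side of the Lorenz equations."""
    x, y, z = s
    return np.array([SIGMA * (y - x), x * (R - z) - y, x * y - B * z])

def rk4_step(s, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = lorenz(s)
    k2 = lorenz(s + 0.5 * dt * k1)
    k3 = lorenz(s + 0.5 * dt * k2)
    k4 = lorenz(s + dt * k3)
    return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def sample_series(s0, n, T=0.025, dt=1e-4):
    """Sample y_t = x_c(tT): integrate with step dt, record x every T/dt steps."""
    s, ys = np.array(s0, dtype=float), []
    for _ in range(n):
        for _ in range(round(T / dt)):
            s = rk4_step(s, dt)
        ys.append(s[0])
    return np.array(ys)

ys = sample_series((-8.0, 8.0, 27.0), n=5)  # first 5 samples of the series
```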
Using the training time series $y^{[\mathrm{train}]}_{t_g:h_g}$, we train single and bagging CAN2s with the number of units $N = 2^i$ $(i = \ldots, 14)$. After the training, we execute the IOS prediction $\hat{y}_t = f^{[\theta_N]}(x_t)$ for $t = t_p, t_p + 1, \ldots$ with the initial input vector $x_{t_p} = (y^{[\mathrm{gt}]}_{t_p-1}, \ldots, y^{[\mathrm{gt}]}_{t_p-k})$ for prediction start times $t_p \in T_p = \{2000 + 100i \mid i = 0, 1, 2, \ldots, 29\}$ and prediction horizon $h_p = 500$. We show experimental results for the embedding dimension $k = 10$ and the threshold in (8) $e_y = 10$ (see [7] for the result with $k = 8$, which is slightly but not significantly different). The ground truth $y^{[\mathrm{gt}]}_t$ generated with $\Delta t = 10^{-4}$ and 128-bit precision is considered to be accurate for 230 steps on average, because we have observed that the predictable horizon of two time series generated by the Runge-Kutta method with step sizes $\Delta t = 10^{-n}$ and $10^{-(n+1)}$ for $n = 3, 4, 5, 6, 7$ increases monotonically with the decrease in step size, or the increase in $n$.
Note that we have executed several experiments using the parameter $\theta = (N, k)$ for $k = 6$, 8, 10, 12 and so on, and we have not found any critically different results, although we would like to execute and show a comparative study in future research.

3.2 Results and analysis
First, we show an example of all predictions $y^{[\theta_N]}_{t_p:h_p}$ for $t_p = 2300$ in Fig. 2a (see also Fig. 3a).
In Fig. 2b, we can see that at $t = 2799$ the single CAN2s have a larger number of predictions with similarity $S$ smaller than $S_{\mathrm{th}} = 0.8$ than the bagging CAN2s, and those predictions are not selected as plausible predictions. A detailed analysis of the similarity is shown below.

The mean predictable horizon of the Runge-Kutta method with step size $\Delta t = 5 \times 10^{-4}$ (in steps of the sampling time $T = 25$ ms) is almost the same as the mean of the predictable horizons achieved by the single and bagging CAN2s, being 170 and 175 steps, respectively. This indicates that single and bagging CAN2s, after learning the training data generated via the Runge-Kutta method with step size $\Delta t = 10^{-4}$, have almost the same prediction performance as the Runge-Kutta method with $\Delta t = 5 \times 10^{-4}$. Although we have no general measure to evaluate time series prediction so far, the above evaluation using the step size of the Runge-Kutta method and the mean predictable horizon seems reasonable.

In Fig. 3a, we can see that the stability of prediction by the single CAN2 is improved by the bagging CAN2, from the point of view that the former has four actual predictable horizons that are much shorter than the neighboring (w.r.t. $t_p$) horizons. In Fig. 5a, these cases correspond to negative correlations between the LOOCV and the actual predictable horizons. This correspondence seems reasonable because a negative correlation does not contribute to the selection of the prediction with a large predictable horizon. Thus, we have to remove the cases of negative correlation. So far, we have two approaches: one is to improve the performance of the learning machines much more, as we have done with the bagging method in this paper, and the other is to refine the selection method by modifying the LOOCV predictable horizon or by developing new methods. Actually, we have obtained predictions with much longer predictable horizons not shown in this paper, but so far we cannot select such predictions without knowing the ground truth time series.

4 Conclusion
We have presented a performance improvement in the method for probabilistic prediction of chaotic time series by means of bagging learning machines. The method obtains a set of plausible predictions by means of the similarity of attractors between the training and predicted time series, and then provides the representative prediction which has the longest LOOCV predictable horizon. Through numerical experiments using single and bagging CAN2s, we have shown that the bagging CAN2 improves the performance of the single CAN2, and we have analyzed the relationship between the LOOCV and the actual predictable horizons. In future research, we would like to overcome the problem of the negative correlation between the achieved predictable horizon and the LOOCV predictable horizon, or to refine the measure for selecting the representative prediction.

Compliance with ethical standards
Conflict of interest The authors declare no conflicts of interest associated with this article.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.