Adaptive kernel approach to the time series prediction
DOI: 10.1007/s10044-010-0189-3
- Cite this article as:
- Michalak, M. Pattern Anal Applic (2011) 14: 283. doi:10.1007/s10044-010-0189-3
Abstract
This short article describes two kernel algorithms of the regression function estimation. One of them is called HASKE and has its own heuristic of the h parameter evaluation. The second is a hybrid algorithm that connects the SVM and HASKE in such a way that the definition of the local neighborhood is based on the definition of the h-neighborhood from HASKE. Both of them are used as predictors for time series.
Keywords
Time series prediction, Support vector machine, Kernel estimators, Non-parametric regression
1 Introduction
Estimation of the regression function is one of the basic problems dealt with by the discipline called machine learning [22, 26]. The aim of evaluating the regression function is to find dependencies between variables in the observed dataset. Sometimes these relations are overt, like the dependence between current intensity and voltage in a linear element (Ohm's law), but in many cases the dependence is hidden or difficult to notice.
Very common non-parametric regression function estimators are spline functions [4, 9], radial basis functions [13], additive (and generalized additive) models [12], the LOWESS algorithm [3] or kernel estimators [17, 28] with support vector machines [1].
A specific kind of data, the time series, can be analyzed with typical methods such as autoregressive models, the decomposition method or Fourier analysis. As will be shown in this article, kernel methods can also be useful as a time series prediction tool, although some standard parts of the algorithms should be changed (the method of evaluating the smoothing parameter h [16]).
In this article, the author describes two new kernel methods of time series prediction, both based on the estimation of a certain regression function. The first method, the kernel estimator HASKE, is described on the basis of the Nadaraya–Watson kernel estimator, but it can also be based on other estimators from this group. This estimator applies a new adaptive method of smoothing parameter evaluation, using the definition of the h-neighborhood. The new method avoids the large estimation error that arises when there are no training objects in the neighborhood of the test object. The algorithm is specially designed for time series with a visible periodic dependence between past and present values, which is often determined by the nature of the time series. The results of the HASKE algorithm are compared with those of other kernel estimators and the well-known decomposition method.
The second kernel predictor, called the HKSVR, is a hybrid algorithm that connects the mentioned group of kernel estimators and the support vector machine. It combines the adaptive definition of the test point neighborhood from HASKE with the advantages of support vector machine regression, but should not be considered an extension of HASKE. The HKSVR is especially designed for time series for which we cannot point to an easily interpretable dependence between past and present values. As the HKSVR can be classified as a local support vector regression, its results are compared with those of the support vector machine.
Both algorithms were presented in detail at the CORES conference [16].
2 Non-parametric estimators of the regression function
2.1 Kernel estimators
The kernel function K should satisfy the following conditions:
- 1. \({\int_{\mathbb{R}} K(u)\, {\rm d}u = 1 }\)
- 2. \({\forall x\in {\mathbb{R}} K(x) = K(-x)}\)
- 3. \({\int_{\mathbb{R}} uK(u)\,{\rm d}u = 0 }\)
- 4. \({\forall x\in {\mathbb{R}} K(0) \geq K(x)}\)
- 5. \({\int_{\mathbb{R}} u^2K(u)\,{\rm d}u < \infty }\)
Popular kernel functions
Kernel | K(x) |
---|---|
Uniform | \(K(x) = {\frac{1}{2}}\,I(-1 < x < 1) \) |
Triangular | \(K(x) = (1 - \lvert x \rvert)\,I(-1 < x < 1) \) |
Biweight | \(K(x) = {\frac{15}{16}}(1 - x^2)^2\,I(-1 < x < 1) \) |
Gaussian | \(K(x) = {\frac{1}{\sqrt{2\pi}}}\exp\left(-x^2/2\right)\) |
The second step in creating the kernel estimator is the selection of the smoothing parameter h. As described in [20] and [25], the selection of h is more important than the selection of the kernel function. Small values of h cause the estimator to overfit the data. Large values of h lead the estimator to oversmooth the dependencies in the analyzed set.
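The kernels listed above can be written down and checked numerically. The following is a minimal sketch, not the author's implementation: the helper `integrate` and the dictionary `KERNELS` are illustrative names, and the biweight kernel is taken in its standard form with the squared factor \((1 - x^2)^2\), which is the form that integrates to one.

```python
import math

def uniform(x):
    return 0.5 if -1 < x < 1 else 0.0

def triangular(x):
    return 1 - abs(x) if -1 < x < 1 else 0.0

def biweight(x):
    # Standard biweight (quartic) kernel; the square makes it integrate to 1.
    return 15 / 16 * (1 - x * x) ** 2 if -1 < x < 1 else 0.0

def gaussian(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(f, lo=-8.0, hi=8.0, steps=16_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

KERNELS = {"uniform": uniform, "triangular": triangular,
           "biweight": biweight, "gaussian": gaussian}

for name, kernel in KERNELS.items():
    # Condition 1 (unit integral) and condition 2 (symmetry), checked numerically.
    print(name, round(integrate(kernel), 3), kernel(0.3) == kernel(-0.3))
```

In an estimator the kernel is evaluated at \((x - x_i)/h\), so h scales the width of the window: the three compactly supported kernels ignore every training point farther than h from x.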
2.2 Support vector machines
Support vector machines (SVM) were defined in [1] and later in [18, 26].
Note that not all of the training vectors take part in the evaluation of the regression value. Only the vectors x_{i} whose Lagrange parameters satisfy \(\alpha_i - \alpha_i^* >0\) influence the result, and these vectors are called support vectors.
There is also a non-linear version of SVM, in which the scalar product of two vectors is replaced by a function that computes the corresponding product in a higher-dimensional space. Detailed calculations can be found in [21].
This model of the support vector regression (SVR) is the global one, but there are also a number of its local modifications [8, 14].
3 The modification of the data space
Typical time series data can be described as the set of pairs (t, x_{t}), where t is the time variable and x is the observed variable. For the purpose of kernel time series prediction a small transformation of the data space must be performed. Let us assume that there is a parameter \({p_m \in \mathbb{N}}\) defining the maximal prediction horizon that is our interest. Then, the original set of pairs (t, x_{t}) is transformed to the set of pairs \((x_t, x_{t+p_m})\). This transformation decreases the number of pairs from n in the original dataset to n − p_{m} in the transformed data.
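The transformation can be illustrated with a few lines of Python. This is a sketch under the assumption that the series is stored as a plain list ordered in time; the function name `to_regression_pairs` and the toy values are illustrative only.

```python
def to_regression_pairs(series, p_m):
    """Turn x_1..x_n into the pairs (x_t, x_{t+p_m}); n - p_m pairs result."""
    if not 0 < p_m < len(series):
        raise ValueError("p_m must be positive and shorter than the series")
    return [(series[t], series[t + p_m]) for t in range(len(series) - p_m)]

series = [112, 118, 132, 129, 121, 135, 148]   # toy values
pairs = to_regression_pairs(series, p_m=3)
print(pairs)                      # [(112, 129), (118, 121), (132, 135), (129, 148)]
print(len(pairs), len(series) - 3)  # both are 4
```

The first element of each pair plays the role of the independent variable and the second the role of the dependent variable in the regression task.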
The G series describes the increase in the number of American airlines passengers per month (in thousands) between 1949 and 1960. From its nature it is intuitive to set p_{m} equal to 12.
The transformation of the time series modifies the task of time series prediction and leads to the estimation of the regression function in the modified space. The prediction of the value of the time series at time t (x_{t}) is equivalent to evaluating the regression function for the argument \(x_{t-p_m}\), that is, \(\widetilde{f}(x_{t-p_m}).\)
It is typical that time series are not divided into train and test subsets. The predictor is trained on the basis of the historical data and verified on the basis of the present data. If the maximal interesting prediction horizon is p_{m} and the historical data are from x_{1} to x_{k} it is suitable to verify the prediction model in the following way: observations x_{1} to \(x_{k - p_m}\) become the train set and the rest becomes the test set.
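The verification scheme can be sketched as follows. The function name `split_for_verification` is hypothetical, and the sketch assumes the history is long enough (at least 2p_{m} + 1 observations) for every test argument \(x_{t-p_m}\) to fall inside the train part.

```python
def split_for_verification(series, p_m):
    """Return (train_pairs, test_pairs) for the horizon p_m.

    train_pairs are built only from x_1..x_{k-p_m};
    test_pairs are (x_{t-p_m}, x_t) for the last p_m observations.
    """
    k = len(series)
    if k < 2 * p_m + 1:
        raise ValueError("series too short for this horizon")
    train = series[:k - p_m]
    train_pairs = [(train[i], train[i + p_m]) for i in range(len(train) - p_m)]
    test_pairs = [(series[t - p_m], series[t]) for t in range(k - p_m, k)]
    return train_pairs, test_pairs

train_pairs, test_pairs = split_for_verification(list(range(1, 11)), p_m=3)
print(train_pairs)  # [(1, 4), (2, 5), (3, 6), (4, 7)]
print(test_pairs)   # [(5, 8), (6, 9), (7, 10)]
```

No value from the last p_{m} observations appears in the train pairs, so the test values stay unseen during training.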
4 Problems with the typical kernel prediction
As mentioned in Sect. 2.1, kernel estimators need two significant elements: the kernel function and the smoothing parameter. The choice of the kernel function is not as significant as the choice of the smoothing parameter. The simplest formula for the optimal h value is Eq. 12, which we need to evaluate on the basis of the train set. In this case we do not need the tune set, so we can treat the union of the train and tune sets as the train set.
The result of 12 values predicted by Nadaraya–Watson
t | x_{t} | \(\widetilde{f}_{NW}(x_{t-12})\) | absolute error |
---|---|---|---|
I 60 | 417 | 389 | 28 |
II 60 | 391 | 377 | 14 |
III 60 | 419 | 448 | 29 |
IV 60 | 461 | 442 | 19 |
V 60 | 472 | 451 | 21 |
VI 60 | 535 | 514 | 21 |
VII 60 | 622 | 0 | 622 |
VIII 60 | 606 | 0 | 606 |
IX 60 | 508 | 501 | 7 |
X 60 | 461 | 449 | 12 |
XI 60 | 390 | 390 | 0 |
XII 60 | 432 | 447 | 15 |
The zero predictions (VII 60 and VIII 60) occur when the denominator of Eq. 3 equals zero, which makes the division impossible. It also means that there is no pair of train and test values for which \(K(x_{\rm train}, x_{\rm test}) > 0.\) This situation is easier to describe if we define the notion of h-neighborhood.
From this point of view we can say that the large prediction error is caused by the empty h-neighborhood of some test points. It may seem correct to widen the support of the kernel function by increasing the value of the smoothing parameter. On the other hand, we know that increasing this value may cause another unwanted effect: oversmoothing. The algorithm that gives a compromise between a non-empty h-neighborhood and oversmoothing is the HASKE algorithm, described in Sect. 5.
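The effect can be reproduced with a small sketch of the Nadaraya–Watson estimator in its standard form, \(\sum_i y_i K((x - x_i)/h) / \sum_i K((x - x_i)/h)\). Several things here are assumptions made for illustration: the h-neighborhood is taken to be the set of training arguments closer than h to the test argument, a triangular kernel is used, and an empty neighborhood is reported as 0.0 to mirror the zeros in the table.

```python
def triangular(u):
    return 1 - abs(u) if -1 < u < 1 else 0.0

def h_neighborhood(x, train_x, h):
    """Training arguments that receive a positive kernel weight (assumed definition)."""
    return [xi for xi in train_x if abs(x - xi) < h]

def nadaraya_watson(x, pairs, h, kernel=triangular):
    """Kernel-weighted average of y_i; 0.0 when every weight is zero."""
    weights = [kernel((x - xi) / h) for xi, _ in pairs]
    denom = sum(weights)
    if denom == 0:
        return 0.0  # empty h-neighborhood: the division cannot be performed
    return sum(w * yi for w, (_, yi) in zip(weights, pairs)) / denom

pairs = [(100, 110), (105, 118), (110, 121)]       # toy (x_t, x_{t+p_m}) pairs
print(h_neighborhood(150, [x for x, _ in pairs], h=10))  # [] (nothing within 10 of 150)
print(nadaraya_watson(150, pairs, h=10))           # 0.0
print(nadaraya_watson(150, pairs, h=50))           # close to 120, the neighborhood is no longer empty
```

Increasing h from 10 to 50 rescues the prediction for x = 150, but the same wide window would also be applied to every other test point, which is the oversmoothing risk mentioned above.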
5 HASKE algorithm
5.1 Background
Performing time series prediction as the kernel estimation of the regression function may run into the problem of an empty h-neighborhood for test objects. It turns out that typical algorithms of smoothing parameter evaluation fail with respect to time series prediction. Results of experiments (some of them are shown in Table 2) suggest modifying the value of the smoothing parameter in such a way that the h-neighborhood of every test object is non-empty.
The Heuristic Adaptive Smoothing Parameter Kernel Estimator Algorithm (HASKE) solves this problem. The solution is a set of two parameters, μ and α. Each of them depends on the given time series, but only the first is connected with the problem of the empty h-neighborhood. For the given time series, its training part is divided into two separate subsets: train and tune. Then the Nadaraya–Watson kernel estimator is trained for successive values of μ, and the error of the tune set prediction is observed. The value that gives the lowest prediction error on the tune set is chosen as the optimal value of the μ parameter.
HASKE, like the other kernel estimators mentioned, is not defined for the typical time series space (pairs of time stamp and observation: (t, x_{t})), but requires a transformation to a new space with clearly defined dependent and independent variables. The typical form of the transformation is the definition of a time interval \(\Updelta t\), which implies the new set of pairs of observations \((x_t, x_{t+\Updelta t})\). This requires setting the \(\Updelta t\) value, which is usually the prediction horizon or the length of the strongest period of the time series. HASKE should not be applied when \(\Updelta t\) is hard to determine or when the correlation between the dependent and independent variables is low.
5.2 Definition
It is very important to ask the question: how far shall we increase the μ parameter? If we do it in an arbitrary way, the method may do more harm than good. The value should be data dependent, so we should evaluate it adaptively. That is why the train set was divided into a smaller train set and the tune set. Figure 2 shows how the tune set prediction error changes as μ changes.
To make the μ value independent of the phase of the time series period, it is evaluated as the median of the μ_{i} values, evaluated for every phase of the series period. The phase is understood by the author as follows: let us consider the time series determined by the values from t_{0} to t_{k} and the prediction horizon p_{m}. For the assumed prediction horizon p_{m} it is possible to define M = p_{m} phases, with indices \(ph = 0, 1, \ldots, M-1,\) as experiments that consist of predicting p_{m} values on the basis of the values from t_{0} to t_{k−ph}.
The improvement of prediction obtained with the usage of μ and α in HASKE
t | x_{t} | NW(x_{t−12}) | \({\rm HASKE}_\mu(x_{t-12})\) | \({\rm HASKE}_{\mu, \alpha}(x_{t-12})\) |
---|---|---|---|---|
I 60 | 417 | 389 | 382 | 411 |
II 60 | 391 | 377 | 367 | 395 |
III 60 | 419 | 448 | 421 | 453 |
IV 60 | 461 | 442 | 412 | 444 |
V 60 | 472 | 451 | 437 | 471 |
VI 60 | 535 | 514 | 493 | 531 |
VII 60 | 622 | 0 | 553 | 595 |
VIII 60 | 606 | 0 | 555 | 598 |
IX 60 | 508 | 501 | 487 | 524 |
X 60 | 461 | 448 | 421 | 454 |
XI 60 | 390 | 390 | 383 | 412 |
XII 60 | 432 | 447 | 420 | 452 |
RMSE | | 275.26 | 37.39 | 17.18 |
- 1. Define the maximal interesting prediction horizon p_{m}.
- 2. Transform time series from the (t, x_{t}) space to the \((x_k, x_{k+p_m})\) space.
- 3. Split the obtained set of pairs into the train and tune set. Last p_{m} pairs of the initial set become the tune set, the rest remain in the train set.
- 4. Define the maximum value of μ (μ_{max}) and the step of the μ increase (\(\Updelta\mu\)). Then, for each phase of the prediction horizon \({\rm ph} = 0, 1, 2, \ldots, p_m-1\) do the following:
  - (a) For i = 1 to \(i = {\frac{\mu_{max} - 1}{\Updelta\mu}}\) observe the prediction error on the phase tune set rmse_{ph}.
  - (b) Select the minimal value of the phase prediction error rmse_{ph}. The argument μ_{ph} is the argument of the minimal rmse_{ph} value.
- 5. The median of \(\mu_{\rm ph}, {\rm ph} = 0, 1, \ldots, p_m-1\) values becomes the μ value.
- 6. Find underestimations of all tune objects, as the result of the prediction with the usage of μ value and take median of them as the α value.
- 7. Perform the HASKE prediction as in Eq. 22.
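The steps above can be sketched in Python. This is an illustration under stated assumptions rather than the author's implementation, because the defining equations are not reproduced in this text. The sketch assumes that μ simply multiplies the smoothing parameter (the kernel is evaluated with μh), that the underestimation of a tune object is the ratio of its predicted to its actual value, and that the final prediction is the Nadaraya–Watson value divided by α. The names `fit_haske`, `haske_predict` and `tune_rmse`, the triangular kernel, the value of h and the synthetic series are all hypothetical.

```python
import math
from statistics import median

def triangular(u):
    return 1 - abs(u) if -1 < u < 1 else 0.0

def nw(x, pairs, h):
    """Nadaraya-Watson estimate; 0.0 when the neighborhood of x is empty."""
    weights = [triangular((x - xi) / h) for xi, _ in pairs]
    denom = sum(weights)
    return sum(w * yi for w, (_, yi) in zip(weights, pairs)) / denom if denom else 0.0

def to_pairs(series, p_m):
    return [(series[t], series[t + p_m]) for t in range(len(series) - p_m)]

def tune_rmse(train, tune, h):
    errors = [(nw(x, train, h) - y) ** 2 for x, y in tune]
    return math.sqrt(sum(errors) / len(errors))

def fit_haske(series, p_m, h, mu_max=5.0, d_mu=0.5):
    steps = int(round((mu_max - 1) / d_mu))
    candidates = [1 + i * d_mu for i in range(steps + 1)]  # mu = 1 is plain NW
    phase_mus = []
    for ph in range(p_m):                                  # step 4: one search per phase
        pairs = to_pairs(series[:len(series) - ph], p_m)
        train, tune = pairs[:-p_m], pairs[-p_m:]
        phase_mus.append(min(candidates, key=lambda m: tune_rmse(train, tune, m * h)))
    mu = median(phase_mus)                                 # step 5
    pairs = to_pairs(series, p_m)
    train, tune = pairs[:-p_m], pairs[-p_m:]
    ratios = [nw(x, train, mu * h) / y for x, y in tune if y != 0]
    alpha = median(ratios) or 1.0                          # step 6; guard against alpha = 0
    return mu, alpha

def haske_predict(x, pairs, h, mu, alpha):
    return nw(x, pairs, mu * h) / alpha                    # step 7 (assumed form)

# Synthetic series with a trend and a 12-step period (illustrative only).
series = [100 + 2 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(72)]
mu, alpha = fit_haske(series, p_m=12, h=5.0)
forecast = haske_predict(series[-12], to_pairs(series, 12), 5.0, mu, alpha)
print(mu, alpha, forecast)
```

Taking the median over phases keeps a single unlucky split from dictating μ, and dividing by α compensates for the systematic underestimation that a kernel average shows on a series with an upward trend.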
6 The HKSVR estimator
6.1 Background
The support vector regression model described in Sect. 2 is a global one. There are also a number of modifications that use a local learning paradigm. The algorithm presented in [8] uses the k nearest neighbors as a local training set. In another algorithm, the value of the \(\epsilon\) parameter depends on the local covariance matrix (\(\Upsigma_i\)), calculated on the basis of training points from the neighborhood of the point x_{i} [14].
The heuristic of smoothing parameter evaluation presented with the HASKE algorithm gives a more appropriate definition of the test point neighborhood, because it makes it possible to reduce the time series prediction error. It is worth checking whether this new definition of neighborhood improves the results of support vector regression for time series.
The Hybrid, Kernel and Support Vector Regression algorithm (HKSVR) combines kernel regression (the Nadaraya–Watson estimator or a similar one) and SVM regression. The initial step of the algorithm determines the neighborhood of train objects for each test point. The second step performs support vector regression for the test point on the basis of its train neighbors.
Like HASKE, the HKSVR performs prediction as the estimation of the regression function in the modified space. This means that the value of the parameter that defines the transformation is necessary. Unlike HASKE, the HKSVR is designed for the prediction of time series whose period length is hard to identify. The length of the time series period may be determined with the use of the Fourier transformation.
It is important to notice that the HKSVR usefulness depends on the correlation of data in modified space. Usage of the HKSVR brings the prediction improvement if the correlation is significant (close to one).
6.2 Definition
As mentioned in the previous section, this paper describes a new local estimator. It is based on the h-neighborhood definition and its adaptive evaluation, but it is not an extension of HASKE. The first step of the algorithm is the choice of the parameter δ that defines the transformation of the time series from its original space to the space defined in Sect. 3. As mentioned in Sect. 1, the HKSVR is intended for time series with a "hidden" dependence between their previous values. The parameter δ, which determines the transformation to the modified space, does not have to be connected with the maximal interesting prediction horizon, and that is why a different symbol is used. The δ value can be evaluated with the use of Fourier analysis, in which case δ is the length of one of the harmonics.
In the second step, all data are divided into train, tune and test set. Then the adaptive value of the μ parameter is evaluated, as it is performed in the HASKE algorithm. After that, for every test object its h_{μ}-neighborhood is determined and becomes the train set for the support vector regression. Finally, the prediction is performed as the local SVR.
- 1. Split train data into the tune set and the smaller train set.
- 2. Evaluate smoothing parameter h in the standard way (for example Eq. 12).
- 3. Find values of the μ and α as it takes place in the HASKE algorithm for the task of prediction of the tune set.
- 4. For every test object:
  - (a) find its h_{μ}-neighborhood (Eq. 19) in the whole train set; it becomes the train set for the SVR,
  - (b) learn the support vector machine,
  - (c) find the value for the test object as the result of local support vector regression,
  - (d) divide the result by the underestimation (α).
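The per-object loop of step 4 can be outlined as below. This is only a skeleton under assumptions: the h_{μ}-neighborhood is taken to be the set of train pairs whose argument lies within μh of the test argument, and the local learner is passed in as a callable. To keep the sketch free of dependencies, a least-squares line (`fit_line`, a hypothetical stand-in) takes the place of the support vector machine; an actual SVR implementation would be substituted for it.

```python
def h_mu_neighborhood(x, pairs, h, mu):
    """Train pairs whose argument lies within mu * h of x (assumed definition)."""
    return [(xi, yi) for xi, yi in pairs if abs(x - xi) < mu * h]

def fit_line(pairs):
    """Stand-in local learner (NOT an SVR): least-squares line, returned as a predictor."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)

def hksvr_predict(x, train_pairs, h, mu, alpha, fit_local=fit_line):
    neighbors = h_mu_neighborhood(x, train_pairs, h, mu)   # step 4(a)
    if not neighbors:
        raise ValueError("empty neighborhood: mu * h is too small for this point")
    model = fit_local(neighbors)                           # step 4(b)
    return model(x) / alpha                                # steps 4(c) and 4(d)

train_pairs = [(x, 2 * x + 1) for x in range(0, 50, 5)]    # toy data on a line
print(hksvr_predict(12, train_pairs, h=4, mu=2, alpha=1.0))  # close to 25
```

Because the learner sees only the neighbors of each test point, a new model is fitted for every prediction, which is the price of locality.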
7 Time series prediction
All algorithms were run on synthetic and real data. In addition, all experiments followed the rule of unbiased prediction. This rule assumes that the predicted value is equal to the expected value of the predicted random variable. It is required that the expected value of the difference between the predicted random variable and the predicted value tends to zero as the amount of training data increases [30].
7.1 HASKE results
Comparison of the decomposition method, kernel estimators and HASKE
Series | Decomp. (exp. trend, add. model) | Decomp. (exp. trend, mult. model) | Decomp. (lin. trend, add. model) | Decomp. (lin. trend, mult. model) | SF | GM | NW | HASKE |
---|---|---|---|---|---|---|---|---|
M series | 13.36 | 23.23 | 28.47 | 33.68 | 31.24 | 58.53 | 42.77 | 8.74 |
N series | 21.99 | 50.69 | 33.65 | 49.22 | 43.81 | 75.30 | 49.09 | 4.6 |
G series | 40.32 | 26.60 | 64.63 | 68.52 | 139.01 | 475.64 | 275.26 | 17.18 |
E series | 72.06 | 77.72 | 72.56 | 77.57 | 33.66 | 301.30 | 33.37 | 36.76 |
7.2 The HKSVR results
The prediction horizon p varied from 1 to 10 days. The δ parameter was evaluated as the length of the nth maximal harmonic of the Fourier transformation. The length of the first harmonic for time series was comparable to the length of the time series. As a result of this fact, the first harmonic length was not considered as the value of the δ parameter.
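One possible reading of "the length of the nth maximal harmonic" is sketched below: the harmonics of the discrete Fourier transform are ranked by amplitude and the period of the nth strongest one is taken as δ. The ranking rule, the rounding to an integer and the function names are assumptions made for illustration; the plain O(N²) transform is used only to keep the sketch dependency-free.

```python
import cmath
import math

def harmonic_periods(series):
    """Periods N/k of the harmonics k = 1..N//2, strongest amplitude first."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]          # drop the constant component
    ranked = []
    for k in range(1, n // 2 + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centered))
        ranked.append((abs(coef), n / k))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [period for _, period in ranked]

def delta_from_harmonic(series, nth):
    """Length of the nth strongest harmonic, rounded to whole time steps."""
    return round(harmonic_periods(series)[nth - 1])

# Toy series: a strong period of 20 steps and a weaker period of 8 steps.
series = [math.sin(2 * math.pi * t / 20) + 0.5 * math.sin(2 * math.pi * t / 8)
          for t in range(160)]
print(delta_from_harmonic(series, 1))  # 20
print(delta_from_harmonic(series, 2))  # 8
```

For a series dominated by a slow drift, the strongest harmonic has a period close to the length of the whole series, which matches the reason given above for skipping the first harmonic.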
The dependence of the WIG20 estimation accuracy increase on the prediction horizon and the harmonic
nth maximal harmonic | p = 1 | p = 2 | p = 3 | p = 4 | p = 5 | p = 6 | p = 7 | p = 8 | p = 9 | p = 10 |
---|---|---|---|---|---|---|---|---|---|---|
2 | −91 | −40 | −124 | −117 | −53 | −149 | −25 | 25 | 3 | 1 |
3 | −1 | 23 | 6 | 14 | 4 | 14 | 91 | 45 | 19 | 16 |
4 | −298 | 13 | 40 | 25 | 11 | 69 | 48 | 19 | −8 | −41 |
5 | −46 | 112 | −19 | −17 | −11 | −10 | −24 | −27 | −20 | −47 |
6 | −44 | −24 | 3 | 17 | 26 | 21 | 40 | 0 | −146 | −11 |
7 | −13 | 4 | −34 | −28 | 6 | −2 | −5 | −4 | 0 | 0 |
8 | −2 | 3 | 0 | −11 | 266 | 208 | 189 | 136 | 104 | 102 |
9 | −33 | −49 | 403 | 347 | −6 | 0 | 0 | 0 | 0 | 0 |
10 | 114 | 152 | −143 | 0 | 0 | −85 | 0 | 0 | 0 | 0 |
Statistical description of the WIG20 estimation improvement with the usage of the HKSVR model
Horizon | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Avg | −45.9 | 21.7 | 14.6 | 25.6 | 26.9 | 7.4 | 35.0 | 21.7 | −5.4 | 2.3 |
SD | 109.6 | 67.6 | 158.0 | 127.7 | 92.4 | 98.7 | 69.2 | 47.7 | 63.9 | 42.9 |
The character of the prediction improvement can be made more visible by defining the ρ coefficient, which is the quotient of the average prediction improvement and its standard deviation. Observing the value of this coefficient may help to decide whether the use of the HKSVR improves the prediction or not. The higher the positive value, the greater the improvement.
The ρ coefficient values for WIG20 and rWIG20 series
p | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
ρ_{WIG20} | −2.4 | 3.1 | 10.8 | 5.0 | 3.4 | 13.4 | 2.0 | 2.2 | −11.9 | 18.9 |
ρ_{rWIG20} | −4.9 | −1.6 | −1.3 | −1.8 | −2.6 | 3.6 | 23.9 | −2.1 | 4.9 | −4.3 |
It can be seen that the prediction improvement for the rate of return time series decreased significantly. The majority of the positive ρ values became negative.
7.3 ρ coefficient normalization and results interpretation
Q(ρ) = 0 is the asymptotic worst value and corresponds to ρ = −∞
Q(ρ) = 1 is the asymptotic best value and corresponds to ρ = ∞
Q(ρ) ∈ (0, 0.5) corresponds to prediction worsening
Q(ρ) ∈ (0.5, 1) corresponds to prediction improvement
Q(ρ) = 0.5 means that there is no improvement and it corresponds to ρ = 0
1 − Q(−ρ) > Q(ρ) for ρ > 0, which means that for small |ρ| values the worsening has a stronger influence on the Q value than the improvement (for example: Q(0.01) = 0.51 and Q(−0.01) = 0.3).
Figure 7 shows the function Q(ρ) (black dotted line) and β(ρ) (black solid line) for specified values of “betas”.
The Q coefficient values for WIG20 and rWIG20 series
p | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Q (WIG20) | 0.29 | 0.59 | 0.75 | 0.62 | 0.59 | 0.79 | 0.57 | 0.57 | 0.01 | 0.87 |
Q (rWIG20) | 0.12 | 0.36 | 0.39 | 0.34 | 0.27 | 0.60 | 0.92 | 0.32 | 0.62 | 0.15 |
These results may be surprising. However, it is worth recalling that the HKSVR model is based on the HASKE estimator, and this estimator tries to find the regression function in the modified space (strongly connected with the analyzed time series). When the points in the modified space exhibit a correlation (more specifically, when there is a correlation between the dependent and independent variables), the kernel estimator is able to approximate the regression function.
Correlations for the WIG20 time series in the modified space
p | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
h_{2} | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | 0.91 | 0.91 |
h_{3} | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
h_{4} | 0.94 | 0.94 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 |
Correlations for the rWIG20 time series in the modified space
p | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
h_{2} | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.03 |
h_{3} | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 |
h_{4} | 0.01 | 0.02 | 0.01 | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 |
We see that the WIG20 series has a significant correlation in the modified space, and the high correlation values do not depend on the order of the harmonic that defines the translation value δ. The analogous correlations for the rWIG20 series are insignificant (very close to zero). It can therefore be claimed that the use of the HASKE estimator and the HKSVR model is justified in cases where the correlation between the dependent and independent variables is significant.
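The check described here amounts to computing the Pearson correlation between x_{t} and x_{t+δ}. The sketch below shows this on synthetic data, not on the WIG20 series: a simulated price-like level series and its rate of return series, both generated with a fixed random seed. The function name `lagged_correlation` and all parameter values are illustrative assumptions.

```python
import math
import random

def lagged_correlation(series, delta):
    """Pearson correlation between x_t and x_{t+delta} (the modified space)."""
    xs, ys = series[:-delta], series[delta:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

random.seed(0)
returns = [random.gauss(0.0, 0.01) for _ in range(2000)]   # simulated rates of return
levels = [1000.0]
for r in returns:
    levels.append(levels[-1] * (1 + r))                    # simulated price-like levels

print(round(lagged_correlation(levels, delta=5), 2))       # high: levels move slowly
print(round(lagged_correlation(returns, delta=5), 2))      # near zero: returns are independent here
```

A quick computation of this kind, done before fitting, indicates whether the modified space carries enough dependence for a kernel estimator to exploit.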
8 Summary
The article describes two new algorithms of time series prediction. Both belong to the group of non-parametric methods and are based on new definitions of neighborhood and locality. The prediction problem is reduced to the estimation of the regression function in the modified data space.
The first algorithm (HASKE) defines the notion of h-neighborhood and helps to avoid the effect of its emptiness. The two required parameters of this algorithm are calculated adaptively with the use of the train and tune sets. The results of HASKE were compared with the results of the kernel regressor and the decomposition method. This algorithm is designed for time series that have a well-defined time dependency (the value of the period is easy to interpret), as in the case of the G series (the number of passengers has a 12-month period). On the basis of this value the transformation to the new space is performed. Moreover, the data in the modified space should have a significant correlation. If this condition is not fulfilled, the result of HASKE may not be satisfactory, as can be observed in the results for the E series prediction.
The second model (the HKSVR) is a local hybrid combination of the SVR and HASKE. It was tested on real financial data. Generally, this algorithm is suitable for time series that do not have a dominating period. That is why other methods, for example the Fourier transformation, are used to find the value that defines the transformation of the time series to the new space. The algorithm improves the prediction for short time horizons, excluding the prediction of the next value. The applicability of this model depends on the data correlation in the modified space. The normalized criterion Q ∈ (0, 1), describing the improvement of prediction, was used as a comparison tool. Values higher than 0.5 indicate that the use of the HKSVR was appropriate.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.