Gaussian Processes with Skewed Laplace Spectral Mixture Kernels for Long-term Forecasting

Long-term forecasting involves predicting a horizon that is far ahead of the last observation. It is a problem of high practical relevance, for instance for companies in order to decide upon expensive long-term investments. Despite the recent progress and success of Gaussian processes (GPs) based on spectral mixture kernels, long-term forecasting remains a challenging problem for these kernels because they decay exponentially at large horizons. This is mainly due to their use of a mixture of Gaussians to model spectral densities. Characteristics of the signal important for long-term forecasting can be unravelled by investigating the distribution of the Fourier coefficients of (the training part of) the signal, which is non-smooth, heavy-tailed, sparse, and skewed. The heavy tail and skewness characteristics of such distributions in the spectral domain allow to capture long-range covariance of the signal in the time domain. Motivated by these observations, we propose to model spectral densities using a skewed Laplace spectral mixture (SLSM) due to the skewness of its peaks, sparsity, non-smoothness, and heavy tail characteristics. By applying the inverse Fourier Transform to this spectral density we obtain a new GP kernel for long-term forecasting. In addition, we adapt the lottery ticket method, originally developed to prune weights of a neural network, to GPs in order to automatically select the number of kernel components. Results of extensive experiments, including a multivariate time series, show the beneficial effect of the proposed SLSM kernel for long-term extrapolation and robustness to the choice of the number of mixture components.


Introduction
Gaussian Processes (GPs) are an elegant Bayesian approach to modeling an unknown function and representing its predictive uncertainty. They provide regression models where a posterior distribution over the unknown function is maintained as evidence is accumulated. This allows GPs to learn involved functions when a large amount of evidence is available, and makes them robust against overfitting in the presence of little evidence [1,2]. A GP can model a large class of phenomena through the choice of its kernel, which characterizes one's assumption about the autocovariance of arXiv:2011.03974v3 [cs.AI] 2 Oct 2021 the unknown function. The choice of the kernel is a core step when designing a GP, since the posterior distribution can significantly vary for different kernels. Important advances in the design of kernel functions for GPs have been achieved in recent years. In particular, kernels based on exponential distributions [3,4,5,6,7], notably the flexible Spectral Mixture (SM) kernels, have been successfully applied to many real-life extrapolation tasks. However, it remains challenging for SM kernels to perform long-term forecasting. Long-term forecasting involves predictive horizons that are far ahead of the observed training data.
Our experimental analysis indicates that characteristics of the signal important for long-term forecasting can be unravelled by analyzing the distribution of the Fourier coefficients of the signal, as shown in Figure 1 for the (training part of the) Rail Passenger Miles time series dataset (fully described in Section 6.1). The distribution of the Fourier coefficients is non-smooth, heavy-tailed, sparse, and contains skewed peaks. Skewness of the peaks in the empirical density can be observed: the first, third, and fourth peak are right tailed. The heavy tails and skewness of peaks in the empirical spectral density correspond to long-range covariance in the time domain. We have fitted an SM kernel (red dashed line), Lévy kernel process (LKP) kernel (blue dashed), and SLSM kernel (black dashed) on this empirical spectral density, using the least squares method. SM and LKP cannot capture one-side heavy tail characteristics of the spectral density peaks, while SLSM can, yielding a better fit. Note that differences between the kernels are large, although in the figure they look negligible because an exponential scale is used.
Let us analyze the differences between these three spectral kernels in more detail. The inverse Fourier transform of an SM spectral density is an exponential function, so in the time domain, we have an exponential decay of covariance. When we use LKP to fit the empirical spectral density and apply the inverse Fourier transform, we get a Cauchy function, which has a much slower decay than that of the exponential function. Using SLSM we extend the tails of the Laplace distribution (by means of the γ term in Equation (8)), which in the time domain yields a further reduction of the decay rate of the covariance. These observations are illustrated in Figure 2.
In the top row, we can see the shapes of the spectral density for the three kernels, and the flexibility provided by the skewness parameter γ in the SLSM kernel. Covariance functions are shown in the middle row. They indicate that SM has a very short range covariance (range τ < 5). LKP has longer range covariance (range τ < 15) because of the heavier tails in the spectral density. SLSM has the longest covariance representation range (τ >> 15) due to skewness in the frequency domain.
In the bottom row, samples drawn from Gaussian Processes with these kernels are plotted. The SLSM kernel (red line) results in a larger variance, which is more flexible and less smooth than SM and LKP, because of heavy tails at high frequency position in the frequency domain. Both positive and negative skewness (determined by the value of γ) in SLSM can extend the covariance range and therefore increase the variance in the sampled functions. A positive value of γ contributes more to the signal variance. The above observations justify the usefulness of an asymmetric Laplace kernel, capable to capture characteristics of the signal important for long-range forecasting.
Building on previous work on modeling the asymmetric tails of discrete Fourier transform coefficients distributions [8,9,10], we investigate the use of the skewed Laplace mixture to model the spectral density, that is, the Fourier transform (FT) of a kernel. We then use the inverse Fourier transform to construct a kernel for GPs, which we call the Our contributions can be summarized as follows: • we show how long-term forecasting with GPs is linked to long-range characteristics of covariance between random vectors, as determined by properties of the spectral density, namely sparsity, skewed peaks, long tails, and non-smoothness; • we propose a new GP kernel for long-term forecasting, which includes a term to model skewness of each spectral peak; • we compare SLSM with other spectral kernels, and show that SLSM is linked to the popular Rational Quadratic kernel; • we introduce an algorithm for pruning kernel components and show experimentally its beneficial effect; • we perform extensive long-term forecasting experiments with various time series data, including multivariate and large-scale ones.
The rest of the paper is organized as follows. In the next section we summarize related works. Section 3.1 and 3.2 introduce GPs and the SM kernel. Then in Section 4 we introduce the SLSM kernel. Hyper-parameters initialization and automatic pruning of components are introduced in Section 5. Section 6 describes experiments on real-world datasets. Concluding remarks and future work are given in Section 7.

Related work
Because of the fundamental role of kernels in GPs, various works tackled the problem of kernel design, using two main approaches: the compositional approach, where kernels are constructed compositionally from a small set of simple base kernels, and the spectral representation approach, where kernels are constructed by modeling a spectral density, the Fourier transform of a kernel, with a Gaussian mixture, with theoretical support from the Bochner's theorem. In [11] the kernel learning problem is formalised as structure discovery. A space of kernel structures is defined compositionally in terms of sums and products of a small number of base kernel structures. This provides an expressive modeling language that concisely captures many widely used techniques for constructing kernels. An algorithm that searches over this space using a greedy search is proposed. Recent works based on this approach include [12], which introduces a flexible family of kernels represented by a neural network, whose architecture is based on the composition rules for kernels, so that each unit of the network corresponds to a valid kernel; and [13], which derives Bayesian Neural Network architectures mirroring compositional kernel combinations. The SLSM kernel with pruning strategy as proposed in this paper can be interpreted as a form of structure discovery, where unimportant components representing irrelevant structure are removed while important components representing the true underlying structure are kept during learning.
Spectral mixture kernels were introduced in [3] and are described in detail in the next section. The success of these kernels has stimulated research on their extensions and improvements. For instance, the spectral mixture product kernel [14] was introduced for modeling Cartesian structured datasets. Various works address the scalability of GPs with spectral kernels. In particular, in [15] a Bayesian non-parametric kernel-learning is proposed, which places a non-parametric prior on the spectral distribution of random frequencies allowing it to both learn kernels and scale to large datasets. The limitations of stationary kernels, notably their generalization to the non-stationary setting has been addressed in various works. For instance, the non-stationary SM kernel allows to model input-dependent GPs [6]. In [16] a family of non-stationary kernels, called harmonizable mixture kernels, is introduced, which supports stationary and many non-stationary covariances. A robust kernel learning framework for this kernel family is proposed. Spectral kernels with desirable properties, called generalized spectral kernels, have been proposed in [17]: they can approximate any (possibly non-stationary) bounded kernel, and allow inference of the degree of differentiability of the corresponding stochastic process when used as a covariance function. The SLSM kernel is stationary, a limitation discussed in the conclusion section. In [18] a new kernel called generalized convolution spectral mixture is proposed, which uses cross convolution to model dependencies between components of a spectral kernel. The SLSM kernel does not explicitly model dependencies between components.
Our investigation differs from previous works mainly because of the focus on long-term forecasting. The abovementioned spectral kernels consider a short tailed and smooth Gaussian as a base representation of the spectral density, which corresponds to an exponential function for the covariance in the time domain. Therefore these kernels are less suited for long-term forecasting.
In this respect, the most related work is the Lévy kernel process (LKP) [7]. LKP was used for modeling the sharply peaked spectral density with a location-scale mixture of Laplace components, which is more sparse than the SM kernel, hence more robust to the choice of the number of mixture components. LKP can discover heavy-tailed patterns which are useful for long-range extrapolation. However it is only available for one dimensional inputs and does not model skewness characteristics of the spectral density investigated in the proposed SLSM, which, as demonstrated by our extensive experiments, is beneficial for long-term forecasting. Specifically, SLSM extends the tails of the Laplace distribution (by means of the γ term in Equation (8)), which in the time domain yields a slower decrease of the decay rate of the covariance. The value of γ gives increasing weight to the right (lim γ → 1) or to the left (lim γ → −1) tail of the density.
In general, the Laplace distribution has been successfully used in the past within diverse signal processing applications.
Here we mention few examples. In [19] a Laplace mixture model is proposed, as a long-tailed alternative to the Normal distribution when identifying differentially expressed genes in micro-array experiments, and provide an extension to asymmetric over-or under-expression by means of an asymmetric extension of the mixture model. In [9] the multivariate Laplace probability model in the context of a Normal variance mixture model is analyzed and an application of the model to represent the statistics of the discrete Fourier transform coefficients of a speech signal is given. In [8] samples of a speech signal are considered as samples of a random variable, and the speech signal's probability density function in the time domain is investigated. It is shown that the distribution of speech samples is well described by a Laplace distribution. In [20] a method for capturing nonlinear dependencies in images of natural scenes is introduced, which builds a hierarchical model based on Independent Component Analysis and a mixture of Laplace distributions. These works mainly focus on the representation and modeling of spectral density using the Laplace distribution. Instead, our focus is on kernel design for GPs.

Gaussian processes
A Gaussian process defines a distribution over functions, specified by its mean function m(x) and covariance function k(x, x ) for given input vectors x, x ∈ R P [2]. Thus we can define a GP as Without loss of generality we assume the mean of a GP to be zero. The covariance function is used to construct a positive definite covariance matrix on the set X of training points, called the kernel and denoted by K = k(X, X).
By placing a GP prior over functions through the choice of a kernel and parameter initialization, from the training data X we can predict the unknown function valueỹ * and its variance V[ỹ * ] (that is, its uncertainty) for a test point x * using the following key predictive equations for GP regression [2]: where k * = k(x * , X) is the covariance vector between x * and X, and y are the observed function values corresponding to X. Typically, kernels contain free hyper-parameters, denoted by Θ, which can be optimized by minimizing the Negative Log Marginal Likelihood (NLML) of the observed values: where σ 2 n is the noise level and NLML is obtained through marginalization over the latent function [2]. This formulation follows directly from the fact that y ∼ N (0, K + σ 2 n I). In this paper we consider only stationary kernels, which are invariant to translation of the inputs. These can be described

Spectral mixture kernels
The class of flexible stationary kernels for GPs called Spectral Mixture (SM) kernels, was introduced in [3,4]. An SM kernel, here denoted by k SM , is derived through modeling its spectral density (the Fourier transform of a kernel) with a mixture of Gaussians. Such modeling is possible because of Bochner's Theorem [21,22], which states that the properties of a stationary kernel entirely depend on its spectral density: theorem 1 (Bochner) A complex-valued function k on R P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R P if and only if it can be represented as k(τ ) = R P e 2πs τ ψ(ds), where ψ is a positive finite measure, and  denotes the imaginary unit.
If ψ has a densityk(s), thenk is called the spectral density or power spectrum of k, and k andk are Fourier duals, that is: The SM kernel k SM is defined as the inverse Fourier transform (that is, the Fourier dual) of a mixture of Q Gaussians in the frequency domain: where F −1 s→τ denotes the inverse Fourier transform operator from the frequency to the time domain, and w i is the weight of componentk SM,i . Here the spectral densityk SM,i (s) of i-th component, iŝ The symmetrization makesk(s) even, that is,k(s) =k(−s) for all s. So the Fourier transform ofk, that is our kernel, is real, since the Fourier transform of a real even function is real. Furthermore k is symmetric, since the Fourier transform of an even function is even (cf. e.g. [23]).
By expanding equation (5) we get are weight, mean, and variance of the i-th component in the frequency domain, respectively.
The inverse mean 1/µ i of component i is the period, and the inverse standard deviation 1/ √ Σ i the length scale, determining how quickly a component varies with the inputs. So the variance (σ 2 i ) (P ) can be thought of as an inverse length-scale, and µ (P ) i as a frequency. The weight w i specifies the relative contribution of the i-th mixture component.
In addition, by modeling the spectral density as a mixture of Q univariate Laplacians, the kernel form of LKP [7] can written as follows:

Skewed Laplace spectral mixture kernel
Any stationary covariance kernel can be approximated to arbitrary precision by an SM kernel, given enough components of mixture in the spectral representation, because mixture of Gaussians are dense in the set of all distribution functions [24]. So it would seem useless to introduce a new stationary kernel. However, using a large number of components can lead to overfitting when there is not enough training data. Therefore, it makes sense to introduce and study new stationary kernels, that are robust to overfitting in the presence of a large number of components. An example of such a kernel is LKP. However, as mentioned in the introduction, LKP does not model skewness characteristics of the spectral density beneficial for long-range forecasting.
Here we propose to overcome the limitations of the SM and LKP kernels by using a mixture of Skewed Laplace (SL) distributions, which can better capture skewness of peaks of the spectral density and its heavy tail characteristic. We consider the three-parameter (µ, γ, σ) family of skewed Laplace distributions introduced in [25], with density ϕ(s; µ, γ, σ) = where κ = √ 2σ/(γ + 2σ 2 + γ 2 ), γ is the skewness parameter, µ, and σ 2 are the mean and variance of the distribution, respectively. When γ = 0 this distribution reduces to the standard non-skewed Laplace distribution. The inverse Fourier transform of the i-th skewed Laplace density is [26] where C i (τ ) = 1 + 1 2 σ 2 i τ 2 . We proceed to construct a kernel from the skewed Laplace density in the same way as described in the previous section: 1) symmetrize ϕ i (s; µ i , γ i , σ i ); 2) take the inverse Fourier transform of the resulting even real function; and 3) make a mixture of several such components.
First, by symmetrization we obtain the spectral density of the i-th component of the SL kernel: Note that the above transformation does not destroy the skewness of the mixture components (the peaks) of the spectral density: although by construction the spectral density is symmetric around zero, each of its peaks is skewed around its mean (with skew parameter γ i ).
Next, by applying (9) to ϕ i (s; µ i , γ i , σ i ) and ϕ i (−s; µ i , γ i , σ i ), and adding the resulting functions we obtain the time domain representation of the i-th component k SLSM,i of the SLSM kernel: Each component k SLSM,i (τ ) is positive semi-definite because its corresponding spectral density defined in (10) is non-negative everywhere [21,22]. Also, each component is real because its spectral density is even [1].
The denominator of k SLSM,i (τ ) is a skew function of the inverse distance and the numerator term is a periodical trigonometry function. C i (τ ) is an inverse Cauchy function (the inverse Fourier transform of the zero positioned non-skewed Laplace) [7]. The Cauchy function has slower decay over the distance τ than the exponential function used in SM based kernels.
Finally, the Skewed Laplace Spectral Mixture (SLSM) kernel k SLSM corresponds to a mixture of Q skewed Laplace components, and is defined as The Skewed Laplace distribution has a natural extension to higher dimensions, described, e.g., in [25,27,28]. Therefore our kernel can be directly extended to the multivariate setting as follows: where Here Σ i denotes the covariance matrix of component i, and µ i is the vector of means of this component. Figure 2 provides a visual comparison of the SM, LKP and SLSM distributions used for fitting the empirical spectral density. The differences among SLSM, LKP and SM are illustrated in terms of covariance, spectral density, and sampling functions. The subplots in the second row of Figure 2 show random functions drawn from a GP with SM, LKP, and SLSM kernel, respectively. The sampled function values were obtained using 500 equally-spaced discrete points.

Comparison with other kernels
The inverse FT of the SM spectral density is an exponential function, so in the time domain we have an exponential decay of covariance. When we use LKP to fit the empirical spectral density and apply the inverse FT we get a Cauchy function, which decays in a much slower way than the exponential one. Using SLSM we extend the tails of the Laplace distribution (by means of the γ term in equation 8), which in the time domain yields a further reduction of the decay rate of the covariance. This phenomenon is illustrated in Figure 2, which shows longer range covariance of SLSM (middle of right plots). The main characteristics of SM, LKP and SLSM can be summarized as follows: • SM: multivariate Gaussian, dense, non-skewed peaks, short-tailed, smooth, symmetry of each peak, exponential decay of covariance; • LKP: univariate Laplacian, sparse, non-skewed peaks, heavy-tailed, non-smooth, symmetry of each peak, fast decay of covariance; • SLSM: multivariate skewed Laplacians, skewed peaks, one side more heavily-tailed, non-smooth, slower decay of covariance. Figure 1 shows an example where SLSM provides a better fit of the empirical spectral density than SM and LKP. SLSM extends the LKP kernel (apart from the way inference is performed), which can be obtained by removing the skewness term, i.e., by setting the γ i 's to 0. More precisely, LKP as defined in [7], uses Reversed-Jump MCMC (RJ-MCMC) to perform inference. While, SLSM uses LBFGS implemented in GPML toolbox to perform inference.
Also, SLSM is related to the Rational Quadratic (RQ) kernel, The RQ kernel can be interpreted as a scale mixture (an infinite sum) of squared scale mixture SE kernels with different characteristic length-scales [2]. The RQ kernel is considered the most general representation defining a valid isotropic covariance function in R P [4,22]. From Equations (11) and (14) it follows that a non-skewed SLSM component i and a cosine kernel. Thus the RQ kernel with α = 1 can be viewed as modeling the spectral density, as one Laplace distribution at zero mean position in the frequency domain. That is, the RQ kernel can be viewed as part of a component of a SLSM kernel.
SLSM component can be approximated at arbitrary precision with a mixture of Gaussians using sufficiently many components. Thus a SM kernel can approximate the SLSM kernel. However, as observed e.g. by [17], although mixture of Gaussian distributions can be used to approximate any stationary distribution, a large number of SM components might be required to account for the lower degrees of smoothness in the data. This would affect inference, which would become more expensive and be more vulnerable to the presence of local optima. In Figure 2 SM, LKP, and SLSM with 3 components are illustrated in terms of spectral densities, covariance, and sampling paths.

SLSM's hyper-parameter initialization, automatic pruning of components, and scalable inference
Kernels hyper-parameters are inferred by optimizing the negative log marginal likelihood (NLML), which for kernels such as SM and SLSM, amounts to solve a non-convex optimization problem. The initialization of these hyperparameters has a direct impact on the ability of the optimization process to find a good local optimum of the NLML. Below, we describe the initialization procedure used in our experiments. Another parameter of spectral kernels to be set is the number of kernel components. A way to circumvent this problem is to automatically prune irrelevant components. In LKP, automatic pruning of components is part of the method, because of the sparsity-inducing property of the Lévy priors (see [7], supplementary material). For SLSM and SM kernels, we propose a method for automatic pruning kernel components. In order to perform experiments with large datasets, scalable inference is needed, because in GPs exact inference is computationally expensive. Therefore, for the large-scale experiment in Section 6.5 we use the scalable inference procedure described in this section.

Hyper-parameter initialization
In general, SM-based kernels are particularly sensitive to the initialization of their hyper-parameters, and the SLSM kernel shares this initialization problem. Here we apply an initialization strategy that has been shown to be effective in previous works [3,29]. These works suggest that non-smooth peaks of the empirical spectral densities are near the true frequencies and use a Gaussian Mixture Model to fit the empirical spectral density, which is then used to initialize the hyper-parameters of an SM kernel. We make use of this result and initialize the hyper-parameters of the SLSM kernel using a Laplace Mixture Model, to fit the empirical spectral density in order to get Q cluster centers, where Q is the number of components. We use the Expectation Maximization algorithm [30] to estimate the parameters of this mixture model. The resulting estimatesw i , µ i , andΣ i are used as initial hyper-parameters values of the SLSM kernel. The skewness parameter γ i is initialized randomly between -1 and 1. We use the above initialization strategies in all our experiments on real-world datasets.

Automatic pruning of kernel components
In order to automatically select the number of components, we adopt a method originally introduced for pruning the weights of a neural network, based on the Lottery Ticket Hypothesis (LTH): dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that -when trained in isolation -reach test accuracy comparable to the original network in a similar number of iterations [31]. The above mentioned method trains a neural network and prunes a percentage of small weights; the process is repeated in a recursive fashion for each subsequent network. At each iteration, each unpruned weight value is reset to its initial value before training.
In our context, we want to prune Laplace kernel components as part of training the model. This means that the components that we prune are the Laplace kernel components that make up equation (12). Note that kernel components should not be confused with the sines and cosines terms of the Fourier decomposition of the empirical spectral density, which are often also referred to as components. Each component of the kernel covers many frequencies in the spectral density.
Specifically, in our case components with magnitude of the weight w i smaller than 1 are pruned. The choice of this threshold is heuristically motivated: since the energy of spectral density is very big, if a weight is smaller than 1, then the contribution to its component is very small and can be ignored. Figure 6 illustrate this phenomenon, and shows the weights of components of kernels of a GP trained on the sunspot data (described in Section 6.3) using Q = 100 components. Weights of many components (almost 65%) of SLSM are smaller than 1, while for SM and LKP there are less than 30% components with weights smaller than 1.
In our experiments we use the following LTH algorithm to prune components of SM and SLSM kernels for GP: 1. Initialize a GP with Q components.
2. Train the GP for 100 iterations using the LBFGS algorithm (see description below).
3. Prune component with weight smaller than 1.
4. Set weights of unpruned components to their initial value and re-train the GP.
We train the GP model by means of the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm with 100 iterations. The computational cost of this pruning procedure depends on the number of iterations performed. In practice, two pruning rounds are sufficient.
Our LTH pruning algorithm is applicable to SM and SLSM. We also consider the specific variant of SLSM obtained by setting γ = 0 (we call this variant SLSM(γ = 0)). This variant is strongly related with the LKP kernel (see Section 6). As already mentioned, in the LKP method, the Lévy process introduces a sparsity-inducing prior over mixture components, allowing automatic pruning of components.

Scalable inference
For GPs, exact inference is prohibitively slow for more than a few thousand of points because of its O(n 3 ) computational complexity and O(n 2 ) memory complexity [2,1,32,33] in the Cholesky decomposition of covariance matrix. The most computationally expensive steps are inverting the covariance matrix and computing the determinant of the covariance (i.e. (K + σ 2 n I)), see Eq. (2b) and Eq. (3). These problems have been addressed by covariance matrix approximation or inference approximation.
In our experiment with big time series of length n ≥ 10000, we adopt the robust Bayesian Committee Machine (rBCM) of distributed GPs [34] framework to approximate the covariance matrix structures for scalable inference. rBCM is a scalable product-of-experts model for large-scale GP regression, which recursively distributes computations to M independent local computational units and, recombine them to form an overall result. Due to the independence between local GP models, the marginal likelihood (see Eq. (3)) of a full GP using rBCM can be factorized into a product form of M marginal likelihoods of local GP models, thus performing scalable inference of the full GP model.
Using rBCM, we can train a GP model with SLSM kernel as follows: where M is the number of local GPs and p (i) (y (i) |X (i) , Θ) is the marginal likelihood of the i-th local GP using the i-th subset {X (i) , y (i) } of dataset {X, y}. The predictive probability is computed as the product of the predictive probabilities of the independent local GPs, as follows: where β i is the weight of i-th local GP. We assess the performance of SLSM and of the following popular stationary kernels available in the GPML toolbox [2,1]: linear with bias (LIN), squared exponential (SE), piecewise polynomial (Poly), periodic (PER), rational quadratic (RQ), Matérn 5/2 (MA), Gabor, neural network (NN), additive kernel based on SE using unary and pairwise interactions (ADD), as well as SM and LKP kernels. Also, we assess the performance of the LKP kernel as an instance of SLSM with γ = 0, we denote this variant by SLSM(γ = 0). LKP and SLSM(γ = 0) have the same kernel structure, but inference is performed using different optimization methods: LKP uses reversible jump Markov chain Monte Carlo (RJ-MCMC), while SLSM(γ = 0) uses the standard LBFGS method based on gradient-based optimization.

Experiments
Also, we consider popular traditional forecasting methods, namely autoregressive integrated moving average (ARIMA), automatic ARIMA (Auto-ARIMA) and the error, trend, seasonality (ETS) model with Holt-Winters seasonal method. These methods are different from GP models and have been used for forecasting time series with trend and seasonal components. We use the implementation of the ARIMA, Auto-ARIMA, and ETS in the ECOTOOL toolbox, with default parameters values [35].
For SM, LKP, and SLSM, we consider Q = 10 components, which are then automatically pruned. The choice of Q=10 is heuristic, as e.g. in [3]. The underlying assumption is that each component corresponds to a distribution in the frequency domain, which covers a much wider range of frequencies than a sine or cosine located at a single frequency position. Therefore a mixture of 10 components located at different positions can cover the structure of the underlying spectral density reasonably well, that is, the underlying peaks (determined by the weight), frequency range (determined by the variance), skewness, peak positions (determined by the mean) of spectral density in the learned model other than empirical spectral density).
For SM we initialize the hyper-parameters using a Gaussian Mixture Model p GMM (Θ|s) = Q i=1w i N (μ i ,Σ i ). For LKP we use a non-skewed Laplace Mixture Model. All other kernels in our comparison use the standard initialization in GPML toolbox.
The LKP method involves a Lévy process prior and RJ-MCMC to prune extraneous components and to infer the posterior distribution, respectively, as described in [7]. For all other kernels, we use exact inference together with LBFGS for optimizing the kernel hyper-parameters.
For SM and SLSM, we also report results obtained by pruning extraneous components using the LTH algorithm described in Section 5.2. Also, plots of the spectral density, specific skewness values, instance of skewed component, and experiments with varying values of Q in SLSM, LKP, SM are considered to provide a deeper insight into their performance.
In all experiments we use the following evaluation metrics: mean squared error (MSE) MSE = 1 mean absolute error (MAE) MAE = 1 n n i=1 |y i −ỹ i |, whereỹ i denotes the predicted value, and NLML (computed over the training data). Results are average over 10 independent runs. Standard deviations are also reported.
We consider many forecasting tasks with real-world datasets: rail miles, monthly electricity, mean sunspot, computer hardware, the monthly M3-competition dataset with more than 1000 time series, and the large pole telecom dataset. Source code of the proposed SLSM kernel is available at https://github.com/ck2019ML/ GPs-for-long-range-forecatsing.

Long-term forecasting of rail passenger miles: comparing spectral densities of SM, LKP and SLSM
In this experiment, we illustrate on a real-life long-term forecasting problem the differences between SM, LKP and SLSM kernels in terms of their spectral densities. We consider the rail miles dataset 1 . This dataset consists of 220 values covering the period from Jan. 2000 to Apr. 2018. We use Q = 10 components for the SM, LKP and SMLS kernels The top plots in Figure 3 show extrapolation results. For test points far from the last (training) observation, the SLSM kernel achieves best extrapolation performance. The bottom part of Figure 3 shows that the spectral density of the SLSM kernel has components with a higher variance than that of the other kernels as well as components with higher mean.
In Figure 3 we can observe that better fit on empirical spectral density of SM and LKP lead to limited generalization in terms of extrapolation. This indicates that the fitting on empirical spectral density is a poor proxy for the generalization error of testing, since the model may be overfitted, leading to low training error but poor generalization (testing) performance. In contrast, the optimized SLSM kernel looks different from the empirical spectral density. We believe that this is the result of the extra flexibility that skewness provides, which allows the optimizer to find a better local optimum for the kernel hyper-parameters. Which in turn leads to better generalization, since it can capture patterns not present in the empirical spectral density.
Looking in more detail, we can see that the optimized SLSM kernel has two components with large mean, µ 5 = 1.57 and µ 5 = 1.05, which may respectively correspond to smaller periods of about half month and one month, in line with half and one monthly variations that appear in the dataset. In SM and LKP, their spectral densities miss many of variations at higher frequencies, meaning they cannot learn structures located at higher frequencies. In Figure 4 we show spectral density, extrapolation, and non-skewed extrapolation of the 9th skewed SLSM component in SLSM. Figure 4 show that f 9 reduces to zero fast if we just remove its skewness and keep the rest hyper-parameters unchanged, demonstrating that skewness extends covariance range in GP model, which can correlate test points far away from training points and thus is beneficial for long range extrapolation.

Subplots (b) and (c) in
The specific values of weights w, means µ, variances σ 2 , and skewness γ in SLSM determine the contribution, periodicity, length-scale, and long-range dependency of each SLSM component, respectively.
Overall, results indicate that GP with SLSM kernel achieves best performance in terms of MSE (see Table 1) and MAE (see Table 2), while its NLML is bigger than that of SM (see Table 3). This may seem surprising. However, note that the computation of NLML is based on training data (see Eq. (3)). Therefore a lower NLML value usually indicates a better fitting of the training data, but does not guarantee better generalization. Also, note that NLML is the sum of two terms (and a constant term that is ignored): a model fit and a complexity penalty term. The first term is the data fit term which is maximized when the data fits the model very well. The second term is a penalty on the complexity of the model, i.e. the smoother the better. When optimizing NLML finds a balance between the two and this changes with the data observed.

Long-term forecasting of electricity consumption: varying the number of kernel components
In this experiment we investigate comparatively the performance of SM, LKP and SLSM when varying the number of components in a small range (2-10), using real-life data for long-term monthly electricity forecasting. Electricity forecasting is a problem of practical relevance for electric power system planning, tariff regulation and energy trading [36]. The dataset includes monthly residential electricity usage in Iowa city from Jan. 1971 to Oct. 1979. There are a total 106 time points, of which we use the first 60% for training and the rest for testing.
For SLSM, LKP, and SM we use Q = 10 components. Figure 5, subplots (a), (b), and (c), show that SLSM has a better extrapolation performance than SM and LKP, in particular for small peaks corresponding to the winter months. Subplot (d) shows the importance of skewness in SLSM, with a component having a relatively very high skewness value.
On this data, LKP selects automatically 11 components. Figure 5 (subplots (e) and (f)) shows the performance of SLSM, SLSM(γ = 0), and SM with varying number of components. Using the same value of Q for all kernels seems unfair, because SLSM has 4Q parameters while SM and LKP have 3Q parameters. Comparing performance at same value of the number of parameters would mean to increase the value of Q for SM and LKP. However, using a bigger Q does not necessarily yield better performance. For instance, results indicate that SLSM with Q = 6 is still better than SM and LKP with Q = 10. Results in terms of MSE, MAE, and NLML are given in Tables 1, Tables 2 and 3, respectively.

Long-term prediction of mean sunspot numbers: large number of kernel components
Here we further assess long-term forecasting capability of the considered kernels, this time with a large number of components. We consider the mean total sunspot number dataset. For this dataset the monthly number of sunspots was collected from Jan. 1842 -Dec. 1933 at the Royal Observatory of Belgium, Brussels, resulting in 1104 time points. We consider the 13-month smoothed data 2 . We consider Q = 100 for the SM, LKP and SLSM kernels. In order to increase the difficulty of extrapolation and perform longer-term forecasting, we use less training samples (45%) than test samples (55%).
Results are shown in Figure 6, in particular the hyper-parameters histogram of SM, LKP, and SLSM kernels, which indicates that in addition to long-range covariance, skewness is also beneficial for robust learning in the presence of a large number of components. However a large value of Q affects the optimization time due to the larger of hyper-parameter space. Table 1 and Table 2 shows improved MSE and MAE results of SLSM over the other kernels, while in this experiment NLML of SLSM is the smallest (see Table 3).

Extrapolating CPU relative performance: multidimensional computer hardware data
We now investigate the performance of the considered spectral kernels on an extrapolation task with multi-variate data. We consider a multidimensional computer hardware dataset. The dataset recorded relative CPU performance collected on Oct. 1987. There are 209 multivariate samples and the following 8 features: vendor name, model name, machine cycle time, minimum main memory, maximum main memory, cache memory, minimum channels, maximum channels. The task is to predict CPU relative performance. In this experiment we use Q = 10 components for SLSM, SLSM(γ = 0), and SM. We use the first 60% of the data for training and the rest for testing. Before training the GPs, we normalize the training data so that the data is centered around the mean and has unit standard deviation. The mean and standard deviation of the training data are used to normalized test data.
Note that for SM, LKP, and SLSM on this unevenly discrete sampled multi-dimensional dataset, the spectral density structures are unclear and their hyper-parameter spaces are much larger than for time series. Input locations in this multi-dimensional dataset are not like those of time series: they are unevenly discretely distributed. Thus we cannot get the multidimensional Fourier transform of the data. So we do not know their spectral densities structure. The size of their hyper-parameter space is equal to their input dimension times the length of the time series. Therefore, on this small multi-dimensional dataset we did not apply our procedure for pruning components. Also, we could not use informed hyper-parameter initialization for SM, LKP, and SLSM, hence we just randomly initialize all kernels.
LKP with Lévy prior and RJ-MCMC inference was developed and tested on one-dimensional data. So, in order to compare LKP with SLSM we only consider SLSM with γ = 0, with standard exact inference instead of RJ-MCMC inference used in LKP. The other baseline kernels use automatic relevance determination (ARD) to scale each dimension independently. Results of this experiment, in terms of MSE, MAE and NLML are given in Table 1, Table 2, and Table 3, respectively. While for SM a large number of components have weights bigger than 10, which means SM is very dense and these SM components usually have bigger magnitude in the frequency domain. On the other hand, the histogram of variance σ shows that SLSM components are much more heavy tailed with bigger variances than those of LKP and SM. SM components in the frequency domain are very short tailed, which exhibit fast decaying behaviors.

Long-term telecommunication forecasting: a large dataset
In order to investigate scalability of SLSM, we consider a large dataset: the pole telecom dataset [37]. The pole telecomm dataset describing a telecommunication problem has 26 non-constant input dimensions, 10000/5000 samples for train/test, respectively. By using rBCM, we can reduce the computational complexity of GP inference to O(mn 2 m ), where m and n m denote the number of BCM's and the size of data used in each BCM, respectively.
For SLSM, SLSM(γ = 0), and SM, we use Q = 5 components and randomly hyper-parameter initialization due to the unclear spectral density structure. ARIMA, ETS, and LKP using Lévy prior and RJ-MCMC inference cannot be (directly) applied to multi-dimensional data. In this experiment we do not perform automatic pruning of components with the LTH algorithm, due to the relatively low efficiency of its inference process. Results indicate good performance of SLSM when applied to a large dataset.

Long-term forecasting: large number of times series
We conclude our empirical investigation with experiments on the M3-Competition monthly data, which includes more than 1000 time series [38]. The M3-Competitions data consists mainly of business and economic time series of five types: micro, industry, macro, finance, and demographic. Because most of these time series are short, in order to perform long-term forecasting, we use the first 80% time points of each time series as training data, and the remaining data of each time series as test.
Time series in M3-Competition monthly data have different overall scales. The MSE is sensitive to the overall scale of the test data. Therefore, here we use the Standardized Mean Squared Error (SMSE): by normalizing with the variance of the target values of the test [2], then average the predictive performances of all M3-Competition monthly time series. Table 5 contains SMSE results obtained by GP with spectral kernels, and by the ARIMA, and ETS methods. Results indicate superior performance of GP with spectral kernels over traditional forecasting methods. Note, however, that we did not tune the parameters of ARIMA, and ETS methods, but used their default values given in ECOTOOL toolbox.

Discussion
Results of our experiments indicate that a GP with the SLSM kernel consistently outperforms other baselines in terms of MSE and MAE, but its NLML is not always the lowest. A reason for this phenomenon could be that the marginal   likelihood surfaces for the GPs with SM, LKP, and SLSM kernels usually have many local optima. Also, computation of NLML (Eq (3)) uses the training data, which may show a good fit, but this does not guarantee a better generalization performance. Analogously, our experiments indicate that a better fitting of the empirical spectral density does not necessarily yield the best extrapolation performance (see Figures 5 and 3). Based on these considerations, MSE and MAE are better evaluation metrics, because they are computed on the test data, with a slight preference for MAE, which is less sensitive to outliers.
Unsatisfactory performance of methods like SE, Poly, MA, Gabor, ADD, and Auto-ARIMA, is due to the fact that these methods are designed to perform interpolation, while we consider extrapolation tasks. Better performance of GPs with SLSM over other spectral kernels like SM and LKP, can be explained by the fact that SLSM extends the covariance range in the GP model, which can correlate test points far away from training and thus be beneficial for long range extrapolation. Specifically, compared to other kernels, the spectral density of the SLSM kernel tends to have very sharp peaks. In addition, we can observe that the spectral density SLSM has much bigger variance than SM and LKP, which further extends the tails of components.
Since the variance in the frequency domain is inversely related to the length-scale in the time domain, these components will lead to a kernel with long-range covariances in the time domain. This can explain the good extrapolation performance of the kernel. Also, compared to the SM and LKP kernels, the SLSM kernel can discover more short period patterns, corresponding to spectral density components with large means. In general the long-term variation is more important for extrapolation because short period patterns mixed with noise and specific local factors are difficult to capture.
Note that SLSM (γ = 0) has the same kernel structure and implementation as LKP, but uses LBFGS implemented in GPML toolbox to perform inference instead of Reversed-Jump MCMC (RJ-MCMC) in LKP [7]. Results of our experiments indicate that SLSM is also superior to SLSM(γ = 0), showing that the kernel used in the LKP method (that is, SLSM(γ = 0)) can be improved by incorporating skewness. Also, results show superior performance of SLSM(γ = 0) over LKP, indicating that for this type of kernel, gradient descent optimization usually infers a GP model that is better than that obtained using sampling based optimization. In general, we could not compare results achieved independently by the authors of the LKP method, because no quantitative results are reported in their paper. For SM, In general, we could not compare results achieved independently by the authors of the LKP method [7] , because no quantitative results are reported in their paper. For SM, we run experiments also on the monthly airline passenger with same training and test splitting as in [3]. The monthly airline passenger time series has length 144, with measurements from 1949 to 1961. It uses 96 records for training and the rest 48 records as test data. For SM we achieved the following results: average NLML, MAE, and MSE equal to 353.99, 16.87, and 460.04, respectively, close to those achieved independently in [3]. Thus our implementation of SM is in agreement with the original method described in [3]. On the same dataset, SMSL achieved the following results: average NLML, MAE, and MSE equal to 354.71, 16.78, 456.24, respectively.
We performed experiments with varying number of components, Q = 10, Q between 2 and 10, Q = 100, and investigated the effect of pruning components using the proposed LTH algorithm. Results indicate that automatic pruning using our LTH algorithm can drastically reduce the number of kernel components, which reduces the size of the hyper-parameter space, and hence improves performance (see Table 4).

Conclusion
We have proposed a new kernel for long-term forecasting with GPs, derived in a principled way by modeling the empirical spectral densities using a skewed Laplace spectral mixture, whose characteristics are more beneficial for long-range forecasting than those of the Gaussian mixture used in SM kernels.
Experiments on real-world datasets demonstrated that by using the SLSM kernel dense structures as well as sparse structures in the data can be captured. In addition, the SLSM kernel was shown to be able to learn periodical patterns, especially short-term patterns in the signal.
Overall, the SLSM kernel appears to be especially beneficial for time series containing patterns with long-range dependencies, as indicated, for example, by experiments with the electricity and rail miles datasets. Other real-world tasks [39] suited for SLSM include forecasting sea level [40] to estimate the climate change trends, and long-term forecasting of remaining useful battery life [41] to ensure reliable system operation and to minimise maintenance costs.
The SLSM kernel also has a connection with RQ kernel, which not only provides a new interpretation of the RQ kernel, but also demonstrates the usefulness of the SLSM kernel as a principled alternative to the exponential and RQ kernels. We might further extend SLSM to increase its long-range modeling capability by using a RQ kernel with α < 1 instead of the Cauchy function term of the SLSM kernel.
The SLSM kernel can also be generalized to scalable inference like Cartesian structured multidimensional datasets like images following [42,4,14,43]. The multidimensional SLSM kernel becomes: where k (p) SLSM,i (τ ) is the i-th SLSM component on the p-th dimension. Equation (18) can be used for fast inference by decomposing K SLSM as the Kronecker product of matrices over each input dimension K SLSM = K 1 SLSM ⊗ ... ⊗ K P SLSM . It is interesting to investigate applications involving such data.
The main limitation of the proposed kernel is that it is not suited for modeling non-stationary phenomena [6]. An interesting open problem is exploring how to overcome this limitation without loss of interpretation.