Kernels for Gaussian Processes

Gaussian processes (GPs) are an elegant Bayesian approach to modeling an unknown function. The choice of the kernel characterizes one's assumption on how the unknown function autocovaries. It is a core aspect of GP design, since the posterior distribution can significantly vary for different kernels. The spectral mixture (SM) kernel is derived by modelling a spectral density - the Fourier transform of a kernel - with a linear mixture of Gaussian components. As such, the SM kernel cannot model dependencies between components. In this paper we use cross convolution to model dependencies between components and derive a new kernel called Generalized Convolution Spectral Mixture (GCSM). Experimental analysis of GCSM on synthetic and real-life datasets indicates the benefit of modeling dependencies between components for reducing uncertainty and for improving performance in extrapolation tasks.


Introduction
Gaussian processes (GPs) provide regression models where a posterior distribution over the unknown function is maintained as evidence is accumulated. This allows GPs to learn complex functions when a large amount of evidence is available, and it makes them robust against overfitting in the presence of little evidence. GPs can model a large class of phenomena through the choice of the kernel, which characterizes one's assumption on how the unknown function autocovaries [17,18]. The choice of the kernel is a core aspect of GP design, since the posterior distribution can significantly vary for different kernels. In particular, in [24] a flexible kernel called Spectral Mixture (SM) was defined by modeling the kernel's spectrum with a mixture of Gaussians. An SM kernel can be represented by a sum of components, and can be derived from Bochner's theorem as the inverse Fourier Transform (FT) of its corresponding spectral density. SM kernels assume mutual independence of their components [24][25][26].
Here we propose a generalization of SM kernels that explicitly incorporates dependencies between components. We use cross convolution to model dependencies between components, and derive a new kernel called the Generalized Convolution Spectral Mixture (GCSM) kernel. The number of hyper-parameters remains equal to that of SM, and there is no increase in computational complexity. A stochastic variational inference technique is used to perform scalable inference. In the proposed framework, GCSM without cross components (that is, considering only the auto-convolution of base components) reduces to the SM kernel.
We assess the performance of GCSM kernels through extensive experiments on real-life datasets. The results show that GCSM is able to capture dependence structure in time series and multi-dimensional data containing correlated patterns. Furthermore, we show the benefits of the proposed kernel for reducing uncertainty, overestimation and underestimation in extrapolation tasks. Our main contributions can be summarized as follows:
- a new spectral mixture kernel that captures dependencies between components;
- two metrics, posterior correlation (see Eq. 10) and learned dependency (see Eq. 19), to analyze intrinsic dependencies between components in the SM kernel and dependencies captured by our kernel, respectively;
- an extensive comparison between the proposed GCSM and other SM kernels in terms of spectral density, covariance, posterior predictive density and sampling, as well as in terms of performance gain.
The remainder of this paper is organized as follows. We start by giving background on GPs and SM kernels, and briefly describe related work. Next, we introduce the GCSM kernel and discuss the differences between the GCSM and SM kernels. Then we describe the experimental setting and show results on synthetic and real-world datasets. We conclude with a summary and a discussion of future work.

Background
A GP is any distribution over functions such that any finite set of function values has a joint Gaussian distribution. A GP model, before conditioning on the data, is completely specified by its mean function m(x) = E(f(x)) and its covariance function (also called kernel) k(x, x′) = cov(f(x), f(x′)) for input vectors x, x′ ∈ R^P. It is common practice to assume that the mean function is simply zero everywhere, since uncertainty about the mean function can be taken into account by adding an extra term to the kernel (cf. e.g. [18]).
The kernel induces a positive definite covariance matrix K = k(X, X) on the set X of training locations. For a regression task [18], by choosing a kernel and inferring its hyper-parameters Θ, we can predict the unknown function value ỹ* and its variance V[ỹ*] (the uncertainty) for a test point x* as follows:

ỹ* = k*ᵀ (K + σ_n² I)⁻¹ y,        (1)

V[ỹ*] = k** − k*ᵀ (K + σ_n² I)⁻¹ k*,        (2)

where k** = k(x*, x*), k* is the vector of covariances between x* and X, y are the observed values at the training locations in X, and σ_n² is the noise variance. The hyper-parameters can be optimized by minimizing the Negative Log Marginal Likelihood (NLML) − log p(y|X, Θ). Smoothness and generalization properties of GPs depend on the kernel function and its hyper-parameters Θ [18]. In particular, the SM kernel [26], here denoted by k_SM, is derived by modeling the empirical spectral density as a Gaussian mixture, using Bochner's Theorem [2,22], resulting in the following kernel:

k_SM(τ) = Σ_{i=1}^{Q} w_i k_SM_i(τ),  with  k_SM_i(τ) = Π_{p=1}^{P} exp(−2π² τ_p² σ_{i,p}²) cos(2π τ_p μ_{i,p}),        (3)

where τ = x − x′, Q denotes the number of components, k_SM_i is the i-th component, P denotes the input dimension, and w_i, μ_i = [μ_{i,1}, ..., μ_{i,P}], and Σ_i = diag(σ_{i,1}², ..., σ_{i,P}²) are the weight, mean, and variance of the i-th component in the frequency domain, respectively. The variance σ_i² can be thought of as an inverse length-scale, μ_i as a frequency, and w_i as a contribution. For the SM kernel, k̂_SM_i(s) is a symmetrized scale-location Gaussian in the frequency domain.
The SM kernel does not consider dependencies between components, because it is a linear combination of {k_SM_i}_{i=1}^Q (see Eq. 3). Therefore its underlying assumption is that such components are mutually independent. One should not confuse the spectral mixture components that make up the spectral density of the SM kernel with the base components of the Fourier Transform (FT): (1) FT components are periodic trigonometric functions, such as sine and cosine functions, while SM kernel components are quasi-periodic Gaussian functions; (2) FT components are orthogonal (i.e. the product of an arbitrary pair of Fourier series components is zero), while the product of two arbitrary SM components is not necessarily equal to zero; (3) an SM component in the frequency domain is a Gaussian function covering a wide frequency range, while an FT component is just a sharp peak at a single frequency, which may be covered by multiple SM components.
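To make Eq. (3) concrete, the following sketch evaluates a one-dimensional SM kernel as a weighted sum of Gaussian-envelope cosine components. The parameter values are illustrative, not taken from the paper's experiments.

```python
# One-dimensional SM kernel (Eq. 3): a weighted sum of Gaussian-envelope
# cosine components.  Weights, means, and variances are illustrative.
import numpy as np

def sm_kernel(tau, weights, means, variances):
    """k_SM(tau) = sum_i w_i * exp(-2 pi^2 tau^2 s2_i) * cos(2 pi tau mu_i)."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, s2 in zip(weights, means, variances):
        k += w * np.exp(-2 * np.pi**2 * tau**2 * s2) * np.cos(2 * np.pi * tau * mu)
    return k

w, mu, s2 = [0.6, 0.4], [0.5, 1.5], [0.05, 0.1]
print(sm_kernel(0.0, w, mu, s2))  # k(0) = w_1 + w_2 = 1.0
```

At τ = 0 the kernel returns the sum of the weights, and the Gram matrix built from it is positive semi-definite, as required of a covariance function.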

Related Work
Various kernel functions have been proposed [18], such as Squared Exponential (SE), Periodic (PER), and general Matérn (MA). Recently, Spectral Mixture (SM) kernels were proposed in [24]. Additive GPs were proposed in [4]: a GP model whose kernel implicitly sums over all possible products of one-dimensional base kernels. Extensions of these kernels include the spectral mixture product kernel (SMP) [25], k_SMP(τ|Θ) = Π_{p=1}^P k_SM(τ_p|Θ_p), which uses multi-dimensional SM kernels and extends the application scope of SM kernels to image data and spatial-temporal data. Other interesting families of kernels include non-stationary kernels [7,10,19,21], which are capable of learning input-dependent covariances between inputs. None of these kernels considers dependencies between components. To the best of our knowledge, our proposed kernel is the first attempt to explicitly model dependencies between components.
The problem of expressing, with kernels, the structure present in the data being modeled has also been investigated in the context of kernel composition. For instance, in [3] a framework was introduced for composing kernel structures. A space of kernel structures is defined compositionally in terms of sums and products of a small number of base kernel structures. Then an automatic search over this space of kernel structures is performed, using the marginal likelihood as search criterion. Although composing kernels allows one to produce kernels combining several high-level properties, the resulting kernels depend on the choice of base kernel families, composition operators, and search strategy. Instead, here we directly enhance SM kernels by incorporating dependencies between components.

Dependencies Between SM Components
Since the SM kernel is additive, any f ~ GP(0, k_SM) can be expressed as a sum f = Σ_{i=1}^Q f_i with f_i ~ GP(0, w_i k_SM_i). With a slight abuse of notation we denote by f_i the function values at the training locations X, and by f*_i the function values at some set of query locations X*. From the additivity of the SM kernel it follows that the f_i's are a priori independent. Then, by using the formula for Gaussian conditionals we can give the conditional distribution of a GP-distributed function f*_i conditioned on its sum with another GP-distributed function f_j:

f*_i | (f_i + f_j) ~ N(μ*_i, Σ*_i),

where

μ*_i = K*_i (K_i + K_j)⁻¹ (f_i + f_j),  Σ*_i = K**_i − K*_i (K_i + K_j)⁻¹ (K*_i)ᵀ,

with K_i = w_i k_SM_i(X, X), K*_i = w_i k_SM_i(X*, X), and K**_i = w_i k_SM_i(X*, X*). The reader is referred to [3] (Sect. 2.4.5) for the derivation of these results. The Gaussian conditionals express the model's posterior uncertainty about the different components of the signal, integrating over the possible configurations of the other components.
We can also compute the posterior covariance between the heights of any two functions, conditioned on their sum [3]:

Cov(f*_i, f*_j | f_i + f_j) = −K*_i (K_i + K_j)⁻¹ (K*_j)ᵀ.

We define the posterior correlation ρ*_ij as the normalized posterior covariance:

ρ*_ij = Cov(f*_i, f*_j) / (V[f*_i] V[f*_j])^{1/2}.        (10)

Since ρ*_ij is in general nonzero when dependencies between components are present, we can use ρ*_ij ≠ 0 as an indicator of statistical dependence between components i and j. In our experiments, we will use the normalized posterior covariance to illustrate the presence of dependencies between components in SM kernels for GPs.
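The posterior correlation ρ*_ij of Eq. (10) can be computed numerically. The sketch below uses two SE kernels with different lengthscales as stand-ins for two SM components, and an arbitrary synthetic signal as the observed sum; everything here is illustrative.

```python
# Posterior correlation rho*_ij between two additive GP components,
# conditioned on their observed sum (cf. Eq. 10).  SE kernels with
# different lengthscales stand in for two SM components.
import numpy as np

def se(X1, X2, ls):
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

X = np.linspace(0, 4, 25)
y = np.sin(3 * X) + 0.3 * X            # observed sum of two latent signals
xs = np.array([2.0])                   # query location

K1, K2 = se(X, X, 0.3), se(X, X, 2.0)  # priors of the two components
Kinv = np.linalg.inv(K1 + K2 + 1e-4 * np.eye(len(X)))

k1s, k2s = se(X, xs, 0.3), se(X, xs, 2.0)
var1 = se(xs, xs, 0.3) - k1s.T @ Kinv @ k1s   # posterior variance of f1*
var2 = se(xs, xs, 2.0) - k2s.T @ Kinv @ k2s   # posterior variance of f2*
cov12 = -k1s.T @ Kinv @ k2s                   # posterior cross-covariance
rho = (cov12 / np.sqrt(var1 * var2)).item()   # normalized, as in Eq. (10)
print(rho)
```

Because the joint posterior over (f1*, f2*) is a valid Gaussian, ρ lies in [−1, 1]; a nonzero value signals posterior dependence between the components even though they are a priori independent.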

Generalized Convolution SM Kernels
We propose to generalize SM kernels by incorporating cross component terms. To this aim we use versions of the seminal Convolution theorem, which states that under suitable conditions the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. In particular, convolution in the time domain equals point-wise multiplication in the frequency domain. The construction of our kernel relies on the fact that any stationary kernel k(x, x′) can be represented in convolution form on R^P (see e.g. [5,6,13]):

k(x, x′) = ∫_{R^P} g(x − u) g(x′ − u) du.

By applying a Fourier transform to the above general convolution form of the kernel we obtain k̂(s) = (ĝ(s))² in the frequency domain. For each weighted component w_i k_SM_i(τ) in the SM kernel, we can define the function ĝ_SM_i(s) as

ĝ_SM_i(s) = (w_i φ_SM_i(s))^{1/2},        (12)

which is the basis function of the i-th weighted spectral density. We use cross-correlation, which is similar in nature to the convolution of two functions.
The cross-correlation of functions f(τ) and g(τ) is equivalent to the convolution of f̄(−τ) and g(τ) [1]. By the Convolution theorem, the cross-correlation between two basis functions can be written as

(g_SM_i ⋆ g_SM_j)(τ) = F⁻¹_{s→τ}[ ĝ_SM_i(s) · ĝ_SM_j(s) ](τ),

where F⁻¹_{s→τ} denotes the inverse FT, ⋆ the cross-correlation operator, and the overline the complex conjugate (the conjugate can be dropped here because ĝ_SM_i is real-valued). Here φ_SM_i(s) = N(s; μ_i, Σ_i) is a symmetrized scale-location Gaussian in the frequency domain (φ̄_SM_i(s) = φ_SM_i(s)). The product of the Gaussians φ_SM_i(s) and φ_SM_j(s) is also a Gaussian. Therefore, the cross-correlation term in the frequency domain also has a Gaussian form and is strictly positive, which implies the presence of dependencies between f_i and f_j.
The cross-correlation term k^{i×j}_GCSM(τ) of our new kernel, obtained as the cross-correlation of the i-th and j-th base components in SM, corresponds to the cross spectral density term

k̂^{i×j}_GCSM(s) = ĝ_SM_i(s) · ĝ_SM_j(s)        (14)

in the frequency domain. From (12) and (14) we obtain the parameters of the cross spectral density term k̂^{i×j}_GCSM(s) corresponding to the cross convolution component k^{i×j}_GCSM(τ): a cross weight w_ij, a cross amplitude a_ij, a cross mean μ_ij, and a cross covariance Σ_ij, which follow from the standard product-of-Gaussians identity. Parameters μ_ij and Σ_ij can be interpreted as the frequency and inverse length-scale of the cross component k^{i×j}_GCSM(τ), respectively. The cross amplitude a_ij is a normalization constant which does not depend on s.
Observe that when ĝ_SM_i(s) is equal to ĝ_SM_j(s), the quantities w_ij, a_ij, μ_ij, and Σ_ij reduce to w_i, 1, μ_i, and Σ_i, respectively. In this case, the cross spectral density k̂^{i×j}_GCSM(s) is equal to k̂_SM_i(s). We can also observe that the closer the frequencies μ_i and μ_j and the scales Σ_i and Σ_j of components i and j in the SM kernel are, the higher the contribution of the cross convolution components in GCSM will be.
Using the inverse FT, by the distributivity of the convolution operator and by the symmetry of the spectral density, we can obtain the GCSM kernel with Q (auto-convolution) components as:

k_GCSM(τ) = Σ_{i=1}^{Q} Σ_{j=1}^{Q} c_ij k^{i×j}_GCSM(τ),        (16)

where c_ij = w_ij a_ij is the cross contribution, incorporating cross weight and cross amplitude to quantify the dependency between components in the GCSM kernel. The proof that GCSM is positive semi-definite is given in the Appendix. The auto-convolution terms (i = j) in GCSM correspond to the components in SM, since k^{i×i}_GCSM(τ) = k_SM_i(τ). GCSM is thus a mixture of periodic cosine kernels and their dependencies, weighted by exponential weights. Figure 1 illustrates the difference between SM and GCSM, where each connection represents a convolution component of the kernel. SM is an auto-convolution spectral mixture kernel that ignores the cross-correlation between base components. The figure also shows that SM is a special case of GCSM, since the latter involves both cross convolution and auto-convolution of base components. In GCSM, dependencies are explicitly modeled and quantified. In the experiment illustrated in Fig. 2, SM and GCSM have the same initial parameters and the same noise term. The observations are sampled from a GP(0, K_SM + K_GCSM). From Fig. 2 we can observe clear differences in the kernel functions (SM: top, in dashed red; GCSM: bottom, in dashed blue) in terms of amplitude, peaks, and trend. For the corresponding spectral densities, the dependence (in magenta) modeled by GCSM is also a Gaussian in the frequency domain, which yields a spectral mixture with a different magnitude. The posterior distribution and sampling are obtained from GCSM and SM conditioned on six observations (black crosses). One can observe that the predictive distribution of GCSM has a tighter confidence interval (in blue shadow) than SM (in red shadow).
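A cross term of Eq. (14) can be checked numerically without its closed form: form the product of the square-root spectral densities of two components on a frequency grid and take the inverse Fourier transform by quadrature. This is a sketch with illustrative parameters, not the paper's implementation; its sanity check is that the i = j "cross" term recovers the SM component itself.

```python
# Numerical sketch of a GCSM cross term: product of square-root spectral
# densities (Eq. 14), inverse-Fourier-transformed by quadrature.
import numpy as np

def g_hat(s, w, mu, sigma2):
    """Square root of a symmetrized, weighted Gaussian spectral density."""
    n = lambda m: np.exp(-0.5 * (s - m) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.sqrt(w * 0.5 * (n(mu) + n(-mu)))

s = np.linspace(-20, 20, 40001)     # frequency grid
tau = 0.3                           # time lag at which to evaluate

def cross_cov(wi, mui, s2i, wj, muj, s2j):
    dens = g_hat(s, wi, mui, s2i) * g_hat(s, wj, muj, s2j)     # Eq. (14)
    ds = s[1] - s[0]
    return np.sum(dens * np.cos(2 * np.pi * s * tau)) * ds     # inverse FT at tau

# Sanity check: the i = j term reduces to the SM component (Eq. 3).
k_auto = cross_cov(0.7, 1.0, 0.04, 0.7, 1.0, 0.04)
k_sm = 0.7 * np.exp(-2 * np.pi**2 * tau**2 * 0.04) * np.cos(2 * np.pi * tau * 1.0)
k_cross = cross_cov(0.7, 1.0, 0.04, 0.5, 1.2, 0.09)            # an i != j term
```

The i ≠ j term is larger the closer the two components' frequencies and scales are, matching the observation above about the contribution of cross components.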

Fig. 2.
Covariance, spectral density, and posterior functions drawn from GPs with SM and GCSM kernels, conditioned on six samples. In the first row, two SM components (w1 k_SM_1(τ) and w2 k_SM_2(τ)) correspond to the two solid lines (in cyan and black). In the second row, two GCSM components with dependent structure (k^{1×2}_GCSM(τ), in magenta). SM and GCSM plots have the same axes. (Color figure online)

Scalable Inference
Exact inference for GPs is prohibitively slow for more than a few thousand datapoints, as it involves inverting the covariance matrix, (K + σ_n² I)⁻¹, and computing its determinant, |K + σ_n² I|. These issues are addressed by covariance matrix approximation [16,20,23] and by inference approximation [8,9].
Here we employ stochastic variational inference (SVI), which provides a generalized framework combining inducing points u and variational inference, yielding impressive efficiency and precision. Specifically, SVI approximates the true GP posterior with a GP conditioned on a small set of inducing points u, which act as global variables that summarize the training data and are used to perform variational inference. The variational distribution P(u) = N(u; μ_u, Σ_u) gives a variational lower bound L_3(μ_u, Σ_u), also called the Evidence Lower Bound (ELBO), on the quantity log p(y|X). From [9], the variational distribution N(u; μ_u, Σ_u) contains all the information in the posterior approximation: it represents the distribution over function values at the inducing points u. From ∂L_3/∂μ_u = 0 and ∂L_3/∂Σ_u = 0 we can obtain an optimal solution for the variational distribution. The posterior distribution for a test point x* can then be written as

p(f*|y) ≈ N( k_{*u} K_{uu}⁻¹ μ_u,  k_{**} − k_{*u} K_{uu}⁻¹ (K_{uu} − Σ_u) K_{uu}⁻¹ k_{u*} ),

where k_{*u} is the GCSM covariance vector between u and the test point x*. The complexity of SVI is O(m³), where m is the number of inducing points.
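The SVI predictive step can be sketched directly from the formulas above. This is an illustrative NumPy sketch, not a GPflow model: an SE kernel stands in for GCSM, and the "optimized" variational parameters μ_u, Σ_u are placeholders rather than the result of actual ELBO optimization.

```python
# Sketch of the SVI predictive: given a variational distribution
# N(mu_u, Sigma_u) over function values at m inducing points, prediction
# costs only O(m^3).  Kernel and all numbers are illustrative.
import numpy as np

def se(X1, X2, ls=1.0):
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

Z = np.linspace(0, 5, 8)           # m = 8 inducing locations
mu_u = np.sin(Z)                   # placeholder for the optimized variational mean
Sigma_u = 0.01 * np.eye(len(Z))    # placeholder for the variational covariance
xs = np.array([2.2])               # test point

Kuu = se(Z, Z) + 1e-8 * np.eye(len(Z))
Ksu = se(xs, Z)                               # k_{*u}
A = Ksu @ np.linalg.inv(Kuu)                  # k_{*u} K_uu^{-1}
mean = A @ mu_u                               # predictive mean
var = se(xs, xs) - A @ Ksu.T + A @ Sigma_u @ A.T   # predictive variance
```

Expanding the variance expression shows it equals k_{**} − k_{*u} K_uu⁻¹ (K_uu − Σ_u) K_uu⁻¹ k_{u*}, the form given above.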

Hyper-parameter Initialization
In our experiments, we use the empirical spectral densities to initialize the hyper-parameters, as recommended in [10,24]. Differently from these works, we apply a Blackman window function to the training data to improve the quality of the empirical spectral densities, e.g. the signal-to-noise ratio (SNR), and to more easily discover certain characteristics of the signal, e.g. magnitude and frequency. We consider the windowed empirical spectral densities p(Θ|s) as derived from the data, and then apply a Bayesian Gaussian mixture model (GMM) in order to get the Q cluster centers of the Gaussian spectral densities [10].
We use the Expectation Maximization algorithm [15] to estimate the parameters w̃_i, μ̃_i, and Σ̃_i. The results are used as initial values of w_i, μ_i, and Σ_i, respectively.
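The windowing step of this initialization can be sketched as follows. For brevity, simple peak-picking on the windowed periodogram stands in for the Bayesian GMM / EM fit described above; the synthetic signal and all values are illustrative.

```python
# Sketch of the hyper-parameter initialization: Blackman-window the signal,
# take the empirical spectral density, and use its dominant peak as an
# initial frequency mu_i.  (The paper fits a GMM; peak-picking is a stand-in.)
import numpy as np

rng = np.random.default_rng(1)
n, dt = 512, 0.1
t = np.arange(n) * dt
signal = np.sin(2 * np.pi * 0.7 * t) + 0.2 * rng.standard_normal(n)

windowed = signal * np.blackman(n)            # reduce spectral leakage
density = np.abs(np.fft.rfft(windowed)) ** 2  # empirical spectral density
freqs = np.fft.rfftfreq(n, d=dt)

mu_init = freqs[np.argmax(density[1:]) + 1]   # dominant peak, skipping DC
print(mu_init)  # close to the true frequency 0.7
```

Windowing trades a slightly wider main lobe for much lower side lobes, which is what makes weak spectral peaks easier to separate from leakage before clustering.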

Experiments
We comparatively assess the performance of GCSM on real-world datasets. Three of these datasets have been used in the literature on GP methods. The other is a relatively new dataset which we use to illustrate the capability of GPs with the considered kernels to model irregular long term increasing trends. We use the Mean Squared Error (MSE = (1/n) Σ_{i=1}^n (y_i − ỹ_i)²) as the performance metric for all tasks. We use the 95% confidence interval (instead of, e.g., an error bar) to quantify uncertainty (see Eq. (2)). In addition to these performance metrics, we also consider the posterior correlation ρ*_ij (see Eq. (10)) to illustrate the underlying dependency between SM components. Moreover, to illustrate the dependency between components captured by the cross components in our GCSM kernel, we use the normalized cross-correlation term γ_ij (see Eq. (19)), which we call the learned dependency between components i and j. Note that γ_ij = 1 when i = j. In our experiments we will analyze the dependency between components in the SM kernel for GPs as expressed by the posterior covariance, and the dependency modeled by GCSM kernels for GPs as expressed by the γ_ij's. We compare GCSM with ordinary SM on prediction tasks on four real-life datasets: monthly average atmospheric CO2 concentrations [12,18], monthly ozone concentrations, air revenue passenger miles, and the larger multidimensional abalone dataset.
As baselines for comparison we consider the popular kernels implemented in the GPML toolbox [18]: linear with bias (LIN), SE, polynomial (Poly), PER, rational quadratic (RQ), MA, Gabor, fractional Brownian motion covariance (FBM), underdamped linear Langevin process covariance (ULL), neural network (NN) and SM kernels. For the considered multidimensional dataset, we use automatic relevance determination (ARD) for the other kernels to remove irrelevant inputs. The FBM and ULL kernels are only available for time series data, so they are not applied to this dataset. We use the GPML toolbox [17] and GPflow [14] for ordinary and scalable inference, respectively. For GCSM, we calculate the gradients of the parameters analytically. In all experiments we use the hyper-parameter initialization previously described for the SM and GCSM kernels.

The monthly average atmospheric CO2 concentration dataset (cf. e.g. [18]) is a popular benchmark which shows the advantage and flexibility of GPs, due to the presence of multiple patterns with different scales in the data, such as long-term, seasonal and short-term trends. The dataset was collected at the Mauna Loa Observatory, Hawaii, between 1958 and 2003. We use 40% of the location points as training data and the remaining 60% as testing data. For both GCSM and SM we consider Q = 10 components. The Gaussian mixture of the empirical spectral densities is used to initialize the hyper-parameters. Figure 3(a) shows that GCSM (in dashed blue) performs better than ordinary SM (in red) in terms of predictive mean and variance. Moreover, GCSM yields a smaller confidence interval than SM. Unlike SM, GCSM does not overestimate the long-term trend. As for the analysis of the posterior correlation and learned dependency, evidence of positive and negative posterior correlations ρ*_ij can be observed for SM components (3, 4, 7) (left subplot in Fig. 4). These posterior correlations have been used for prediction (see Supplementary material).
The right plot in Fig. 4 shows clear evidence of the learned dependency γ_ij for GCSM components (2, 3, 4). GCSM and SM are optimized independently, so component identifiers in the figures do not necessarily correspond to each other. Observe that the plots for the GCSM kernel with i = j (right subplot) show stripes because of the normalization term in Eq. (19).

We consider the monthly ozone concentration dataset (216 values), collected at Downtown L.A. over the time range Jan 1955-Dec 1972. This dataset has different characteristics than the CO2 concentration one, namely a gradual long term downtrend, and irregular peak values in the training data which are much higher than those in the testing data. These characteristics make extrapolation a challenging task. Here we use the first 60% of the observations for training and the remaining 40% for testing (shown in black and green in Fig. 5, respectively). Again we consider Q = 10 components for both kernels. Figure 5 shows that the ozone concentration signal has a long term decreasing tendency, while the training part has a relatively stable evolution. SM fails to discover this long term decreasing tendency and overestimates the future trend with low confidence. Instead, GCSM is able to confidently capture the long term decreasing tendency. These results substantiate the beneficial effect of using cross components for correcting overestimation and for reducing predictive uncertainty.

Modeling Irregular Long Term Decreasing Trends
Results in Table 1 show that on this dataset GCSM consistently achieves a lower MSE than SM and the other baselines. Figure 6 shows the posterior correlation (left plot) and the learned dependency (right plot). The texture of the posterior correlation ρ*_ij among SM components (2, 6, 7) reveals a more complicated posterior correlation between these components than in the previous experiment. The learned dependency γ_ij is clearly visible between components (2, 3, 7).

Modeling Irregular Long Term Increasing Trends
In this experiment we consider another challenging extrapolation task, using the air revenue passenger miles dataset (time range Jan 2000-Apr 2018), collected monthly by the U.S. Bureau of Transportation Statistics. Given the first 60% of the recordings of the time series, we wish to extrapolate the remaining 40% of the observations. In this setting we can observe an apparent long term oscillation tendency in the training observations which is not present in the testing data. As shown in Fig. 7, even though at the beginning there seems to be a decreasing trend, due to the 9/11 attack in 2001 and to 2010 being known as a disappointing year for safety, there is an overall positive trend as a result of a booming airline market and extensive globalization. This dataset thus illustrates the need for GCSM in real-life scenarios where a spurious long term oscillation tendency appears in the training data but does not persist in the testing data.
Results in Table 1 show that on this dataset GCSM consistently achieves a lower MSE than SM and the other baselines. In particular, kernels such as SE, Periodic and Matérn 5/2 perform poorly on this extrapolation task.

Prediction with Large Scale Multidimensional Data
After comparing GCSM and SM on extrapolation tasks on time series with diverse characteristics, we comparatively investigate their performance on a prediction task using a large multidimensional dataset, the abalone dataset. The dataset consists of 4177 instances with 8 attributes: Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, and Shell weight. The goal is to predict the age of an abalone from physical measurements. An abalone's age is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Thus the task is to predict the number of rings from the above mentioned attributes. We use the first 3377 instances as training data and the remaining 800 as testing data. For both GCSM and SM we use Q = 5 components. We use the windowed empirical density to initialize the hyper-parameters, as described in Sect. 7.1. Here the components are multivariate Gaussian distributions in the frequency domain.
Results in Table 1 show that also on this type of task GCSM achieves a lower MSE than SM. The SM and GCSM kernels achieve comparable performance in terms of NLML (see the right part of Table 1). This may seem surprising, given the smaller uncertainty and MSE obtained by GCSM. However, note that the NLML is the sum of two terms (plus a constant term that is ignored): a data fit term and a complexity penalty. The first term is maximized when the data fits the model very well. The second term penalizes the complexity of the model, favoring smoother models. Optimizing the NLML finds a balance between the two, and this balance changes with the observed data.
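The NLML decomposition discussed above can be made explicit. A minimal sketch, with an illustrative SE kernel and synthetic data, computing the data-fit term, the log-determinant complexity penalty, and the constant:

```python
# NLML = data fit + complexity penalty + constant, computed explicitly
# via a Cholesky factorization.  Kernel and data are illustrative.
import numpy as np

def nlml(K, y, noise=1e-2):
    n = len(y)
    Kn = K + noise * np.eye(n)
    L = np.linalg.cholesky(Kn)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ alpha               # small when the model fits well
    complexity = np.sum(np.log(np.diag(L)))  # = 0.5 * log |Kn|, smoothness penalty
    const = 0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + const

X = np.linspace(0, 3, 10)
d = X[:, None] - X[None, :]
K = np.exp(-0.5 * d ** 2)                    # SE Gram matrix
y = np.sin(X)
val = nlml(K, y)
```

Two kernels can therefore reach similar NLML values by trading fit against smoothness differently, which is why comparable NLML is consistent with GCSM's lower MSE.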
Overall, results indicate the beneficial effect of modeling directly dependencies between components, as done in our kernel.

Conclusion
We proposed the generalized convolution spectral mixture (GCSM) kernel, a generalization of SM kernels with an expressive closed form, which models dependencies between components using cross convolution in the frequency domain.
Experiments on real-life datasets indicate that the proposed kernel, when used in GPs, can identify and model the complex structure of the data and be used to perform long-term trend forecasting. Although here we do not focus on non-stationary kernels, GCSM can be transformed into a non-stationary GCSM by parameterizing the weights w_i(x), means μ_i(x), and variances σ_i(x) as functions of the input by means of a Gaussian function. Future work includes the investigation of such more general non-stationary GCSM kernels.
An issue that remains to be investigated is efficient inference. This is a core issue in GP methods which needs to be addressed also for GPs with GCSM kernels. Levy process priors as proposed in [11] present a promising approach for tackling this problem, by regularizing spectral mixture for automatic selection of the number of components and pruning of unnecessary components.