Abstract
Gaussian processes (GPs) are distributions over functions, which provide a Bayesian nonparametric approach to regression and classification. In spite of their success, GPs have limited use in some applications; for example, in some cases a distribution that is symmetric with respect to its mean is an unreasonable model. Such symmetry implies, for instance, that the mean and the median coincide, while the mean and median of an asymmetric (skewed) distribution can be different numbers. In this paper, we propose skew-Gaussian processes (SkewGPs) as a nonparametric prior over functions. A SkewGP extends the multivariate unified skew-normal distribution over finite-dimensional vectors to a stochastic process. The SkewGP class of distributions includes GPs; therefore, SkewGPs inherit all the good properties of GPs and increase their flexibility by allowing asymmetry in the probabilistic model. By exploiting the fact that the SkewGP prior and the probit likelihood form a conjugate model, we derive closed-form expressions for the marginal likelihood and predictive distribution of this new nonparametric classifier. We verify empirically that the proposed SkewGP classifier provides better performance than a GP classifier based on either Laplace’s method or expectation propagation.
1 Introduction
Gaussian processes (GPs) extend multivariate Gaussian distributions over finite-dimensional vectors to infinite dimensionality. Specifically, a GP defines a distribution over functions, that is, each draw from a Gaussian process is a function. Therefore, GPs provide a principled, practical, and probabilistic approach to nonparametric regression and classification, and they have successfully been applied to different domains (Rasmussen and Williams 2006).
GPs have several desirable mathematical properties. The most appealing one is that, for regression with Gaussian noise, the prior distribution is conjugate for the likelihood function. Therefore the Bayesian update step is analytic, as is computing the predictive distribution for the function behavior at unknown locations. In spite of their success, GPs have several known shortcomings.
First, the Gaussian distribution is not a “heavy-tailed” distribution, and so it is not robust to extreme outliers. Alternatives to GPs have been proposed, of which the most notable example is the class of elliptical processes (Fang 2018), such as Student-t processes (O’Hagan 1991; Zhang et al. 2007), where any collection of function values has a desired elliptical distribution, with a covariance matrix built using a kernel.
Second, the Gaussian distribution is symmetric with respect to its mean. This implies, for instance, that its mean and median coincide, while the mean and median in an asymmetric (skewed) distribution can be different numbers. This constraint limits GPs’ flexibility and affects the coverage of their credible intervals (regions)—especially when considering that symmetry must hold for all components of the (latent) function and that, as for instance discussed by Kuss and Rasmussen (2005); Nickisch and Rasmussen (2008), the exact posterior of a GP classifier is skewed.
To overcome this second limitation, in this paper we propose skew-Gaussian processes (SkewGPs) as a nonparametric prior over functions. A SkewGP extends the multivariate unified skew-normal distribution defined over finite-dimensional vectors to a stochastic process, i.e., a distribution over infinite-dimensional objects. A SkewGP is completely defined by a location function, a scale function and three additional parameters that depend on a latent dimension: a skewness function, a truncation vector and a covariance matrix. It is worth noting that a SkewGP reduces to a GP when the latent variables have dimension zero. Therefore, SkewGPs inherit all the good properties of GPs and increase their flexibility by allowing asymmetry in the probabilistic model.
We focus on applying this new nonparametric model to a classification problem. In the case of parametric models, Durante (2019) shows that the posterior distribution arising from a probit likelihood and a Gaussian prior is a unified skew-normal distribution. This novel result allowed the author to efficiently compute full posterior inferences for Bayesian probit regression (for small datasets, \(n\approx 100\)). Moreover, the author showed that the unified skew-normal distribution is a conjugate prior for the probit likelihood (without using this prior model for data analysis).
Here we extend this result to the nonparametric case: we derive a semi-analytical expression for the posterior distribution of the latent function and predictive probabilities for SkewGPs. The term semi-analytical is adopted to indicate that posterior inferences require the computation of the cumulative distribution function of a multivariate Gaussian distribution (i.e., the computation of Gaussian orthant probabilities). By using a new formulation (Gessner et al. 2019) of elliptical slice sampling (Murray et al. 2010), lin-ess, which permits efficient sampling from a linearly constrained (e.g., orthant) Gaussian domain, we show that posterior inferences for SkewGP binary classifiers can be computed efficiently. Lin-ess is a special case of elliptical slice sampling that leverages the analytic tractability of intersections of ellipses and hyperplanes to speed up the elliptical slice algorithm. In particular, this guarantees rejection-free sampling and is therefore also highly parallelizable.
The main contributions of this paper are

1. we propose a new class of stochastic processes, called skew-Gaussian processes (SkewGPs), which generalizes GP models;

2. we show that a SkewGP prior process is conjugate for the probit likelihood, thus deriving for the first time the posterior distribution of a GP classifier in analytic form;

3. we derive an efficient way to learn the hyperparameters of a SkewGP and compute Monte Carlo predictions using lin-ess, showing that our model has a bottleneck computational complexity similar to that of GPs;

4. we evaluate the proposed SkewGP classifier against state-of-the-art implementations of the GP classifier, which approximate the posterior with the Laplace method or with Expectation Propagation;

5. we show on a small image classification dataset that a SkewGP prior can lead to better uncertainty quantification than a GP prior.
2 Background
The skew-normal distribution is a continuous probability distribution that generalises the normal distribution to allow for nonzero skewness. The probability density function (PDF) of the univariate skew-normal distribution with location \(\xi \in \mathbb {R}\), scale \(\sigma >0\) and skew parameter \(\alpha \in \mathbb {R}\) is given by (O’Hagan and Leonard 1976):
\( p(z) = \frac{2}{\sigma }\,\phi \left( \frac{z-\xi }{\sigma }\right) \varPhi \left( \alpha \frac{z-\xi }{\sigma }\right) , \)
where \(\phi \) and \(\varPhi \) are the PDF and, respectively, the cumulative distribution function (CDF) of the standard univariate normal distribution. This distribution has been generalised in several ways; see Azzalini (2013) for a review. In particular, Arellano and Azzalini (2006) provided a unification of the above generalizations within a single, tractable multivariate unified skew-normal distribution that satisfies closure properties for marginals and conditionals and allows more flexibility due to the introduction of additional parameters.
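For concreteness, the univariate skew-normal density can be evaluated and sanity-checked in a few lines (function name ours); note that it reduces to the normal PDF for \(\alpha =0\):

```python
import numpy as np
from scipy.stats import norm

def skew_normal_pdf(z, xi=0.0, sigma=1.0, alpha=0.0):
    # (2/sigma) * phi((z - xi)/sigma) * Phi(alpha * (z - xi)/sigma)
    t = (z - xi) / sigma
    return 2.0 / sigma * norm.pdf(t) * norm.cdf(alpha * t)
```

A quick check: the density integrates to one for any \(\alpha \), and with \(\alpha =0\) it coincides with \(\phi \).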
2.1 Unified skew-normal distribution
A vector \(\varvec{z}\in \mathbb {R}^p\) is said to have a multivariate unified skew-normal distribution with latent skewness dimension s, \(\varvec{z}\sim \text {SUN}_{p,s}(\varvec{\xi },\varOmega ,\varDelta ,\varvec{\gamma },\varGamma )\), if its probability density function (Azzalini 2013, Ch.7) is:
\( p(\varvec{z}) = \phi _p(\varvec{z}-\varvec{\xi };\varOmega )\, \dfrac{\varPhi _s\left( \varvec{\gamma }+\varDelta ^T \bar{\varOmega }^{-1} D_\varOmega ^{-1}(\varvec{z}-\varvec{\xi });\, \varGamma -\varDelta ^T \bar{\varOmega }^{-1}\varDelta \right) }{\varPhi _s(\varvec{\gamma };\varGamma )}, \)    (1)
where \(\phi _p(\varvec{z}-\varvec{\xi };\varOmega )\) represents the PDF of a multivariate normal distribution with mean \(\varvec{\xi }\in \mathbb {R}^p\) and covariance \(\varOmega =D_\varOmega \bar{\varOmega } D_\varOmega \in \mathbb {R}^{p\times p}\), with \(\bar{\varOmega }\) being a correlation matrix and \(D_\varOmega \) a diagonal matrix containing the square root of the diagonal elements of \(\varOmega \). The notation \(\varPhi _s(\varvec{a};M)\) denotes the CDF of \(N_s(0,M)\) evaluated at \(\varvec{a}\in \mathbb {R}^s\). The parameters \(\varvec{\gamma }\in \mathbb {R}^s, \varGamma \in \mathbb {R}^{s\times s}, \varDelta \in \mathbb {R}^{p \times s}\) of the SUN distribution are related to a latent variable that controls the skewness; in particular, \(\varDelta \) is called the skewness matrix. The PDF (1) is well-defined provided that the matrix
\( M = \begin{bmatrix} \varGamma & \varDelta ^T \\ \varDelta & \bar{\varOmega } \end{bmatrix} \in \mathbb {R}^{(s+p)\times (s+p)} \)    (2)
is positive definite. Note that, when \(\varDelta =0\), (1) reduces to \(\phi _p(\varvec{z}-\varvec{\xi };\varOmega )\). Moreover, we assume that \(\varPhi _0(\cdot )=1\), so that, for \(s=0\), (1) becomes a multivariate normal distribution.
The rôle of the latent dimension s can be briefly explained as follows. Consider a random vector \(\begin{bmatrix} \varvec{x}_0 \\ \varvec{x}_1 \end{bmatrix} \sim N_{s+p}(0,M)\) with M as in (2) and define \(\mathbf {y}\) as the vector with distribution \((\varvec{x}_1 \mid \varvec{x}_0+\varvec{\gamma }>0)\); then it can be shown (Azzalini 2013, Ch. 7) that \(\varvec{z}= \varvec{\xi }+ D_\varOmega \mathbf {y}\sim \text {SUN}_{p,s}(\varvec{\xi },\varOmega ,\varDelta ,\varvec{\gamma },\varGamma )\). This representation will be used in Sect. 5 to draw samples from the distribution. Figure 1 shows the density of a univariate SUN distribution with latent dimensions \(s=1\) (a1) and \(s=2\) (a2). The effect of a higher latent dimension can be better observed in the bivariate SUN densities shown in Fig. 2. The contours of the corresponding bivariate normal are dashed. We also plot the skewness directions given by \(\bar{\varOmega }^{-1}\varDelta \).
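This latent-variable representation suggests a direct, if naive, way to draw SUN samples for small s: simulate \((\varvec{x}_0,\varvec{x}_1)\) jointly and keep the draws satisfying \(\varvec{x}_0+\varvec{\gamma }>0\). A minimal sketch (function name ours; rejection is used here only for illustration, the efficient approach is discussed in Sect. 5):

```python
import numpy as np

def sun_rvs(xi, Omega, Delta, gamma, Gamma, size=20000, seed=0):
    """Draw SUN_{p,s} samples via the representation z = xi + D_Omega * y,
    with y distributed as (x1 | x0 + gamma > 0) and (x0, x1) ~ N_{s+p}(0, M)."""
    rng = np.random.default_rng(seed)
    p, s = Omega.shape[0], Gamma.shape[0]
    d = np.sqrt(np.diag(Omega))
    Omega_bar = Omega / np.outer(d, d)
    M = np.block([[Gamma, Delta.T], [Delta, Omega_bar]])  # matrix in (2)
    samples = []
    while len(samples) < size:
        x = rng.multivariate_normal(np.zeros(s + p), M, size=size)
        accepted = x[np.all(x[:, :s] + gamma > 0, axis=1), s:]  # keep x1 rows
        samples.extend(accepted)
    return np.asarray(xi) + d * np.array(samples[:size])
```

With \(\varDelta =0\) the draws are plain Gaussian; a positive \(\varDelta \) with \(\varvec{\gamma }=0\) shifts mass (and hence the mean) upwards, which is the skewing mechanism described above.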
The skew-normal family has several interesting properties; see Azzalini (2013, Ch.7) for details. Most notably, it is closed under marginalization and affine transformations. Specifically, if we partition \(z = [z_1 , z_2]^T\), where \(z_1 \in \mathbb {R}^{p_1}\) and \(z_2 \in \mathbb {R}^{p_2}\) with \(p_1+p_2=p\), and partition \(\varvec{\xi },\varOmega ,\varDelta \) accordingly, then \(\varvec{z}_1 \sim \text {SUN}_{p_1,s}(\varvec{\xi }_1,\varOmega _{11},\varDelta _1,\varvec{\gamma },\varGamma )\).
Moreover (Azzalini 2013, Ch.7), the conditional distribution is a unified skew-normal, i.e., \((Z_2 \mid Z_1=z_1) \sim \text {SUN}_{p_2,s}(\varvec{\xi }_{2|1},\varOmega _{2|1},\varDelta _{2|1},\varvec{\gamma }_{2|1},\varGamma _{2|1})\), where
and \(\bar{\varOmega }_{11}^{-1}:=(\bar{\varOmega }_{11})^{-1}\).
3 Skew-Gaussian process
In this section, we define a skew-Gaussian process (SkewGP). Consider the functions \(\xi :\mathbb {R}^p \rightarrow \mathbb {R}\), a location function, \(\varOmega : \mathbb {R}^p \times \mathbb {R}^p \rightarrow \mathbb {R}\), a scale function, \(\varDelta :\mathbb {R}^p \rightarrow \mathbb {R}^s\), a skewness vector function, and the parameters \(\varvec{\gamma }\in \mathbb {R}^s,\varGamma \in \mathbb {R}^{s \times s}\).
We say that a real function \(f: \mathbb {R}^p \rightarrow \mathbb {R}\) is distributed as a skew-Gaussian process with latent dimension s if, for any sequence of n points \(\varvec{x}_1,\dots ,\varvec{x}_n \in \mathbb {R}^p\), the vector \(f(X)=[f(\varvec{x}_1),\dots ,f(\varvec{x}_n)]^T\) is skew-Gaussian distributed with parameters \(\varvec{\gamma },\varGamma \) and location, scale and skewness matrices, respectively, given by
The skewGaussian distribution above is well defined provided that the matrix
\( M(X) = \begin{bmatrix} \varGamma & \varDelta (X)^T \\ \varDelta (X) & \bar{\varOmega }(X,X) \end{bmatrix} \)
is positive definite for all \(X=\{\varvec{x}_1,\dots ,\varvec{x}_n\} \subset \mathbb {R}^p\) and for all \(n \ge 1\). In this case we can write
\( f \sim \text {SkewGP}_s(\xi , \varOmega , \varDelta , \varvec{\gamma }, \varGamma ). \)    (5)
We detail how to select the parameters in Sect. 4; the proposition below shows that a SkewGP is a proper stochastic process.
Proposition 1
The construction of a SkewGaussian process from a finitedimensional distribution is wellposed.
All the proofs are in “Appendices 1 and 2”.
3.1 Binary classification
Consider the training data \(\mathcal {D}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\), where \(\varvec{x}_i \in \mathbb {R}^p\) and \(y_i \in \{0,1\}\). We aim to build a nonparametric binary classifier. We first define a probabilistic model \(\mathcal {M}\) by assuming that \(f \sim \text {SkewGP}(\xi , \varOmega , \varDelta , \varvec{\gamma }, \varGamma )\) and considering a probit model for the likelihood:
\( p(\varvec{y}\mid f(X)) = \prod _{i=1}^n \varPhi \big ((2y_i-1)\,f(\varvec{x}_i)\big ) = \varPhi _n\big (W f(X);\, I_n\big ), \)    (6)
where \(W=\text {diag}(2y_1-1,\dots ,2y_n-1)\). A SkewGP prior combined with a probit likelihood gives rise to a posterior SkewGP over functions, because skew-Gaussian distributions are conjugate priors for probit models. In the finite-dimensional parametric case, this property was shown by Durante (2019); hereafter we extend it to the nonparametric one.
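The probit likelihood and the sign matrix W can be evaluated directly; a minimal sketch (function name ours):

```python
import numpy as np
from scipy.stats import norm

def probit_likelihood(f, y):
    """p(y | f) = prod_i Phi((2 y_i - 1) f(x_i)), with W = diag(2y - 1)."""
    W = np.diag(2 * np.asarray(y) - 1)  # maps labels {0,1} to signs {-1,+1}
    return float(np.prod(norm.cdf(W @ np.asarray(f, float))))
```

For instance, two latent values at zero give probability \(0.5 \times 0.5 = 0.25\) regardless of the labels, while a large latent value with a matching label gives probability close to one.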
Theorem 1
The posterior of f(X) is a skew-Gaussian distribution, i.e.,
\( (f(X) \mid \mathcal {D}) \sim \text {SUN}_{n,s+n}(\tilde{\xi },\tilde{\varOmega },\tilde{\varDelta },\tilde{\varvec{\gamma }},\tilde{\varGamma }), \)    (7)
where, for simplicity of notation, we have denoted \(\xi (X),\varOmega (X,X),\varDelta (X)\) as \(\xi ,\varOmega ,\varDelta\) and \(\varOmega = D_\varOmega \bar{\varOmega } D_\varOmega\).
From Theorem 1 we can immediately derive the following result.
Corollary 1
The marginal likelihood of the observations \(\mathcal {D}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) given the probabilistic model \(\mathcal {M}\), that is the prior (5) and likelihood (6), is
\( p(\mathcal {D}\mid \mathcal {M}) = \dfrac{\varPhi _{s+n}(\tilde{\varvec{\gamma }};\,\tilde{\varGamma })}{\varPhi _{s}(\varvec{\gamma };\,\varGamma )}, \)    (12)
with \(\tilde{\varvec{\gamma }},\tilde{\varGamma }\) defined in Theorem 1.
In classification, based on the training data \(\mathcal {D}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) and given a test input \(\varvec{x}^*\), we aim to predict the probability that \(y^*=1\).
Corollary 2
The posterior predictive probability of \(y^*=1\) given the test input \(\varvec{x}^* \in \mathbb {R}^p\) and the training data \(\mathcal {D}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) is
\( p(y^*=1 \mid \varvec{x}^*, X, \varvec{y}) = \dfrac{\varPhi _{s+n+1}(\tilde{\varvec{\gamma }}^*;\,\tilde{\varGamma }^*)}{\varPhi _{s+n}(\tilde{\varvec{\gamma }};\,\tilde{\varGamma })}, \)    (13)
where \(\tilde{\varvec{\gamma }}^*,\tilde{\varGamma }^*\) are the corresponding matrices of the posterior computed for the augmented dataset \(\hat{X}=[X^T,{\varvec{x}^*}^T]^T\), \(\hat{y}=[y^T,1/2]^T\).
Note that the dummy value \(\frac{1}{2}\) in \(\hat{y}\) does not influence the value of \(p(y^*=1 \mid \varvec{x}^*, X,y)\) and was chosen only for mathematical convenience, as it allows for marginalization over \(f(\varvec{x}^*)\) and for deriving the expressions of \(\tilde{\varvec{\gamma }}^*,\tilde{\varGamma }^*\) similarly to those in Theorem 1.^{Footnote 1}
4 Prior functions, parameters and hyperparameters
A SkewGP prior is completely defined by the location function \(\xi (\varvec{x})\), the scale function \(\varOmega (\varvec{x},\varvec{x}')\), the latent dimension \(s\in \mathbb {N}\), the skewness vector function \(\varDelta (\varvec{x}) \in \mathbb {R}^s\) and the parameters \(\varvec{\gamma }\in \mathbb {R}^s,\varGamma \in \mathbb {R}^{s \times s}\). As is common for GPs, we will take the location function \(\xi (\varvec{x})\) to be zero, although this is not necessary. Let \(K(\varvec{x},\varvec{x}')\) be a positive definite covariance function (kernel) and let \(\varOmega = K(X,X)\) be the covariance matrix obtained by applying \(K(\varvec{x},\varvec{x}')\) elementwise to the training data X. In this paper, we propose the following way to define the location, scale and skewness functions of a SkewGP:
\( \xi (\varvec{x}) = 0, \quad \varOmega (\varvec{x},\varvec{x}') = K(\varvec{x},\varvec{x}'), \quad \varDelta (\varvec{x}) = L\,\bar{K}(R,\varvec{x}), \quad \varGamma = L\,\bar{K}(R,R)\,L, \)    (14)
where \(L \in \mathbb {R}^{s \times s}\) is a diagonal matrix whose elements \(L_{ii} \in \{-1,1\}\) (a phase), \(R=[\varvec{r}_1,\dots ,\varvec{r}_s]^T \in \mathbb {R}^{s \times p}\) is a matrix of s pseudo-points, and \(\bar{K}(\varvec{x},\varvec{x}')=\frac{1}{\sigma ^2}K(\varvec{x},\varvec{x}')\) (for stationary kernels), with \(K(\varvec{x},\varvec{x}')\) being the kernel function and \(\sigma ^2\) the variance parameter of the kernel; e.g., for the RBF kernel
\( K(\varvec{x},\varvec{x}') = \sigma ^2 \exp \left( -\frac{\Vert \varvec{x}-\varvec{x}'\Vert ^2}{2\ell ^2}\right) . \)
It can easily be proven that \(M>0\) and, therefore, (2) holds. We select the parameters of the kernel, \(\sigma ,\ell \), the locations \(\varvec{r}_i\) of the pseudo-points and the phase diagonal matrix L by maximizing the marginal likelihood. In particular, we exploit the lower bound (16) explained in Sect. 5.
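The prior matrices and the positive-definiteness check of M can be sketched in a few lines. This assumes the parameterization \(\varDelta (X)=\bar{K}(X,R)L\), \(\varGamma = L\bar{K}(R,R)L\) (our reading of the construction; function names ours): M is then the normalized kernel matrix of the stacked points [R; X] conjugated by diag(L, I), hence positive definite for distinct points and a strictly positive definite kernel.

```python
import numpy as np

def rbf(A, B, sigma2=1.0, ell=0.3):
    # K(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-d2 / (2.0 * ell ** 2))

def skewgp_prior_matrices(X, R, L, sigma2=1.0, ell=0.3):
    """Scale, skewness and latent parameters of a SkewGP prior at inputs X,
    with pseudo-points R and phase matrix L (assumed parameterization)."""
    Omega = rbf(X, X, sigma2, ell)                # scale matrix K(X, X)
    Kbar_XR = rbf(X, R, sigma2, ell) / sigma2     # normalized cross-kernel
    Kbar_RR = rbf(R, R, sigma2, ell) / sigma2
    Delta = Kbar_XR @ L                           # skewness matrix, n x s
    Gamma = L @ Kbar_RR @ L                       # latent covariance, s x s
    d = np.sqrt(np.diag(Omega))
    M = np.block([[Gamma, Delta.T], [Delta, Omega / np.outer(d, d)]])
    return Omega, Delta, Gamma, M
```

Because \(L\) is diagonal with entries \(\pm 1\), the conjugation leaves the spectrum of the normalized kernel matrix unchanged, so \(M>0\) comes for free.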
Similarly to the inducing points in the sparse approximation of GPs (Quiñonero-Candela and Rasmussen 2005; Snelson and Ghahramani 2006; Titsias 2009; Bauer et al. 2016), the points \(\varvec{r}_i\) can be viewed as a set of s latent variables. However, their rôle is completely different from that of inducing points: they allow us to locally modulate the skewness of the SkewGP. Figure 3 shows latent functions (in gray, second column) drawn from a SkewGP with latent dimension 2 and the result of squashing these sample functions through the probit function (first column). In all cases, we have considered \(\xi (\varvec{x})=0\) and an RBF kernel with \(\ell =0.3\) and \(\sigma ^2=1\). The values of the other parameters of the SkewGP are reported at the top of the plots in the first column. The green line is the mean function and the red dots represent the locations of the \(s=2\) latent pseudo-points. For large positive values of \(\gamma _1, \gamma _2\), the SkewGP is equivalent to a GP (plots (a1)–(a2)). As \(\gamma _i\), \(i=1,2\), decrease (plots (b1)–(b2)), the mean shifts up and the mass of the distribution is concentrated at the top of the figure. By changing the phase (the sign of) \(L_{22}\) (plots (c1)–(c2)), the mean and the mass of the distribution shift down at the location of the second pseudo-point \(r_2\). We can magnify this effect by decreasing both \(\gamma _i\) (plots (d1)–(d2)). It is also possible to introduce skewness without changing the mean (plots (e1)–(e2)). In this latter case, \(r_1=r_2\) and the mass of the distribution is shifted up.
5 Computational complexity
Corollaries 1 and 2 provide two straightforward ways to compute the marginal likelihood and the predictive posterior probability; however, Eqs. (12) and (13) both require the accurate computation of \(\varPhi _{s+n}\). Quasi-randomized Monte Carlo methods (Genz 1992; Genz and Bretz 2009; Botev 2017) allow the calculation of \(\varPhi _{s+n}\) for small n (a few hundred observations). Therefore, these procedures are not in general suitable for medium and large n, apart from special cases (Phinikettos and Gandy 2011; Genton et al. 2018; Azzimonti and Ginsbourger 2018). We overcome this issue with an effective use of sampling for the predictive posterior and a mini-batch approach for the marginal likelihood.
5.1 Posterior predictive distribution
In order to compute the predictive distribution we generate samples from the posterior distribution at training points and then exploit the closure properties of the SUN distribution to obtain samples at test points. The following result from Azzalini (2013) allows us to draw independent samples from the posterior in (7):
where \(\mathcal {T}_{\tilde{\varvec{\gamma }}}(0;\tilde{\varGamma })\) is the pdf of a multivariate Gaussian distribution truncated componentwise below \(\tilde{\varvec{\gamma }}\). Equation (15) is a consequence of the additive representation of skew-normal distributions; see Azzalini (2013, Ch. 7.1.2 and 5.1.3) for more details. Note that sampling \(U_0\) can be achieved with standard methods; however, using standard rejection sampling for the variable \(U_1\) would incur exponentially growing sampling time as the dimension increases. A commonly used sampling technique for this type of problem is elliptical slice sampling (ess) (Murray et al. 2010), a Markov chain Monte Carlo algorithm that performs inference in models with Gaussian priors. This method looks for acceptable samples along elliptical slices and, by doing so, drastically reduces the number of rejected samples. Recently, Gessner et al. (2019) proposed an extension of ess, called linear elliptical slice sampling (lin-ess), for multivariate Gaussian distributions truncated on a region defined by linear constraints. In particular, this approach analytically derives the acceptable regions on the elliptical slices used in ess and thus guarantees rejection-free sampling. This leads to a large speed-up over ess, especially in high dimensions.
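The geometric idea behind lin-ess can be sketched compactly (an illustrative re-implementation under our own names, not Gessner et al.'s code): on the elliptical slice \(x\cos \theta + \nu \sin \theta \), each linear constraint is satisfied on an arc that can be found in closed form, so \(\theta \) is drawn uniformly from the feasible arcs and no proposal is ever rejected. Shown for a Gaussian restricted to the region \(x + \varvec{\gamma } > 0\):

```python
import numpy as np

def liness_step(x, nu, A, b, rng):
    """One rejection-free elliptical slice move for constraints A @ x + b > 0.
    Each constraint is feasible on an arc of the ellipse, found analytically."""
    ax, an = A @ x, A @ nu
    cuts = [0.0]
    for i in range(len(b)):
        r = np.hypot(ax[i], an[i])
        if r > abs(b[i]):                       # this constraint cuts the ellipse
            phi = np.arctan2(an[i], ax[i])
            alpha = np.arccos(-b[i] / r)
            cuts += [(phi - alpha) % (2 * np.pi), (phi + alpha) % (2 * np.pi)]
    cuts = np.sort(np.unique(cuts))
    ends = np.append(cuts[1:], cuts[0] + 2 * np.pi)
    mids = (cuts + ends) / 2.0
    pts = np.outer(x, np.cos(mids)) + np.outer(nu, np.sin(mids))
    feas = np.all(A @ pts + b[:, None] > 0, axis=0)   # feasibility per arc
    lens = (ends - cuts) * feas
    j = rng.choice(len(cuts), p=lens / lens.sum())    # arc picked by length
    t = rng.uniform(cuts[j], ends[j])
    return x * np.cos(t) + nu * np.sin(t)

def liness_truncated_normal(Sigma, gamma, n_samples=2000, seed=0):
    """MCMC samples from N(0, Sigma) restricted to x + gamma > 0 (componentwise)."""
    rng = np.random.default_rng(seed)
    n = Sigma.shape[0]
    A, b = np.eye(n), np.asarray(gamma, float)
    C = np.linalg.cholesky(Sigma)
    x = np.maximum(-b, 0.0) + 1.0                     # a strictly feasible start
    out = np.empty((n_samples, n))
    for k in range(n_samples):
        nu = C @ rng.standard_normal(n)               # auxiliary Gaussian draw
        x = liness_step(x, nu, A, b, rng)
        out[k] = x
    return out
```

Since the current point always lies on a feasible arc, every iteration produces a valid sample, which is the rejection-free property exploited in the paper.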
Given posterior samples at the training points, it is possible to compute the predictive posterior at a new test point \(\varvec{x}^*\) thanks to the following result.
Theorem 2
The posterior samples of the latent function computed at the test point \(\varvec{x}^*\) can be obtained by sampling \(f(\varvec{x}^*)\) from:
with f(X) sampled from the posterior \(\text {SUN}_{n,s+n}(\tilde{\xi },\tilde{\varOmega },\tilde{\varDelta },\tilde{\varvec{\gamma }},\tilde{\varGamma })\) in Theorem 1, where K is the kernel that defines the matrices \(\varGamma ,\varDelta , \varOmega \) as in Eq. (14), and where
and \(D_\varOmega =\text {diag}[D_\varOmega (X),D_\varOmega (\varvec{x}^*)]\) is a diagonal matrix containing the square root of the diagonal elements of the matrix on the right-hand side.
Observe that the computation of the predictive posterior requires the inversion of an \(n \times n\) matrix (\(\bar{\varOmega }_{11}\)), which has complexity \(O(n^3)\) and storage demands of \(O(n^2)\). SkewGPs thus have a bottleneck computational complexity similar to that of GPs. Moreover, note that sampling from \(\text {SUN}_{1,s}\) is extremely efficient when the latent dimension s is small (in the experiments we use \(s=2\)).
5.2 Marginal likelihood
As discussed in the previous section, in practical applications of SkewGP, the (hyper)parameters of the scale function \(\varOmega (\varvec{x},\varvec{x}')\) and of the skewness vector function \(\varDelta (\varvec{x}) \in \mathbb {R}^s\), as well as the parameters \(\varvec{\gamma }\in \mathbb {R}^s,\varGamma \in \mathbb {R}^{s \times s}\), have to be selected. As for GPs, we use Bayesian model selection to consistently set such parameters. This requires the maximization of the marginal likelihood with respect to these parameters; therefore, it is essential to provide a fast and accurate way to evaluate the marginal likelihood. In this paper, we propose a simple approximation of the marginal likelihood that allows us to efficiently compute a lower bound.
Proposition 2
Consider the marginal likelihood \(p(\mathcal {D}\mid \mathcal {M})\) in Corollary 1; then it holds
where \(B_1,\dots ,B_b\) denote a partition of the training dataset into b random disjoint subsets, \(|B_i|\) denotes the number of observations in the ith element of the partition, and \(\tilde{\varvec{\gamma }}_{B_i},\,\tilde{\varGamma }_{B_i}\) are the parameters of the posterior computed using only the subset \(B_i\) of the data.
If the batch size is small (in the experiments we have used \(|B_i|=30\)), then we can efficiently compute each term \(\varPhi _{s+|B_i|}(\tilde{\varvec{\gamma }}_{B_i};\,\tilde{\varGamma }_{B_i})\) by using a quasi-randomized Monte Carlo method. We can then optimize the hyperparameters of SkewGP by maximizing the lower bound in (16).
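Each batch term is a moderate-dimensional Gaussian CDF evaluation; a sketch of one such term via SciPy's interface to Genz's randomized quasi-Monte Carlo routine (function name ours; the inputs \(\tilde{\varvec{\gamma }}_{B_i},\tilde{\varGamma }_{B_i}\) would come from Theorem 1 restricted to the batch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_batch_term(gamma_tilde, Gamma_tilde):
    """log Phi_d(gamma_tilde; Gamma_tilde): the log-CDF of N_d(0, Gamma_tilde)
    evaluated at gamma_tilde, computed with Genz's randomized method."""
    d = len(gamma_tilde)
    mvn = multivariate_normal(mean=np.zeros(d), cov=Gamma_tilde)
    return mvn.logcdf(np.asarray(gamma_tilde, float))
```

Working in the log domain keeps the sum over batches numerically stable even when individual orthant probabilities are tiny.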
5.3 Computational load and parallelization
To evaluate the computational load, we have generated artificial classification data using a probit likelihood model and drawing the latent function \(f(X)=[f(\varvec{x}_1),\dots ,f(\varvec{x}_n)]\), with \(\varvec{x}_i \sim N(0,1)\), from a GP with zero mean and radial basis function kernel (lengthscale 0.5 and variance 2). We have then computed the full posterior of the latent function from Theorem 1, that is, a SkewGP posterior. Figure 4 compares the CPU time for sampling 1000 instances of f(X) from the SkewGP posterior, as a function of n, for lin-ess versus a standard elliptical slice sampler (ess) (5000 burn-in samples).^{Footnote 2} The computational advantage of lin-ess over ess is evident.
The average CPU time required to compute \(\varPhi _{s+|B_i|}(\tilde{\varvec{\gamma }}_{B_i};\,\tilde{\varGamma }_{B_i})\) with \(|B_i|=30\), using the randomized lattice routine with 5000 points (Genz 1992), is 0.5 s on a standard laptop. Since the above method is randomized, we use the simultaneous perturbation stochastic approximation algorithm (Spall 1998) to maximize the lower bound (16).^{Footnote 3}
Finally, notice that both the computation of the lower bound of the marginal likelihood and the sampling from the posterior via lin-ess are highly parallelizable. In fact, each term \(\varPhi _{s+|B_i|}(\tilde{\varvec{\gamma }}_{B_i};\,\tilde{\varGamma }_{B_i})\) can be computed independently, and each lin-ess sample can be drawn independently (because lin-ess is rejection-free and, therefore, no burn-in is necessary).
6 Properties of the posterior
In the above sections, we have shown how to compute the posterior distribution of a SkewGP when the likelihood is a probit model. The full conjugacy of the model allows us to prove that the posterior is again a SkewGP. This section provides more details on the properties of the posterior and compares it with two approximations. For GP classification, there are two main alternative approximation schemes for finding a Gaussian approximation to the posterior: Laplace’s method and the Expectation Propagation (EP) method; see, e.g., Rasmussen and Williams (2006), Chapter 3.
Figure 5 provides a one-dimensional illustration using a synthetic classification problem with 50 observations and scalar inputs taken from Kuss and Rasmussen (2005). Figure 5a shows the dataset and the predictive posterior probability for the Laplace and EP approximations. Moreover, by using a SkewGP prior with latent dimension \(s=0\) (which coincides with a GP prior), we have computed the exact SkewGP predictive posterior probability. Therefore, all three methods have the same prior: a GP with zero mean and RBF covariance function (the lengthscale and variance of the kernel are the same for all three methods and have been set equal to the values that maximise the Laplace approximation of the marginal likelihood). Figure 5c shows the posterior mean latent function and the corresponding 95% credible intervals. It is evident that the true posterior (SkewGP) of the latent function is skewed (see, for instance, \(x\in [0,2]\) and the slice plot in Fig. 5b). Laplace’s approximation peaks at the posterior mode, but places too much mass over positive values of f. The EP approximation aims to match the first two moments of the posterior and, therefore, usually obtains a better coverage of the posterior mass. That is why EP is usually the method of choice for approximate inference in GP classification.
Figure 6 shows the true posterior and the two approximations for the same dataset, but now the lengthscale and variance of the kernel are set to the optimal values for each of the three methods. It is evident that the skewness of the posterior provides a better model fit to the data.
Figure 7 shows the posteriors corresponding to a prior SkewGP process with latent dimension \(s=2\). The red dot denotes the optimal location of the pseudo-point \(r_1\), while \(r_2=13.5\) (their initial locations were 5.8 and 6, respectively). The additional degrees of freedom of the SkewGP prior process give a much more satisfactory answer than that obtained from a GP prior model. By comparing Figs. 6 and 7, it can be noticed that the pseudo-points allow us to locally modulate the skewness. Moreover, the additional degrees of freedom do not lead to overfitting, even with small data, as highlighted by the optimized location of \(r_2\) (far away), which has no effect on the skewness of the posterior SkewGP.
7 Results
We have evaluated the proposed SkewGP classifier on a number of benchmark classification datasets and compared its classification accuracy with the accuracy of a Gaussian process classifier that uses either Laplace’s method (GPL) or Expectation Propagation (GPEP) for approximate Bayesian inference. For GPL and GPEP, we have used the implementation available in GPy (GPy, since 2012).
7.1 Penn machine learning benchmarks
From the Penn Machine Learning Benchmarks (Olson et al. 2017), we have selected 124 datasets (number of features up to 500). Since this pool includes non-binary class datasets, we have defined a binary subproblem by considering the first (class 0) and second (class 1) class. The resulting binarised subset includes datasets with a number of instances between 100 and 7400. We have scaled the inputs to zero mean and unit standard deviation and used, as performance measure, the average information in bits of the predictions about the test targets (Kuss and Rasmussen 2005):
\( I = 1 + \frac{1}{n}\sum _{i=1}^{n} \left[ y_i^* \log _2 p_i^* + (1-y_i^*)\log _2 (1-p_i^*) \right] , \)
where \(p_i^*\) is the predicted probability for class 1. This score equals 1 bit if the true label is predicted with absolute certainty, 0 bits for random guessing, and takes negative values if the prediction assigns higher probability to the wrong class. We have assessed the above performance measure for the classifiers using 5-fold cross-validation.
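The score can be computed directly from the predictive probabilities; a minimal sketch (function name ours), assuming the baseline-adjusted form implied by the stated properties (1 bit for a certain correct prediction, 0 bits for coin flipping):

```python
import numpy as np

def information_score(p_star, y_star):
    """Average information (bits) of predictive probabilities about targets:
    1 + mean(y log2 p + (1 - y) log2(1 - p)); 1 = certain & correct, 0 = guessing."""
    p = np.clip(np.asarray(p_star, float), 1e-12, 1 - 1e-12)  # guard the logs
    y = np.asarray(y_star, float)
    return 1.0 + float(np.mean(y * np.log2(p) + (1 - y) * np.log2(1 - p)))
```

The clipping only prevents \(\log 0\); confident wrong predictions are still penalized with a strongly negative score.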
While we could use any kernel for GPL, GPEP and SkewGP, in this experiment we have chosen the RBF kernel with a lengthscale for each dimension. Figure 8 contrasts GPL and GPEP with SkewGP0 (SkewGP with \(s=0\)) and SkewGP2 (SkewGP with \(s=2\)). We selected \(s=2\) because we decided to use the same latent dimension for all datasets and, since there are several datasets where the ratio between the number of features and the number of instances is high, a latent dimension \(s>2\) leads to a number of parameters that exceeds the number of instances, affecting the convergence of the maximization of the marginal likelihood. The proposed SkewGP2 and SkewGP0 outperform the other two models on most datasets. The average information score of SkewGP2 is 0.573 (average accuracy 0.904), that of SkewGP0 is 0.557 (acc. 0.882), that of GPEP is 0.542 (acc. 0.859) and that of GPL is 0.512 (acc. 0.863).
This claim is supported by a statistical analysis. We have compared the classifiers using the (nonparametric) Bayesian signed-rank test (Benavoli et al. 2014, 2017). This test declares two classifiers practically equivalent when the difference in average information is less than 0.01 (1%). The interval \([-0.01,0.01]\) thus defines a region of practical equivalence (rope) for classifiers. The test returns the posterior probability of the vector \([p(Cl_1 > Cl_2), p(Cl_1 \approx Cl_2), p(Cl_1 < Cl_2)]\), and this posterior can therefore be visualised in the probability simplex (Fig. 9). For the comparison GPL versus GPEP, it can be seen that, as expected, GPEP is better than GPL.^{Footnote 4} Conversely, for GPEP versus SkewGP0, 100% of the posterior mass is in the region in favor of SkewGP0, which is the region at the bottom right of the triangle. This confirms that SkewGP0 is, in a practically significant way, better than GPL and GPEP. The comparison SkewGP2 versus SkewGP0 shows that the average information score of SkewGP2 is almost surely not worse than that of SkewGP0, and better with probability of about 0.76.
The difference between GPL and GPEP on the one hand, and SkewGP on the other, is that the posterior of SkewGP can be skewed. Therefore, we expect SkewGP to outperform GPL and GPEP on the datasets for which the posterior is far from normal (e.g., highly skewed). To verify this, we have computed the sample skewness statistic (SS) for each test input \(\mathbf {x}_i^*\):
\( SS(\mathbf {x}_i^*) = \frac{{\text {E}}\left[ (f(\mathbf {x}_i^*)-\mu )^3\right] }{{\text {E}}\left[ (f(\mathbf {x}_i^*)-\mu )^2\right] ^{3/2}}, \)
with \(\mu ={\text {E}}\left[ f(\mathbf {x}_i^*)\right] \), where the expectation \({\text {E}}[\cdot ]\) is approximated using the posterior samples drawn as in Theorem 2. Note that \(SS(\mathbf {x}_i^*)=0\) for symmetric distributions. Figure 10 (left) shows, for each of the 124 datasets, the difference between the average information score of SkewGP0 and GPEP on the y-axis, and \(\max _{\mathbf {x}_i^*} SS(\mathbf {x}_i^*)\) for SkewGP0 on the x-axis. We used a regression tree (green line) to detect structural changes in the mean of these data points. It is evident that, for large values of the maximum skewness statistic, SkewGP0 outperforms GPEP (the average difference is positive). Figure 10 (right) reports a similar plot for SkewGP2, where the difference is even more evident. This confirms that SkewGP on average outperforms GPEP on datasets where the posterior is skewed, and has similar performance otherwise.
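Given posterior draws of \(f(\mathbf {x}_i^*)\), the statistic can be estimated in a few lines (function name ours):

```python
import numpy as np

def sample_skewness(f_samples):
    """SS = E[(f - mu)^3] / E[(f - mu)^2]^(3/2), from posterior samples of f(x*)."""
    f = np.asarray(f_samples, float)
    mu = f.mean()
    m2 = ((f - mu) ** 2).mean()   # second central moment
    m3 = ((f - mu) ** 3).mean()   # third central moment
    return m3 / m2 ** 1.5
```

Exactly symmetric samples give a statistic of zero, while right-skewed samples (e.g., exponential draws) give a clearly positive value.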
7.2 Image classification
We have also considered an image classification task: the Fashion-MNIST dataset (each image is \(28\times 28 =784\) pixels and there are 10 classes: 0 T-shirt/top, 1 Trouser, 2 Pullover, 3 Dress, 4 Coat, 5 Sandal, 6 Shirt, 7 Sneaker, 8 Bag, 9 Ankle boot). We randomly pooled 10000 images from the dataset and divided them into two sets, with 5000 cases for training and 5000 for testing. For each of the 10 classes, we defined a binary classification subproblem by considering one class against all the other classes. We compared GPEP and SkewGP2, that is, a SkewGP with latent dimension \(s=2\) (for the same reason outlined in the previous section). We initialised \(r_i\) by taking 2 random samples from the training data. We also considered two different kernels: RBF and the Neural Network (NN) kernel (Williams 1998). Table 1 reports the accuracy for each of the 10 binary classification subproblems. For the RBF kernel, SkewGP2 outperforms GPEP in all subproblems. For the NN kernel, the differences between the two models are less substantial (due to the higher performance of the NN kernel on this dataset) but still in favor of SkewGP2. We have also reported, for both models, the computational time (Footnote 5), in minutes, needed to optimize the hyperparameters, compute the posterior, and compute the predictions for all instances in the test set. This shows that SkewGP2 is also faster than GPEP (Footnote 6). The last row reports the accuracy on the original multiclass classification problem obtained by using the one-vs-rest heuristic, with the only goal of showing that the more accurate estimate of the probability by SkewGP also leads to an increase in accuracy for one-vs-rest. A multiclass Laplace's approximation for GP classification was developed by Williams and Barber (1998), and other implementations are discussed, for instance, by Hernández-Lobato et al. (2011) and Chai (2012); we plan to address multiclass classification in future work.
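The one-vs-rest heuristic used for the multiclass row of Table 1 can be sketched as follows (the \((n_{\text{test}}, K)\) probability-matrix interface is an illustrative assumption, not the paper's code):

```python
import numpy as np

def one_vs_rest_predict(binary_probs):
    """Combine K binary 'class k vs rest' probabilities into multiclass labels.

    `binary_probs` has shape (n_test, K): entry [i, k] is the probability,
    predicted by the k-th binary classifier (GPEP or SkewGP here), that
    instance i belongs to class k rather than the rest. The heuristic
    simply assigns each instance to the class with the largest probability.
    """
    return np.argmax(np.asarray(binary_probs), axis=1)

# Toy example with 3 classes and 2 test instances.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.4]])
labels = one_vs_rest_predict(probs)
```

This is why a more accurate per-class probability estimate translates directly into higher multiclass accuracy: the argmax is taken over the estimated probabilities themselves.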
Our goal is to assess not only the accuracy but also the quality of the probabilistic predictions. Figure 11, plot (a1), shows, for the RBF kernel case and for each instance in the test set of the binary subproblem relative to class 8 versus rest, the value of the mean predicted probability of class rest for SkewGP2 (x-axis) and GPEP (y-axis). Each instance is represented as a blue point. The red points highlight the instances that were misclassified by GPEP. Plot (a2) shows the same plot, but the red points are now the instances that were misclassified by SkewGP2. By comparing (a1) and (a2), it is clear that SkewGP2 provides a higher quality of probabilistic predictions.
SkewGP2 also returns a better estimate of its own uncertainty, as shown in plots (b1) versus (b2). For each instance in the test set and for each sample from the posterior, we computed the predicted class (the class that has probability greater than 0.5). For each test instance, we then computed the standard deviation of all these predictions and used it to color the scatter plot of the mean predicted probability. In this way, we have a visual representation of first-order (mean predicted probability) and second-order (standard deviation of the predictions) uncertainty. Figure 11(b1) refers to GPEP and shows that GPEP's confidence is low only for the instances whose mean predicted probability is close to 0.5. This is not reflected in the value of the mean predicted probability for the misclassified instances (compare plots (a1) and (b1)). We have also computed the histogram of the standard deviation of the predictions for those instances that were misclassified by GPEP, shown in plot (c1). Note that the peak of the histogram corresponds to very low standard deviation, which means GPEP has misclassified instances that have low second-order uncertainty. This implies that the model is overestimating its confidence. Conversely, the second-order uncertainty of SkewGP2 is clearly consistent, see plots (a2) and (b2), and in particular the histogram in (c2): the peak corresponds to high values of the standard deviation of the predictions. In other words, SkewGP2 has mainly misclassified instances with high second-order uncertainty, which is what we expect from a calibrated probabilistic model. We report additional examples of the better calibration of SkewGP2 for the MNIST and German road-sign datasets in "Appendices 1 and 2".
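The first- and second-order uncertainty summaries described above can be sketched as follows (the \((n_{\text{samples}}, n_{\text{test}})\) array interface is an assumption for illustration; the paper draws the posterior samples as in Theorem 2):

```python
import numpy as np

def first_and_second_order_uncertainty(prob_samples):
    """Summarize posterior predictive samples of a binary classifier.

    `prob_samples` has shape (n_samples, n_test): each row holds the
    predicted class-1 probability at every test point for one posterior
    draw. Returns the mean predicted probability (first-order uncertainty)
    and the standard deviation of the 0/1 predicted classes across draws
    (second-order uncertainty), as used to color the scatter plots.
    """
    prob_samples = np.asarray(prob_samples, dtype=float)
    mean_prob = prob_samples.mean(axis=0)
    predicted_class = (prob_samples > 0.5).astype(float)
    class_std = predicted_class.std(axis=0)
    return mean_prob, class_std

# Two posterior draws, two test points: the second point straddles 0.5,
# so its predicted class varies across draws (high second-order uncertainty).
mean_prob, class_std = first_and_second_order_uncertainty(
    [[0.9, 0.4],
     [0.8, 0.6]])
```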
8 Conclusions
We have introduced the Skew Gaussian process (SkewGP) as an alternative to Gaussian processes for classification. We have shown that SkewGP and the probit likelihood are conjugate and provided marginals and closed form conditionals. We have also shown that SkewGP contains the GP as a special case and, therefore, SkewGPs inherit all good properties of GPs and increase their flexibility. The SkewGP prior was applied in classification showing improved performance over GPs (Laplace’s method and Expectation Propagation approximations).
As future work, we plan to study other, more natural ways to parametrize the skewness matrix \(\varDelta\) that do not rely on an underlying kernel. Moreover, we plan to investigate the possibility of using inducing points, as for sparse GPs, to reduce the computational load of the matrix operations (complexity \(O(n^3)\) with storage demands of \(O(n^2)\)), as well as to derive the posterior for the multiclass classification problem.
Notes
Note in fact that, for \(y=1/2\), the likelihood \(\varPhi ((2y-1)f(\varvec{x}^*))=\varPhi (0)=0.5\) does not depend on \(f(\varvec{x}^*)\), and so \(f(\varvec{x}^*)\) is marginalised out.
Sampling is performed according to (15) and, therefore, lin-ess and ess are applied to sample \(U_1\) from \(\mathcal {T}_{\tilde{\varvec{\gamma }}}(0;\tilde{\varGamma })\). To increase the probability of acceptance for ess, we replaced the indicator function that defines the truncation, \(I_{u_1>\tilde{\varvec{\gamma }}}\), with the logistic function \(\text {sigmoid}(80(u_1-\tilde{\varvec{\gamma }}))\). We verified that, using 5000 samples for burn-in, the first posterior moments obtained with lin-ess and ess are close for all considered values of \(n\).
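The smoothed truncation described in this note can be sketched as follows (the function name and interface are illustrative, not from the paper's code):

```python
import numpy as np

def smoothed_truncation_log_density(u1, gamma, scale=80.0):
    """Soft replacement for the truncation indicator I_{u1 > gamma}.

    As described in the note, the hard indicator is replaced by
    sigmoid(scale * (u1 - gamma)) to increase the acceptance probability
    of elliptical slice sampling; scale=80 matches the value in the text.
    The log of this soft indicator is what gets added to the Gaussian
    log-density targeted by ess.
    """
    z = scale * (np.asarray(u1, dtype=float) - gamma)
    # log(sigmoid(z)) computed stably as -log(1 + exp(-z))
    return -np.logaddexp(0.0, -z)

# Inside the truncation region the soft indicator is essentially 1
# (log value near 0); outside it decays steeply toward -infinity.
inside = smoothed_truncation_log_density(1.0, 0.0)
outside = smoothed_truncation_log_density(-1.0, 0.0)
```

Because the logistic is strictly positive everywhere, every ess proposal has non-zero density, which is what raises the acceptance probability relative to the hard indicator.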
The SkewGP classifier is implemented in Python, but we call Matlab to use existing implementations of both the simultaneous perturbation stochastic approximation algorithm and the randomized lattice routine. We plan to re-implement everything in Python and then release the source code for SkewGP.
It is well known that GPEP usually achieves a more calibrated estimate of the class probability. Laplace’s method gives overconservative predictive probabilities (Kuss and Rasmussen 2005).
More precisely, the table reports the average computational time for the RBF and NN kernel case.
This is due to both the efficiency of liness and the batch approximation of the marginal likelihood.
References
Arellano-Valle, R. B., & Azzalini, A. (2006). On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3), 561–574.
Azzalini, A. (2013). The skew-normal and related families (Vol. 3). Cambridge: Cambridge University Press.
Azzimonti, D., & Ginsbourger, D. (2018). Estimating orthant probabilities of high-dimensional Gaussian vectors with an application to set estimation. Journal of Computational and Graphical Statistics, 27(2), 255–267.
Bauer, M., van der Wilk, M., & Rasmussen, C.E. (2016). Understanding probabilistic sparse Gaussian process approximations. In Advances in neural information processing systems (pp 1533–1541).
Benavoli, A., Mangili, F., Corani, G., Zaffalon, M., & Ruggeri, F. (2014). A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 30th International Conference on Machine Learning (ICML 2014) (pp. 1–9).
Benavoli, A., Corani, G., Demsar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 18(77), 1–36.
Botev, Z. I. (2017). The normal law under linear restrictions: Simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1), 125–148.
Chai, K. M. A. (2012). Variational multinomial logit Gaussian process. Journal of Machine Learning Research, 13, 1745–1808.
Durante, D. (2019). Conjugate Bayes for probit regression via unified skew-normal distributions. Biometrika, 106(4), 765–779.
Fang, K. W. (2018). Symmetric multivariate and related distributions. Boca Raton: Chapman and Hall/CRC.
Genton, M. G., Keyes, D. E., & Turkiyyah, G. (2018). Hierarchical decompositions for the computation of high-dimensional multivariate normal probabilities. Journal of Computational and Graphical Statistics, 27(2), 268–277.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2), 141–149.
Genz, A., & Bretz, F. (2009). Computation of multivariate normal and t probabilities (Vol. 195). Berlin: Springer.
Gessner, A., Kanjilal, O., & Hennig, P. (2019). Integrals over Gaussians under linear domain constraints. arXiv preprint arXiv:1910.09328.
GPy (since 2012). GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy
Hernández-Lobato, D., Hernández-Lobato, J. M., & Dupont, P. (2011). Robust multi-class Gaussian process classification. In Advances in neural information processing systems (pp. 280–288).
Kuss, M., & Rasmussen, C. E. (2005). Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6, 1679–1704.
Murray, I., Adams, R., & MacKay, D. (2010). Elliptical slice sampling. In: Teh YW, Titterington M (eds) Proceedings of the thirteenth international conference on artificial intelligence and statistics, PMLR, Chia Laguna Resort, Sardinia, Italy, proceedings of machine learning research (Vol. 9, pp 541–548).
Nickisch, H., & Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9, 2035–2078.
O’Hagan, A. (1991). Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3), 245–260.
O’Hagan, A., & Leonard, T. (1976). Bayes estimation subject to uncertainty about parameter constraints. Biometrika, 63(1), 201–203.
Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J., & Moore, J. H. (2017). PMLB: A large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1), 36.
Orbanz, P. (2009). Construction of nonparametric Bayesian models from parametric Bayes equations. In: Advances in neural information processing systems (pp. 1392–1400).
Phinikettos, I., & Gandy, A. (2011). Fast computation of high-dimensional multivariate normal probabilities. Computational Statistics & Data Analysis, 55(4), 1521–1529.
Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939–1959.
Rasmussen, C. E., & Williams, C. K. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Snelson, E., & Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems (pp. 1257–1264).
Spall, J. C. (1998). Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3), 817–823.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In: van Dyk D, Welling M (eds) Proceedings of the twelfth international conference on artificial intelligence and statistics, PMLR, Hilton Clearwater Beach Resort, Clearwater Beach, FL, USA, proceedings of machine learning research (Vol. 5, pp. 567–574).
Williams, C. K. (1998). Computation with infinite neural networks. Neural Computation, 10(5), 1203–1216.
Williams, C. K., & Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342–1351.
Zhang, Z., Wu, G., & Chang, E. Y. (2007). Semiparametric regression using Student-\(t\) processes. IEEE Transactions on Neural Networks, 18(6), 1572–1588.
Acknowledgements
D. Azzimonti gratefully acknowledges support from the Swiss National Research Programme 75 “Big Data” Grant No. 407540_167199 / 1.
Funding
Open access funding provided by SUPSI - University of Applied Sciences and Arts of Southern Switzerland.
Additional information
Editors: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Proofs
Proposition 1 To prove this, we exploit Kolmogorov’s extension theorem (Orbanz 2009). Suppose that a family \(\mathcal {F}^I\) of probability measures are the \(I\)-finite-dimensional marginals of an infinite-dimensional measure \(\mathcal {F}\) (a “stochastic process”). Each measure \(\mathcal {F}^I\) belongs to the finite-dimensional subspace of dimension \(I\). Given two marginals \(\mathcal {F}^I,\mathcal {F}^J\), as marginals of the same measure \(\mathcal {F}\), they must be marginals of each other, that is
where \(\cdot \downarrow I\) denotes the projection onto the subspace of dimension \(I\). A family of probability measures that satisfies (17) is called a projective family. Kolmogorov’s extension theorem states that any projective family on the finite-dimensional subspaces of an infinite-dimensional product space uniquely defines a stochastic process on that space. This means that we can define a nonparametric Bayesian model from a finite-dimensional distribution by simply verifying that (17) holds. From (3), it then immediately follows that definition (5) uniquely defines a stochastic process and, therefore, SkewGP is a well-defined stochastic process.
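For concreteness, the projectivity condition (17) invoked in this proof takes the standard form (a sketch in the notation above, for finite index sets \(I \subseteq J\)):

```latex
\mathcal{F}^{J} \downarrow I \;=\; \mathcal{F}^{I},
\qquad \text{for all finite index sets } I \subseteq J .
```

That is, projecting the \(J\)-dimensional marginal onto the coordinates indexed by \(I\) recovers the \(I\)-dimensional marginal.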
Theorem 1 We aim to derive the posterior of f(X). The joint distribution of \(f(X),\mathcal {D}\) is
where we denoted \(\varvec{f}= f(X) \in \mathbb {R}^n\) and omitted the dependence on X. First, note that
Therefore, we can write
with
and
From (18)–(19) and the definition of the PDF of the SUN distribution (1), we can easily show that we can rewrite (18) as a SUN distribution with updated parameters:
Corollary 1 This follows directly from the above proof by observing that \(\varPhi _{s+n}(\tilde{\varvec{\gamma }},\tilde{\varGamma })\) is the normalization constant of the posterior and, therefore, the marginal likelihood is
Corollary 2 Denote with \(\hat{f}(\hat{X})=[f(X)^T,f({\varvec{x}^*})^T]^T\) and observe that the predictive distribution is by definition
with \(f^*:=f({\varvec{x}^*})\) and \(\varvec{f} = f(X)\). Note that we have omitted the dependence on \(\varvec{x}^*,X\) for ease of notation (\(p(f^*\mid \varvec{f})\) corresponds to \(p(f^*\mid \varvec{x}^*,X,\varvec{f})\)). We can write the posterior as
and so
Observe that
with \(\hat{W}=\text {diag}(2y_1-1,\dots ,2y_{n+1}-1)\) and \(y_{n+1}=0.5\). Note that \(2y_{n+1}-1=2 \cdot 0.5-1=0\), and this is the reason why we have introduced the dummy class value 1/2.
Observe that
is the marginal likelihood of a SkewGP posterior corresponding to the augmented dataset \(\hat{X}=[X^T,{\varvec{x}^*}^T]^T\), \(\hat{y}=[y^T,1/2]^T\). Therefore, we have that
where \(\tilde{\varvec{\gamma }}^*,\tilde{\varGamma }^*\) are the corresponding matrices of the posterior computed for the augmented dataset.
Proposition 2 This follows by the Fréchet inequality:
where \(A_i\) are events. In fact, note that
where Pr is computed w.r.t. the PDF of a multivariate distribution with zero mean and covariance \(\tilde{\varGamma }\).
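For reference, the Fréchet inequality invoked here is, in its standard lower-bound form for events \(A_1,\dots ,A_m\) (a sketch, consistent with the events \(A_i\) mentioned above):

```latex
\Pr\left(\bigcap_{i=1}^{m} A_i\right)
\;\ge\;
\max\left(0,\; \sum_{i=1}^{m} \Pr(A_i) - (m-1)\right).
```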
Theorem 2 The proof follows straightforwardly from that of Corollary 2.
Appendix 2: Additional image classification examples
We have defined a binary subproblem from the German Traffic Sign data by considering Speed-Limit 30 versus Speed-Limit 50, and from the MNIST digit dataset by considering 3 versus 5. We compared GPL versus SkewGP\(_2\) (both with RBF kernel). Table 2 reports the average accuracy and shows again that SkewGP\(_2\) outperforms GPL (GPEP achieves lower accuracy than GPL in both cases).
Our goal is to assess not only the accuracy but also the quality of the probabilistic predictions. Figure 12, plot (a1), shows, for each instance in the test set (one of the folds) of the MNIST dataset, the value of the mean predicted probability of class 5 for SkewGP\(_2\) (x-axis) and GPL (y-axis). Each instance is represented as a blue point. The mean predicted probability ranges in [0.41, 0.53] for GPL and in [0.25, 0.8] for SkewGP\(_2\). The red points highlight the instances that were misclassified by GPL (plot (a2) reports the images of some of the misclassified instances included in the rectangle). Plot (b1) shows the same plot, but the red points are now the instances that were misclassified by SkewGP\(_2\). By comparing (a) and (b), it is evident that SkewGP\(_2\) provides a higher quality of probabilistic predictions.
SkewGP\(_2\) also returns a better estimate of its own uncertainty. This is shown in plots (c1) and (c2). For each instance in the test set and for each sample from the posterior, we computed the predicted class (the class that has probability greater than 0.5). For each test instance, we then computed the standard deviation of all these predictions and used it to color the scatter plot of the mean predicted probability. In this way, we have a visual representation of first-order (mean predicted probability) and second-order (standard deviation of the predictions) uncertainty. Plot 12(c1) refers to GPL and shows that GPL's confidence is low only for the instances whose mean predicted probability is close to 0.5. This is not reflected in the value of the mean predicted probability for the misclassified instances (compare plots (a1) and (c1), and note that the red spot in (a1) is outside the yellow area in (c1)). Conversely, the second-order uncertainty of SkewGP\(_2\) is clearly consistent with plot (b1).
Figure 13 shows a similar plot for the “German road-sign” dataset. Plot 13(c1) refers to GPL and shows that GPL's confidence is low only for the instances whose mean predicted probability is in [0.3, 0.7]. This is not reflected in the value of the mean predicted probability for the misclassified instances (compare plots (a1) and (c1), and note that some of the red points in (a1) are outside the yellow area in (c1)). Conversely, plot 13(c2) shows that the second-order uncertainty of SkewGP\(_2\) is clearly consistent with plot (b1).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Benavoli, A., Azzimonti, D. & Piga, D. Skew Gaussian processes for classification. Mach Learn 109, 1877–1902 (2020). https://doi.org/10.1007/s10994020059063