As we mentioned earlier, analysing time-course microarray data by means of Gaussian process (GP) regression is not a new idea (cf. Background). In this section we review the methodology for estimating the continuous trajectory of a gene's expression by GP regression, and subsequently describe a likelihood-ratio approach to ranking a profile's differential expression. The following content is based on the key components of GP theory as described in [21, 22].
The Gaussian process model
The idea is to treat trajectory estimation given the observations (gene expression time-series) as an interpolation problem on functions of one dimension. By assuming the observations have Gaussian-distributed noise, the computations for prediction become tractable and involve only standard linear algebra.
A finite parametric model
We begin the derivation of the GP regression model by defining a standard linear regression model (a more concrete example of such a model is for ϕ = (1, x, x²)⊤, i.e. a line mapped to a quadratic curve)
where gene expression measurements in time y = {y_n}_{n=1..N} are contaminated with white Gaussian noise and the inputs (of time) are mapped to a feature space Φ = {ϕ(x_n)⊤}_{n=1..N}. Furthermore, if we assume the noise to be i.i.d. (independent and identically distributed) as a Gaussian with zero mean and variance σ_n², then the probability density of the observations given the inputs and parameters (the data likelihood) is Gaussian-distributed, with mean Φw and covariance σ_n²I.
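Collected in one line, the model just described takes the standard linear-Gaussian form (a sketch following [21, ch. 2], in the notation introduced above):
\[
\mathbf{y} = \Phi\mathbf{w} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\, \sigma_n^2 I), \qquad p(\mathbf{y}\mid\Phi,\mathbf{w}) = \mathcal{N}(\Phi\mathbf{w},\, \sigma_n^2 I).
\]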
Introducing Bayesian methodology
Now turning to Bayesian linear regression, we wish to encode our initial beliefs about the parameters w by specifying a zero mean, isotropic Gaussian distribution as a prior over the parameters
By integrating the product of the likelihood and the prior with respect to the parameters, we obtain the marginal likelihood
which is jointly Gaussian. Hence the mean and covariance of the marginal are
By computing the marginal likelihood in eq. (8), we can compare or rank different models without fear of overfitting the data or having to explicitly apply a regulariser to the likelihood; the marginal likelihood implicitly penalises overly complex models [21, sec. 5.4].
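Concretely, with an isotropic prior w ~ 𝒩(0, αI) (the symbol α for the prior variance is our choice of notation), integrating out the parameters gives the standard result
\[
p(\mathbf{y}\mid\mathbf{x}) = \int p(\mathbf{y}\mid\Phi,\mathbf{w})\, p(\mathbf{w})\, d\mathbf{w} = \mathcal{N}(\mathbf{y}\mid\mathbf{0},\, K_y), \qquad K_y = \alpha\,\Phi\Phi^\top + \sigma_n^2 I,
\]
so the marginal has zero mean and a covariance K_y whose structure is inherited from the feature space Φ.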
Notice in eq. (7) how the structure of the covariance implies that choosing a different feature space Φ results in a different K_y. Whatever K_y is, it must satisfy the following requirements to be a valid covariance matrix of the GP:
- Kolmogorov consistency, which is satisfied when K_ij = K(x_i, x_j) for some covariance function K, such that all possible K are positive semidefinite (y⊤Ky ≥ 0; a short numerical check follows this list).
- Exchangeability, which is satisfied when the data are i.i.d. It means that the order in which the data become available has no impact on the marginal distribution, hence there is no need to hold out data from the training set for validation purposes (for measuring generalisation errors, etc.).
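As a quick numerical illustration of the positive-semidefiniteness requirement, one can inspect the smallest eigenvalue of a covariance matrix built from a covariance function (the SE covariance analysed later in this section is used here, and the sampling times are hypothetical):

  # A covariance matrix built from a valid covariance function should have no
  # negative eigenvalues (up to numerical round-off).
  se_cov <- function(x1, x2, l2 = 25, sf2 = 1) sf2 * exp(-(x1 - x2)^2 / (2 * l2))
  x <- c(0, 1, 2, 4, 8, 12, 24)         # hypothetical sampling times (hours)
  K <- outer(x, x, se_cov)              # K_ij = K(x_i, x_j)
  min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # should be >= 0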
Definition of a Gaussian process
More formally, a Gaussian process is a stochastic process (or collection of random variables) over a feature space, such that the distribution p(y(x_1), y(x_2), ..., y(x_n)) of a function y(x), for any finite set of points {x_1, x_2, ..., x_n} mapped to that space, is Gaussian, and such that any of these Gaussian distributions is Kolmogorov consistent.
If we remove the noise term σ_n²I from K_y in eq. (7) we can make noiseless predictions of f(x) rather than y(x) = f(x) + ε. However, when dealing with finite parameter spaces K_f may be ill-conditioned (cf. the SE derivation section), so the noise term guarantees that K_y will have full rank (and an inverse). Having said that, we now formulate the GP prior over the latent function values f by rewriting eq. (8) as the GP prior of eq. (9), in terms of a mean function (usually defined as the zero function) and a covariance function.
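In standard form (following [21, ch. 2]; the symbols m and k for the mean and covariance functions are our shorthand), the prior and its defining functions read
\[
p(\mathbf{f}\mid\mathbf{x}) = \mathcal{N}(\mathbf{f}\mid\mathbf{0},\, K_f), \qquad m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x)-m(x))(f(x')-m(x'))\big],
\]
with (K_f)_ij = k(x_i, x_j).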
The squared-exponential kernel
In this paper we only use the univariate version of the squared-exponential (SE) kernel. But before embarking on its analysis, the reader should be aware of the wide variety of existing kernel families, and of the ways they can be combined. A comprehensive review of the literature on covariance functions is found in [21, chap. 4].
Derivation and interpretation of the SE kernel
In the GP definition section we mentioned the possibility of an ill-conditioned covariance matrix. In the case of a finite parametric model (as in eq. (1)), K_f can have at most as many non-zero eigenvalues as there are parameters in the model; hence, for any problem with more datapoints than parameters, the matrix is non-invertible. Ensuring K_f is not ill-conditioned involves adding the diagonal noise term to the covariance. In an infinite-dimensional feature space one would not have to worry about this issue, as the features are integrated out and the covariance between datapoints is no longer expressed in terms of the features but through a covariance function. As demonstrated in [22, sec. 45.3] and [21, sec. 4.2.1] with an example of a one-dimensional dataset, we express the covariance matrix K_f in terms of the features Φ in eq. (12); then, by considering a feature space defined by radial basis functions and integrating with respect to their centres h, eq. (12) becomes an SE-type covariance function.
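In outline, the integration over the centres gives (with r denoting the width of the radial basis functions; this symbol is ours)
\[
k(x_i, x_j) \propto \int \exp\!\left(-\frac{(x_i-h)^2}{2r^2}\right)\exp\!\left(-\frac{(x_j-h)^2}{2r^2}\right) dh = \sqrt{\pi r^2}\,\exp\!\left(-\frac{(x_i-x_j)^2}{4r^2}\right),
\]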
where one ends up with a smooth (infinitely differentiable) function on an infinite-dimensional space of (radial basis function) features. Taking the constant out front as a signal variance σ_f² and squaring the exponential gives rise to the standard form of the univariate squared-exponential (SE) covariance function. The noisy univariate SE kernel -- the one used in this paper -- is given in eq. (14).
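In the standard notation of [21, ch. 4] it takes the form
\[
k_y(x_i, x_j) = \sigma_f^2 \exp\!\left(-\frac{(x_i-x_j)^2}{2\ell^2}\right) + \sigma_n^2\,\delta_{ij}.
\]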
The SE is a stationary kernel, i.e. it is a function of d = x_i − x_j, which makes it translation invariant in time. δ_ij is the Kronecker delta, which is unity when i = j and zero otherwise, and ℓ² is the characteristic lengthscale, which specifies the distance beyond which any two inputs (x_i, x_j) become uncorrelated. In effect, the lengthscale ℓ² governs how much f varies along the input (time): a small lengthscale means that f varies rapidly along time, and a very large lengthscale means that f behaves almost as a constant function, see Figure 5. This parameterisation of the SE kernel becomes very powerful when combined with hyperparameter adaptation, as described in a following section. Other adapted hyperparameters include the signal variance σ_f², which is a vertical scale of function variation, and the noise variance σ_n² (introduced in eq. (2)), which is not a hyperparameter of the SE itself but, unless we consider it a constant in the noisy case, its adaptation can give different explanations of the latent function that generates the data.
One can also combine covariance functions, as long as the result remains positive semidefinite. Examples of valid combinations include the sum and the convolution of two covariance functions. In fact, eq. (14) is the sum of the SE kernel and the covariance function of isotropic Gaussian noise.
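A small sketch of this kernel in R (illustrative only; the released MATLAB and R implementations referenced in the Source code section are the authoritative versions, and the sampling times below are made up):

  # Noisy univariate SE covariance; theta = (l2, sf2, sn2), i.e.
  # (lengthscale^2, signal variance, noise variance), matching the text.
  se_kernel <- function(x1, x2, theta) {
    l2 <- theta[1]; sf2 <- theta[2]; sn2 <- theta[3]
    K <- sf2 * exp(-outer(x1, x2, "-")^2 / (2 * l2))    # SE part
    if (length(x1) == length(x2) && all(x1 == x2))
      K <- K + sn2 * diag(length(x1))                   # Kronecker-delta noise term
    K
  }

  # Example: covariance matrix over hypothetical sampling times (hours).
  x   <- c(0, 1, 2, 4, 8, 12, 24)
  K_y <- se_kernel(x, x, theta = c(25, 1, 0.1))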
Gaussian process prediction
To interpolate the trajectory of gene expression at non-sampled time-points, as illustrated in Figure 5, we infer a function value f* at a new input (non-sampled time-point) x*, given the function estimates f at the known time-points x. The joint distribution p(f*, f) is Gaussian, hence the conditional distribution p(f*|f) is also Gaussian. In this section we consider predictions from noisy observations; since the noise is also Gaussian, conditioning on the noisy observations leaves the form of the distribution unchanged. By Bayes' rule
where the Gaussian process prior over the noisy observations is
Predictive equations for GP regression
We start by defining the mean function and the covariance between a new time-point x* and each of the known time-points x_i, i = 1..N.
For every new time-point, a new vector k* is concatenated as an additional row and column to the covariance matrix K_C to give eq. (19), where C = N..N* is incremented with every new k* added to K_C. By considering a zero mean function and eq. (19), the joint distribution p(f*, y) from eq. (15) can be computed.
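A standard arrangement of this joint (following [21, ch. 2]; the block notation is ours) is
\[
\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K_y & \mathbf{k}_* \\ \mathbf{k}_*^\top & k(x_*, x_*) \end{bmatrix}\right), \qquad K_y = K_f + \sigma_n^2 I, \quad (\mathbf{k}_*)_i = k(x_*, x_i).
\]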
Finally, to derive the predictive mean and covariance of the posterior distribution from eq. (15) we use the Gaussian identities presented in [21, sec. A.2]. These are the predictive equations for GP regression at a single new time-point, where K_f = K_f(x, x). These equations can easily be generalised to predict function values at multiple new time-points by augmenting k* with more columns and k(x*, x*) with more components, one for each new time-point x*.
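In standard form these are the posterior mean k*⊤K_y⁻¹y and variance k(x*, x*) − k*⊤K_y⁻¹k* (following [21, ch. 2]). The following self-contained R sketch implements them; the expression values and hyperparameters are made up for illustration:

  # GP predictive mean and variance at new time-points xs, given noisy
  # observations y at times x; theta = (l2, sf2, sn2).
  se_kernel <- function(x1, x2, theta)
    theta[2] * exp(-outer(x1, x2, "-")^2 / (2 * theta[1]))

  gp_predict <- function(x, y, xs, theta) {
    Ky  <- se_kernel(x, x, theta) + theta[3] * diag(length(x))
    Ks  <- se_kernel(xs, x, theta)                    # cross-covariances k_*
    Kss <- se_kernel(xs, xs, theta)
    mu  <- Ks %*% solve(Ky, y)                        # k_*^T K_y^{-1} y
    V   <- Kss - Ks %*% solve(Ky, t(Ks))              # k(x_*,x_*) - k_*^T K_y^{-1} k_*
    list(mean = drop(mu), var = diag(V))
  }

  # Hypothetical usage: interpolate a 7-point profile on a fine grid.
  x   <- c(0, 1, 2, 4, 8, 12, 24)
  y   <- c(0.1, 0.8, 1.3, 1.1, 0.4, 0.2, 0.0)
  fit <- gp_predict(x, y, xs = seq(0, 24, by = 0.25), theta = c(25, 1, 0.05))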
Hyperparameter learning
Given the SE covariance function, one can learn the hyperparameters from the data by optimising the log-marginal likelihood function of the GP. In general, a non-parametric model such as the GP can employ a variety of kernel families whose hyperparameters can be adapted to the underlying intensity and frequency of the local signal structure, and interpolate it in a probabilistic fashion (i.e. while quantifying the uncertainty of prediction). The SE kernel allows one to give intuitive interpretations of the adapted hyperparameters, especially for one-dimensional data such as a gene expression time-series; see Figure 5 for interpretations of various local optima.
Optimising the marginal likelihood
In the context of GP models the marginal likelihood results from the marginalisation over function values f
where the GP prior p(f|x) is given in eq. (9) and the likelihood is a factorised Gaussian, p(y|f) = N(f, σ_n²I). The integral can be evaluated analytically [21, sec. A.2] to give the log-marginal likelihood (LML; it is common practice to work with the log of the marginal likelihood for the sake of numerical stability, as it yields a sum instead of a product)
We notice that the marginal here is explicitly conditioned on θ (the hyperparameters) to emphasise that it is a function of the hyperparameters through K_f. To optimise the marginal likelihood we take the partial derivatives of the LML with respect to the hyperparameters
We use scaled conjugate gradients [32] -- a standard optimisation scheme -- to maximise the LML.
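For reference, the standard expressions (cf. [21, sec. 5.4.1]), written here with K_y = K_f + σ_n²I so that the noise variance is included among the hyperparameters, are
\[
\log p(\mathbf{y}\mid\mathbf{x},\theta) = -\tfrac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} - \tfrac{1}{2}\log|K_y| - \tfrac{N}{2}\log 2\pi, \qquad \frac{\partial}{\partial\theta_j}\log p(\mathbf{y}\mid\mathbf{x},\theta) = \tfrac{1}{2}\operatorname{tr}\!\left(\big(K_y^{-1}\mathbf{y}\mathbf{y}^\top K_y^{-1} - K_y^{-1}\big)\frac{\partial K_y}{\partial\theta_j}\right).
\]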
Ranking with likelihood-ratios
Alternatively, one may choose to go "fully Bayesian" by placing a hyper-prior p(θ|ℳ) over the hyperparameters, where ℳ represents some type of model, and compute a posterior over the hyperparameters based on some initial beliefs (such as the functions having large lengthscales), optimising the marginal likelihood so that the optimum lengthscale tends to a large value unless there is evidence to the contrary. Depending on the model ℳ, the integral in eq. (27) may be analytically intractable and thus one has to resort to approximating this quantity [33] (e.g. with a Laplace approximation) or to Markov chain Monte Carlo (MCMC) methods to sample from the posterior distribution [34].
In the case where one is using different types of models (e.g. with different numbers of hyperparameters), a standard Bayesian way of comparing two such models is through Bayes factors [11, 19, 23] -- ratios of the integral quantities in eq. (27) -- where the two models usually represent two different hypotheses: that the profile has a significant underlying signal and thus is truly differentially expressed, or that there is no underlying signal in the profile and the observed gene expression is just the effect of random noise. The ranking is based on how likely the differential-expression hypothesis is in comparison to the noise-only hypothesis, given a profile.
In this paper we present a much simpler, yet effective, approach to ranking the differential expression of a profile. Instead of integrating out the hyperparameters, we approximate the Bayes factor with a log-ratio of marginal likelihoods (cf. eq. (25)), with each LML being a function of a different instantiation of θ. We still maintain two hypotheses that represent the same notions explained above, but in our case they differ simply by their configurations of θ. Specifically, under the noise-only hypothesis the hyperparameters are fixed to θ1 = (∞, 0; var(y))⊤ to encode a function constant in time (ℓ² → ∞), with no underlying signal (σ_f² = 0), which generates a time-series whose variance can be solely explained by noise (σ_n² = var(y)). Analogously, under the differential-expression hypothesis the hyperparameters θ2 are initialised to encode a function that fluctuates in accordance with a typical significant profile (e.g. ℓ² = 20), with a distinct signal variance that solely explains the observed time-series variance (σ_f² = var(y)) and with no noise (σ_n² ≈ 0).
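The following R sketch illustrates the resulting score for a single mean-centred profile; the optimiser (optim with L-BFGS-B rather than scaled conjugate gradients), the bounds and the exact starting values are illustrative assumptions, not the released implementation:

  # Log-marginal likelihood of a zero-mean GP with the noisy SE kernel;
  # theta = (l2, sf2, sn2), matching the ordering of theta1 in the text.
  se_kernel <- function(x1, x2, theta)
    theta[2] * exp(-outer(x1, x2, "-")^2 / (2 * theta[1]))

  log_marginal <- function(theta, x, y) {
    Ky <- se_kernel(x, x, theta) + theta[3] * diag(length(x))
    -0.5 * drop(t(y) %*% solve(Ky, y)) -
      0.5 * as.numeric(determinant(Ky, logarithm = TRUE)$modulus) -
      0.5 * length(y) * log(2 * pi)
  }

  rank_score <- function(x, y) {                       # assumes var(y) > 0
    # Noise-only hypothesis: sf2 = 0, sn2 = var(y); with sf2 = 0 the
    # lengthscale is irrelevant, so a large value stands in for infinity.
    lml_noise <- log_marginal(c(1e6, 0, var(y)), x, y)
    # Differential-expression hypothesis: initialise at
    # (l2 = 20, sf2 = var(y), sn2 ~ 0) and let the LML adapt theta2.
    fit <- optim(c(20, var(y), 1e-3 * var(y)),
                 function(th) -log_marginal(th, x, y),
                 method = "L-BFGS-B", lower = rep(1e-9, 3))
    (-fit$value) - lml_noise                           # log-likelihood ratio
  }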
Local optima of the log-marginal likelihood (LML) function
These two configurations correspond to two points of the three-dimensional LML function, both of which usually lie close to local-optimum solutions. This assumption can be verified empirically by exhaustively plotting the LML function for a number of profiles, see Figure 5. In case the LML contour differs for some profiles, more initialisation points should be used to ensure convergence to the maximum-likelihood solution. Because the configuration of the differential-expression hypothesis (no noise, σ_n² ≈ 0) is an extremely unlikely scenario, we let θ2 adapt to a given profile by optimising the LML function, as opposed to keeping it fixed like θ1.
In most cases the LML (eq. (25)) is not convex. Multiple optima do not necessarily pose a threat here; depending on the data, and as long as they have similar function values, multiple optima present alternative interpretations of the observations. To alleviate the problem of spurious local-optimum solutions, however, we make the following observation: when we explicitly restrict the signal variance hyperparameter (σ_f²) to low values during optimisation, we also implicitly restrict the noise variance hyperparameter (σ_n²) to large values. This occurs because the explanation of the observed data variance (var(y)) is shared between the signal and noise variance hyperparameters, i.e. σ_f² + σ_n² ≈ var(y). This dependency allows us to treat the three-dimensional optimisation problem as a two-dimensional one, over the lengthscale ℓ² and the signal-to-noise ratio (SNR), without fear of missing an optimum.
Figure 5 illustrates the marginal likelihood as a function of the characteristic lengthscale ℓ² and the SNR. It features two local optima: one at a small lengthscale and a high SNR, where the observed data are explained by a relatively complex function with a small noise variance, and one at a large lengthscale and a low SNR, where the data are explained by a simpler function with a high noise variance. We also notice that the first optimum has a lower LML. This relates to the algebraic structure of the LML (eq. (25)): the first term (a dot product) promotes data fit and the second term (a determinant) penalises the complexity of the model [21, sec. 5.4]. Overall, the LML function of the Gaussian process offers a good fit-complexity trade-off without the need for additional regularisation. Optionally, one can use multiple initialisation points focusing on different non-infinite lengthscales to deal with the multiple local optima along the lengthscale axis, and pick the best solution (maximum LML) to represent the differential-expression hypothesis in the likelihood-ratio during the ranking stage.
Source code
The source code for the GP regression framework is available as MATLAB code at http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/gp/ and as a package for the R statistical computing language at http://cran.r-project.org/web/packages/gptk/. The routines for the estimation and ranking of the gene expression time-series are available upon request for both languages. The time needed to analyse the 22690 profiles in the experimental data, with only the two basic hyperparameter initialisation points, is about 30 minutes on a desktop running Ubuntu 10.04 with a dual-core CPU at 2.8 GHz and 3.2 GiB of memory.