Stochastic variational inference for scalable non-stationary Gaussian process regression

A natural extension to standard Gaussian process (GP) regression is the use of non-stationary Gaussian processes, an approach where the parameters of the covariance kernel are allowed to vary in time or space. The non-stationary GP is a flexible model that relaxes the strong prior assumption of standard GP regression, that the covariance properties of the inferred functions are constant across the input space. Non-stationary GPs typically model varying covariance kernel parameters as further lower-level GPs, thereby enabling sampling-based inference. However, due to the high computational costs and inherently sequential nature of MCMC sampling, these methods do not scale to large datasets. Here we develop a variational inference approach to fitting non-stationary GPs that combines sparse GP regression methods with a trajectory segmentation technique. Our method is scalable to large datasets containing potentially millions of data points. We demonstrate the effectiveness of our approach on both synthetic and real world datasets.


Introduction
Gaussian process models represent a non-parametric supervised learning approach frequently used in the machine learning community for both regression and classification purposes. Learning from data using Gaussian processes involves specifying a covariance structure for the process via a covariance kernel, inferring the parameters of the kernel, then calculating or sampling from the posterior distribution of the process conditional on observed data. Typically, a stationary GP is used so that the covariance kernel parameters depend only on the difference between data points i.e. they are invariant to translations in the input space (Rasmussen and Williams 2006 A disadvantage of stationary GPs is that they lack the flexibility to fit non-stationary data, where the characteristics of the function differs across the input domain (Paciorek and Schervish 2004;Gibbs 1997;MacKay 1997). For these types of data, a non-stationary GP, where a subset of the kernel parameters are allowed to vary, may be more appropriate. For example, in the context of tracking animal movement the error in position estimates may depend on the animal's location within a receiver array (Guzzo et al. 2018), meaning the observation noise parameter (or nugget term) of the covariance kernel will be spatially varying. Observed phenomena may also have varying smoothness or amplitude. This could be due to temporal drivers of the process which lead to periods of rapid change and high volatility (Blum and Riedmiller 2013), the nonlinear dynamics underlying the observed variables (Heinonen et al. 2015), or varying environmental characteristics such as differences in elevation and in the nature of the terrain (Lang et al. 2007).
To model data that presents characteristics with varying covariance properties, Paciorek (2003), Paciorek and Schervish (2006) and Heinonen et al. (2016) proposed a nonstationary GP, where all or a subset of the kernel covariance parameters are input-dependent and modelled by other GPs. The hierarchical structure presented by this model is intrinsi-cally linked with deep Gaussian Processes (DGP) (Damianou and Lawrence 2013;Salimbeni and Deisenroth 2017). DGP models are a multi-layer generalisation of a GP, where the prior is defined recursively on multiple stochastic functions (Damianou and Lawrence 2013;Salimbeni and Deisenroth 2017;Wang et al. 2016) and the outputs from one layer become the inputs of the next layer. This is in contrast to the hierarchical model introduced by Heinonen et al. (2016) where the outputs of the lower layers specify the kernel parameters of the final layer GP.
A further disadvantage of GPs is their high computational complexity, scaling cubically with the number of training points n, which makes standard GP regression impractical to implement when the datasets are large. To overcome this limitation, sparse GPs that make use of a set of m inducing points have been developed in the literature (Lázaro-Gredilla and Titsias 2011;Titsias 2009;Hensman et al. 2013;Snelson and Ghahramani 2006;Monterrubio-Gòmez and Wade 2021) leading to a reduction in computational complexity from O(n 3 ) to O(nm 2 ) or below.
Using a reduced set of inducing points (m n), popular variational inference methods (Hensman et al. 2013;Titsias 2009;Lázaro-Gredilla and Titsias 2011) construct an approximate posterior distribution to the true posterior by effectively introducing a reduced rank approximation to the full covariance matrix. The distance between the true posterior and the approximate posterior is then minimised by maximising a lower bound on the marginal log likelihood. This is equivalent to minimising the Kullback-Leibler (KL) divergence between the true posterior and the variational distribution.
In the approach proposed by Titsias (2009), the locations of the inducing inputs are treated as variational parameters that are selected, along with the kernel hyperparameters, either by applying continuous optimisation or by using a variational EM algorithm. By marginalising the inducing variables, Titsias (2009) derived a lower bound on the log likelihood of the training data that by-passed the need for an explicit representation of the variational distribution at the inducing point locations. While this approach reduced computational complexity to O(nm 2 ) and storage requirements to O(nm), Hensman et al. (2013) showed that by retaining an explicit representation of the variational distribution, sparse GP regression could be recast within the framework of stochastic variational inference (SVI) (Hoffman et al. 2013), leading to a computational complexity of O(m 3 ). The key insights of Hensman et al. (2013) were to treat the inducing variables as global variables, and to note that the conditional independence of data observations given the GP latent function meant data may be considered individually or in mini-batches.
In this work, we further develop the stochastic variational inference approach of Hensman et al. (2013) so that it may be applied to the non-stationary hierarchical GP model pro-posed in Heinonen et al. (2016). In our framework, we use a variational approximation to the posterior distribution of the lower layer GPs, then use SVI to minimise the KLdivergence between this distribution and the true posterior. As the hierarchical model we consider does not factorise over the data, we employ a trajectory segmentation technique (Rasmussen and Ghahramani 2002) (also termed a mixture of GP experts) to enable scalable inference, and Monte Carlo sampling to calculate expectations under the variational posterior (Salimbeni and Deisenroth 2017). We demonstrate that this novel combination leads to an efficient inference scheme that can accurately infer the dynamic covariance structures underlying observed phenomena.

Stationary GP regression
A GP is a a stochastic process formed by a set of random variables indexed by time or space, such that any finite subset of those random variables is a multivariate Gaussian distribution. A GP is therefore completely specified by its mean function and its covariance function.
In standard GP regression, we assume that we observe n outputs (data) y = y 1 , . . . , y n at the input (training) points , and f is a latent (unobserved) function. We set a prior on the latent function f such that where in a finite dimensional representation, and k x, x is a kernel function that defines a n × n covariance matrix K over the training points. Typically, the covariance function depends on the hyperparameters lengthscale and amplitude 1 and is stationary, i.e. k( Moreover, the mean is typically set at 0 (Murphy 2012, Section 15.2). GPs are a non-parametric class of methods in which a posterior over the possible function values f at x is obtained that is consistent with the observed data. The aim is often to infer a distribution over the function values at x given the data, p(f|x, y), and then to use this to make predictions at a new set of input (test) points x * , given the training points x (Murphy 2012, Section 15.1), i.e. to compute p y * |x * , x, y = p(y * |f, x * ) p(f|x, y)df.
To select the parameters that define the covariance matrix, the log likelihood of the data can be calculated and the vector f marginalised out, where we have omitted the explicit dependence on the training input locations x. Since the probability distributions under the integrals in Eqs. 2 and 3 are multivariate Gaussian distributions, the integrals are tractable. Hence, posterior distributions and the log marginal likelihood are obtained in closed form. Hierarchical non-stationary GPs are GPs where a subset of the parameters are input-dependent. Assuming that the lengthscale, the amplitude and the mean of the GP are inputdependent functions, as it is the case with the model presented in this paper, the log marginal likelihood of the data in a nonstationary GP model has the following form where in a finite dimensional representation, GP priors are placed on the latent functions l, σ and μ such that where l = l(x), σ = σ (x), μ = μ(x) are the input-dependent lengthscale, amplitude and the mean vectors respectively. Then, the hierarchical prior is given by the following where p(f|l, σ , μ) = N (f; μ, K). While Eq. 4 is intractable it is possible to marginalise over the latent function f and construct an objective function combining the data likelihood and the GP priors placed on the latent functions l, σ and μ. This was used by Heinonen et al. (2016) to obtain maximum-a-posteriori (MAP) estimates of the latent function values and to sample from their posterior distributions with Hamiltonian Monte Carlo (HMC) sampling. This inference framework was replicated by Torney et al. (2021) to learn time-dependent animal movement parameters with a periodical covariance structure. Markov chain Monte Carlo (MCMC) methods are a powerful class of methods conceptually capable of sampling from the posterior distribution of hierarchical models under certain regularity conditions (ergodicity), then the sample of points converges in distribution to the true posterior distribution. However, the computational complexity is demanding when the datasets are large or when complex models with large numbers of parameters display poor mixing and slow convergence rates. The computational complexity of MCMC methods, together with the high-dimensionality of hierarchical GPs are difficult obstacles in the path of inferring the parameters of interest for large datasets. Alternatives to sampling-based inference methods such as MCMC are variational methods that formulate inference as an optimisation problem. In a GP context, they have been used within the framework of sparse GPs (Hensman et al. 2013).

Sparse GP regression
In sparse GP regression, we begin as in the standard case with n observations y at n locations x and a GP prior on the latent function f specified by Eq. 1. We then consider a set of m inducing points z i , i = 1, . . . , m, which reside in the same space as x, with an associated set of function values given by u = f (z). In order to derive a lower bound on the log marginal likelihood log p(y), we assume that u is a sufficient summary statistic for f (Titsias 2009) where f and f * denote the corresponding vector values at the training locations x, and at arbitrary test locations x * respectively. As u are function points drawn from the same GP prior as f we have where μ (u) is the mean of the GP prior at the inducing points z and K (u) is a m ×m covariance matrix defined by the kernel k (u) over the inducing points. Given Eq. 9 and making use of the fact that y is a noisy version of f, the posterior distribution of u and f * is given by (Titsias 2009) To perform variational inference, we introduce a variational distribution φ(u) such that the true posterior is approximated by, where φ(u) is a Gaussian distribution with mean μ (q) and covariance K (q) , i.e.
Thus, from Eqs. 11 and 12 the approximation to the posterior at the training locations is given by the following relationship Therefore, we have that a lower bound on the log marginal likelihood can be found as, where q(f) = q(f, u)du, and we have applied Jensen's inequality. KL denotes the Kullback-Leibler divergence between distributions and in this case is the divergence between the prior distribution p(u), and the variational distribution φ(u). All of the details of the derivation of the lower bound on the marginal likelihood are explained in the "Appendix", Section 2. While the integral in Eq. 15 is in general intractable, for Gaussian likelihoods, p(y|f) factorises across the data, i.e.
where n is the number of data points. Thus, in this scenario a n-dimensional integral can be rewritten as n one-dimensional integrals and Eq. 15 becomes, where q( f i ) is the marginal of q(f). Given a Gaussian likelihood along with a Gaussian form for the variational distribution φ(u) both the integral and the divergence term are tractable, enabling optimisation of the lower bound with respect to variational parameters and kernel hyperparameters via stochastic gradient descent (Hensman et al. 2013). The sparse variational method can be extended to multiple latent functions, thus defining a chained or a multilatent GP (Saul et al. 2016). Focusing on the case when the likelihood depends on two latent functions f and g, Saul et al. (2016) showed that a similar lower bound could be obtained on the log marginal likelihood, where u ( f ) , u (g) are vectors of the latent function values evaluated at the inducing points z and it is assumed that the vectors are a priori independent, i.e.
As in the single latent function scenario, a variational approximation to the posterior is introduced such that and a lower bound on the marginal log likelihood can be found as (Saul et al. 2016), where q(f) = p(f|u ( f ) )φ(u ( f ) )du ( f ) and q(g) = p (g|u (g) )φ(u (g) )du (g) . Comparing Eqs. 4 and 18, it can be seen that the hierarchical non-stationary GP is analogous to the multilatent GP. While the model developed by Heinonen et al. (2016) has the likelihood function depend on one latent function f , which in turn depends on two latent functions l and g modelled by GPs, the model described by Saul et al. (2016) has the likelihood function depend directly on two independent latent functions f and g modelled by GPs.
While it is possible to formulate the non-stationary GP within the framework of multilatent GPs, the difficulty lies in calculating the integral term in Eq. 21, which corresponds to the expected log likelihood with respect to the marginal variational distribution of the latent function. Calculating this expectation is potentially tractable if the likelihood factorises over the data, as in the scenario of a single latent function, i.e. if This condition is satisfied in the framework described in Saul et al. (2016), but does not hold in the non-stationary model where the latent functions determine the covariance structure of the higher level GP. Methods such as Gauss-Hermite quadrature (Hensman et al. 2015) or Monte Carlo sampling (Ranganath et al. 2014;Salimbeni and Deisenroth 2017;Bonilla et al. 2018) can generally be used to calculate the expectation. While quadrature increases the computational complexity of the model, the approximation can be made exact by increasing the quadrature nodes depending on the functional form of the likelihood. Monte Carlo sampling represents a model agnostic approach to calculating the variational objective (Ranganath et al. 2014) and has been applied to deep Gaussian processes (Salimbeni and Deisenroth 2017). It is this approach we follow in this paper as detailed below.
Our goal is therefore to combine the hierarchical nonstationary GP model developed by Heinonen et al. (2016) with the variational inference framework for multiple latent functions proposed by Saul et al. (2016). As the likelihood in a non-stationary GP model does not factorise over the data, we employ a trajectory segmentation technique (Rasmussen and Ghahramani 2002) that enables us to decompose the expectation term across data segments and employ stochastic optimisation (Hensman et al. 2013) to maximise a lower bound on the log likelihood with respect to the variational parameters.

Model formulation
In a one-dimensional setting, we assume a regression model, where y i is the observation at a random time point t i and i ∼ N 0, ω 2 . We then place a GP prior on the latent function f such that where t i , t j are random time points, μ(t) is the mean of the GP, and the kernel k ( f ) is a valid non-stationary kernel, such as the Matérn 1/2 (Paciorek and Schervish 2004), or a non-stationary version of the Radial Basis function (RBF) kernel (Heinonen et al. 2016), or other valid non-stationary kernels (Rasmussen and Williams 2006), Chapter 4. Eq. 23 is referred as the first layer of the hierarchical non-stationary GP.
Through the kernel k ( f ) , which is dependent on the latent functions lengthscale and amplitude and through the mean function, the chain between the two layers of the GP is constructed, as separate GP priors are set on the three afore-mentioned latent functions. Moreover, it is straightforward to extend the method to include heteroscedasticity in the final layer GP by setting a GP prior on the observation noise variance w 2 (Heinonen et al. 2016). This is referred as the second layer of the GP. The hierarchical structure of the model is illustrated in Fig. 1.
Mathematically, the latent function f (t) models the structure of the residuals after subtracting μ(t) from the data. This gives us extra modelling flexibility. In our applications, μ(t) is modelled with a stationary GP, so f (t) can focus on modelling the non-stationary deviations. Moreover, μ(t) can capture a smooth regular pattern, like a periodic background signal, while f (t) can be a rough process to model short-term fluctuations.
In a multivariate setting, the random time points t i get replaced by the random vectors x i .

Kernels specification
In Eq. 23 on the first layer of the GP we employ a Matérn 1/2 non-stationary kernel (Paciorek and Schervish 2004). The kernel evaluated at random times t i and t j in one-dimension is given by the following relationship where d i j = (t i − t j ) 2 , σ i , l i are the amplitude, lengthscale parameters respectively at the time point t i . 2 We set separate GP priors on all the three latent functions where η ∈ {μ, l, σ } andη ∈ {μ,l,σ }.
In order to ensure positivity of the latent functions (just the lengthscale and amplitude, as the mean function of a GP can be negative) we use a softplus transformation 3 such that we have η (t) ≡ log 1 + exp μ (η) +η(t) .
When performing inference using the one-dimensional synthetic data, shown in Sect. 5.1, the chosen kernel for each latent function is RBF, given by where η ∈ {μ, l, σ }, α (η) is the amplitude, β (η) is the lengthscale and μ (η) is the mean of the lower-level GPs. For higher dimensional data, the difference t i − t j is replaced by an appropriate norm ||x i − x j ||, where x i , x j are random pairs of input points. When performing inference using the empirical data, shown in Sect. 5.2, the kernel for all of the three latent functions is a periodical kernel, namely Exponential Sine Squared, given by where η ∈ {μ, l, σ }, α (η) is the amplitude, β (η) is the lengthscale, and p is the period (in our empirical data applications a period of 24 h is chosen). To ensure positivity, we again use a softplus transformation only for the lengthscale and amplitude functions.

Log probability density of the model
The total log probability density of the hierarchical GP model is calculated by summing the log probability densities over the two layers of the model i.e. the log likelihood of the data on the first layer and the log probability densities of the GP priors on the second layer, where K ( f ) , K (l) , K (σ ) and K (μ) are the covariance matrices defined by the kernels k ( f ) , k (l) , k (σ ) , k (μ) respectively. If the mean function of the GP is constant, then the last term disappears.

Variational inference for non-stationary GPs
We extend the variational inference equations from a simple GP to a double-layer hierarchical GP model, where the lengthscale, amplitude and the mean function of the GP are all modelled by another GP. In the limiting case where the mean function is not modelled by another GP, 4 it is a straightforward process to adapt the following procedure to include a constant mean function. Therefore, at the inducing points z, the corresponding vectors of the latent functions (similar to the finite dimensional representation of Eq. 1) are l (z) , σ (z) and μ (z) . Predictions at points x * correspond in the latent function space to the vectors of the latent functions l * , σ * , and μ * .

Prior distributions of the model
We list a few assumptions needed for the derivation of the variational lower bound to the true posterior distribution. Firstly, we assume that the vectors of the latent functions l, σ , and μ are a priori independent. Secondly, we assume that the prior distributions p(l (z) ), p(σ (z) ) and p(μ (z) ) are multivariate Normal distributions, i.e.
To derive the variational lower bound to the true posterior distribution we further assume that the the following relationship holds (32)

Posterior distribution of the augmented model
Using the previous assumptions, the augmented posterior distribution of the vectors of the latent functions at the training points and inducing points is where p(l (z) , σ (z) , μ (z) |y) is the joint posterior distributions at the inducing points and p(l|l (z) ), p(σ |σ (z) ), p(μ|μ (z) ) can be found in closed form by using conditional probabilities of Gaussian distributions.

Variational approximation and lower bound
To perform variational inference, we introduce a variational approximate distribution φ to the augmented posterior distribution, Fig. 1 This figure shows the structure of the hierarchical Bayesian model proposed in this paper for the one-dimensional setting, where we assume that the lengthscale, amplitude, and the mean functions of the GP are also modelled by a GP. On the first layer of the hierarchical GP, the latent functions f and μ depend on time t. On the lower level of the hierarchical GP, the latent functions μ, l and σ are defined at the inducing points z. Conditional Gaussian probabilities are used to calculate the values of μ, l and σ at t given z. The hyperparameters of the kernels have been indicated in the brackets after the kernel k. The circle nodes denote variables and the rectangle node denotes fixed values or observations where we take the variational distribution at the inducing points to be of the following form The marginal log likelihood has the following formula Using Eq. 32 we get that the marginal log likelihood is Furthermore, to perform variational inference we proceed as in Eq. 15 to obtain a lower bound on the marginal log likelihood using Jensen's inequality such that where the variational distribution is a Gaussian distribution given standard properties of Gaussian integrals.
To make our method scalable to large datasets, we use a trajectory segmentation technique such that the data and the input domain are split into L independent segments (local domains) y i of equal length and different local GPs are fitted on each domain (Snelson and Ghahramani 2007;Das and Srivastava 2010;Park and Apley 2017). However, inside a segment, the observations y i j , given l i j , σ i j and μ i j are dependent, since we work within a non-stationary GP framework. The gradients and the log likelihood of each segment can be added up to get a good approximation to the full gradients and to the log likelihood provided that the length of the domain (segment length) of the local GP is sufficiently large compared to the lengthscale of the GP being fitted. Having the number of points per segment at least two orders of magnitude larger than the correlation length of the local GP is sufficient to get a good approximation, since the density of the data points is fixed and reducing/increasing the number of points per segment reduces/increases the length of a segment. Thus, given that the likelihood factorises over segments we can use stochastic variational inference (Hensman et al. 2013).
The approximate likelihood factorises across L segments and has the following form where y i denotes the segment i of observations, and l i , σ i , μ i denote the segment i of the vectors of the latent functions. Then, the integral in Eq. 38 can also be factorised such that we have Thus, Eq. 38 becomes In a general case, the integral (the expected log likelihood) in Eq. 42 is intractable. The Gauss-Hermite quadrature method performs poorly in a non-stationary GP model (Monterrubio-Gòmez et al. 2019; Monterrubio-Gòmez and Wade 2021). Also, in our case, the multi-dimensional integral in Eq. 42 cannot be computed using the Gauss-Hermite approach, since the observations y i inside a segment i, are not independent given the vectors l i , σ i , and μ i . Thus, the multi-dimensional integral cannot be split into multiple one-dimensional integrals given the non-stationary GP framework that we employ. The Monte Carlo sampling method is used in Salimbeni and Deisenroth (2017); Saul et al. (2016) to calculate the expected log likelihood and this is the approach that we are following here. More specifically, given Eq. 35, we draw samples l i j , σ i j and μ i j from the multivariate Normal distribution q(l i j , σ i j , μ i j ). Then, we calculate the log likelihood of a trajectory segment log p(y i |l i j , σ i j , μ i j ) per each sample we draw, where y i is the segment i of observations and l i j , σ i j , μ i j are the j'th sample of the corresponding vector of the latent functions for segment i. We then proceed to take the average over all the samples drawn in order to calculate the expected log likelihood term in Eq. 42. Moreover, in Eq. 42, we have closed form expressions for the KL divergence terms, as φ(l (z) , σ (z) , μ (z) ), together with p(l (z) ), p(σ (z) ) and p(μ (z) ) are multivariate Normal distributions, as shown in Eqs. 29-31 and Eq. 35. Thus, once the lower bound is calculated, it can be maximised using stochastic optimisation in order to determine the optimal parameters of the joint variational distributions φ(l (z) , σ (z) , μ (z) ) and other parameters of interest such as the lengthscale, amplitude vectors l, σ and the mean vector μ (when applicable) on the first layer of the GP. A pseudo-algorithm of the variational inference method is presented in Algorithm 1.

Lower bound on the predictive distribution
The lower bound on the predictive log probability density is computed using Eq. 42, where the training data y is replaced by the out-of-sample test data y test and the latent functions computed at the training set are replaced by the predicted latent functions at the test set.

Model inference
We have implemented our variational framework inference in TensorFlow, an open-source deep learning library (Abadi et al. 2016), using the package TensorFlow Probability. Using the TensorFlow library has multiple advantages including access to automatic differentiation for an efficient and easy method to calculate the gradients, without the specification of analytical formulae. It also facilitates access to GPUaccelerated calculations.
To process large amounts of data, we use a trajectory segmentation technique, where we break the individual trajectories into multiple and computationally more accessible segments. We use one set of inducing points z, shared by each vector of a latent function, l, σ and μ. Using the inducing points z we define the vector values of a latent function for the lower-level Gaussian GPs. To obtain the values of the lengthscale, the amplitude and of the mean vector on the first layer of the GP, we calculate the vector of a latent function values at the observed data points, given the inducing points locations.
When performing inference using the one-dimensional synthetic data, the inducing points are chosen as evenly spaced across the domain (about 50 points in 1 day). For the two-dimensional synthetic data, the inducing points are a grid of equally distanced locations (625 points) situated within the domain.
The empirical datasets (the electricity consumption data) described in Sect. 4.2 have a known periodic structure, given by the diurnal cycle. This prior information is naturally incorporated into the model via a periodic kernel in Eq. 27. To paraphrase this: If sparse inducing points are thinly spread out over an observation period of several years, it will be extremely challenging for the model to learn the periodic structure. For that reason, we assume the periodic structure a priori and use the inducing points to cover the diurnal 24-h cycle. Thus, the inducing points (about 50 points) are chosen over the course of 1 day.
When performing inference using the synthetic dataset generated with a constant mean, the number of points per segment, described in Sect. 3.4 is 1024. For all the other applications, where the mean is flexible, the number of points per segment is 500 due to memory limitations of the hardware we used. Hence, the length of the local domain in all cases is sufficiently larger than the correlation length of the local GPs, which is inferred in Sect. 5. Moreover, the number of samples used to calculate the expectation term in Eq. 42 is 700 for the two-dimensional dataset, and 50 for all the one-dimensional datasets, either synthetic or empirical.
Then, we proceed to maximise the lower bound using stochastic optimisation in order to infer all the parameters of interest mentioned at the end of Sect. 3.4 together with all of the hyperparameters of the second layer of the GP -[α (η) ] 2 , β (η) and the mean μ (η) . In addition, for the twodimensional dataset the noise variance parameter is also optimised. The inducing points can potentially be inferred too, but in this paper they are kept fixed as the number of inducing points is sufficiently dense for the domain.
One of the disadvantages of the variational inference approach is that it relies on the optimisation of a lower bound to the log likelihood of the true model. This function may be multimodal, and a greedy optimisation may therefore only converge to a local optimum, with the consequence that the optimisation may be heavily dependent on the initialisation of the parameters. In this case, a good approach is the one implemented by Heinonen et al. (2016), where the optimi- Fig. 2 Synthetic data generated with the constant mean set at 0: a the observed synthetic data for 1 individual over a day (time is in minutes). b, c True lengthscale function, and the true amplitude function respectively that generated the dataset shown in (a) Fig. 3 Synthetic data generated with a flexible mean: a the observed synthetic data for 1 individual over a day (the time is in minutes). b-d True lengthscale function, the true amplitude function and the true mean function respectively that generated the dataset shown in (a) sation is performed from multiple initialisations, taken e.g. from a space filling design, and the model with the highest maximum log likelihood is selected. This procedure avoids the dependence of the results on specific initial parameter values, albeit at higher computational costs.

Synthetic data generation
We generate two one-dimensional synthetic datasets and one two-dimensional synthetic dataset from a non-stationary GP model. The first one-dimensional synthetic dataset is generated with a constant mean set at 0, and the second one-dimensional synthetic dataset and the two-dimensional dataset are generated with a flexible mean.
The covariance kernel is given by the non-stationary Mátern 1/2 kernel defined in Eq. 24 with constant observation error. We generate l and σ and μ (when applicable) by taking a sample from a GP prior with a RBF kernel, such that there is no mismatch between our inference model and the synthetic data.
For the one-dimensional datasets, we simulate from our model trajectories of 1, 8, 128 individuals 5 and collect about 8,200 observations per individual over the course of 1 day. For the two-dimensional dataset, we simulate from our model observations subject to observation noise recorded at a grid of locations (1 millions observations in total from 100 individuals recorded at 10,000 locations). The values of the latent functions are used to generate data from a GP by using Eq. 23 with the non-stationary Mátern 1/2 kernel (Paciorek and Schervish 2004). In Fig. 2, we show the first observed synthetic dataset when the mean is constant and set at 0 for 1 individual. In Fig. 3, we show the second observed synthetic dataset generated with a flexible mean for 1 individual and in Fig. 4 we show the two-dimensional observed synthetic dataset.

Empirical data
We apply our method to two empirical datasets. The first empirical dataset consists of 3 years (2016-2018) of power consumption readings (about 1 million observations in kilowatts, after disregarding the missing observations) recorded every single minute in an individual household in Sceaux (7 km of Paris, France) (Dua and Graff 2017). The second empirical dataset consists of power consumption observations (about 52,000) recorded every 10 min in 2 different distribution networks of Tetouan city, located in north Morocco (Salam and Hibaoui 2018). To visualise the empirical datasets, the mean and the one standard deviation plots of the log power consumption levels are shown in Fig. 5. The histograms of the raw power consumption observations are shown in the "Appendix", Fig. 13. Since the data is skewed in Fig. 13, a log transformation has been applied to the data.

Results
We fit our non-stationary GP and we apply the proposed variational inference method. We optimise the lower bound derived in Eq. 42 using the Adam optimiser (Kingma and Ba 2017).

Synthetic model inference results
We begin with an analysis of the inference results for the first one-dimensional synthetic dataset, which was generated with a constant mean (fixed at 0). Moreover, in our first study we consider the limiting case when the mean function of the GP model is also constant, fixed at the true value of 0 (this will be relaxed in our subsequent studies). We focus on the inference of the lengthscale and amplitude functions.
The results are shown in Fig. 6, for datasets of different size, consisting of simulated trajectories for 1, 8, and 128 individuals respectively. The figure shows the inferred posterior means along with the corresponding posterior credible intervals (for 80%, 95%, and 99% respectively) and compares them with the ground-truth values. As more data is added (increasing from top to bottom), the uncertainty decreases for both functions, as expected. There is close agreement between the inferred posterior means of the functions and the true values, and almost all of the deviations from the ground truth lie within the 99% credible interval.
In Fig. 7 we show the histograms and the probability density functions (pdfs) for the difference and standardised difference between the true and the predicted mean of each latent function. To produce the plots we simulated 5 replicate datasets consisting of observations from 1 individual, 8 individuals, and 128 individuals respectively, inferred each latent function and computed the differences and the standardised differences (by dividing the differences by the corresponding standard deviation of the inferred latent functions) between the true function and the inferred mean for each latent function. We performed this for each dataset and then concatenated the results from each replicate dataset. For example, the differences from 5 replicate datasets, all consisting of observations from 1 individual are concatenated. Then, we approximated the pdfs with a kernel density estimator, using a Gaussian kernel and selecting the bandwith with cross-validation. If the inference is successful then the pdfs and histograms should be similar to a standard Gaussian distribution. We explain why that is the case in more detail below.
Assume the posterior distribution for a quantity of interestm with true value m 0 is N (m, std 2 ). If this estimate is consistent, then the quantity converges in distribution to N (0, 1) when repeating the estimation repeatedly on different independent data instantiations. This consistency is what we aim to test. In our case, the marginal variational posterior distribution of the quantities of interest shown in the argument on the left of Eq. 39 is normal (as seen from the same equation), and we denote it (where t denotes time, that is the input argument in Figs. 5 and 7,8,9,10). Due to the computational costs we only repeated the estimation for 5 independent data instantiations, and we boosted the number of instances by treating different time points as separate instances. The functions m(t) at sep- arate time instances t are not independent, as seen from the figures referenced above. This lack of independence is not a problem per se, but we need to allow for the fact that due to this temporal autocorrelation the effective sample size is lower than the actual sample size. As a simple estimate, for a lengthscale l and a total domain size T we get an effective sample size of where the factor 5 stems from the 5 independent data instantiations. With T = 1440 and l = 60 this gives us an effective sample size of N e f f ≈ 120. The pdfs and the histograms of the differences, shown in Fig. 7a-c (of the lengthscale) and g-i (of the amplitude), get increasingly peaked around 0 as the size of the training data increases, from 1 to 8 and then 128 individuals. This is as expected and indicates that the accuracy of parameter estimation increases with increasing dataset size. Fig. 7d-f, j-l show kernel density estimates of the pdfs of and histograms the standardised differences (of the lengthscale in panels (df) and the of the amplitude in panel (j-l)). There is, overall, good agreement with a standard Normal distribution, which is superimposed as a dashed black line. Small discrepancies are to be expected, which reflects both the stochastic nature of the data and the approximations inherent in kernel density estimation. The heavier tails of the pdfs are due to the fact that variational inference is a biased method that tends to underestimate the posterior uncertainty of the parameters. This has been reported repeatedly in the literature; see e.g. Blei et al. (2017) and Lawrence and Rattray (2010), Chapter 5. This implies that std is smaller than expected. Since std appears in the denominator of Eq. 43 above, this further implies that our distributions should be slightly over-dispersed, with heavier tails than a standard Normal distribution.
Next, we analyse the inference results for the same dataset (the first synthetic dataset generated with a constant mean fixed at 0), but fitting the full hierarchical GP model with a flexible mean function, that is, the mean function of the high-level GP is modelled by a separate low-level GP. The results are shown in Fig. 8. There is overall good agreement between the inferred posterior means of the latent functions and the true values, and most of the deviations from the true latent function values lie within the 99% credible interval. An exception is shown in Fig. 8k, l. On the natural scale (Fig. 8k), matching the variations in the amplitude shown in Fig. 8j, the difference between the true and inferred mean functions is hardly noticeable. The scale needs to be magnified by two orders of magnitude to notice that the true mean does not lie within the 99% credible interval (Fig. 8l). This may again reflect the stochastic nature of the data. To establish whether there is a systematic bias, the simulations would have to be repeated over a population of data instantiations at high computational costs. Since the observed offset is two orders of magnitude below the natural scale, though, this has not been further pursued.
We analyse the inference results for the second synthetic dataset generated with a flexible mean function. We have fitted our full hierarchical GP model, with a flexible mean function modelled by a separate low-level GP. The results are shown in Fig. 9. The posterior means of the latent function are very similar to the true values, and almost all of the deviations lie within the 99% credible intervals.
Finally, we analyse the inference results for the twodimensional dataset generated with a flexible mean function. The inferred means, the standard deviations of the posterior samples and the true values of the latent functions are shown in Fig. 10. There is close agreement between the true values of the latent functions and the mean inferred values, and the deviations from the true values are inside the 99% credible interval (z-score for the 99% credible interval is 2.576).

Empirical data inference results
The inference results when the empirical dataset considered is the power consumption recorded every minute in an individual household in the city of Sceaux are shown in Fig. 11. The power consumption persistence illustrated in Fig. 11a measures how likely the power consumption is to remain constant over a period of time. The power consumption variance illustrated in Fig. 11b shows the variation in power expenditure, while Fig. 11c illustrates the power consumption mean (on a log scale). The mean power consumption levels are plausible and compatible with the log mean power consumption readings recorded in Fig. 5a. The mean energy usage is low during the night (12AM-5AM),  g-i] show the pdfs (bandwith selected by cross-validation) and histograms (green bins, chosen with default size) of the difference between the true and the predicted mean lengthscale and predicted mean amplitude respectively for datasets consisting of observations from [1,8,128] individuals respectively. [d-f] and [j-l] show the pdfs and his-tograms (green bins) of the standardised difference between the true and the mean predicted lengthscale and mean predicted amplitude respectively for datasets consisting of observations from [1,8,128] individuals respectively together with the pdf of a standard Normal distribution (black dashed line). (Color figure online) after which it increases to medium power consumption levels during the day only to reach peak power consumption levels around 10PM. The results are consistent with high mean power consumption persistence values and low mean power consumption variance during the night time hours (12AM-5AM) as energy consumption levels are expected to be fairly stable during the night. Medium mean power consumption persistence and relatively high mean variance values during the day are expected, as there is more uncertainty in the mean power consumption levels during this time frame. High mean power persistence and low mean variance values occur around 10PM are plausible, when the mean power consumption levels are constantly high for some time, before dropping again to the levels recorded at the beginning of the night time hours.
The inference results when the empirical datasets considered consist of power consumption observations in 2 different distribution networks of Tetouan city are illustrated in Fig. 12. The mean power consumption levels illustrated in Fig. 12c, f are plausible and in agreement with the mean log power consumption for the two distribution networks illustrated in Fig. 5b, c. Since the results are similar between the different distribution networks, we analyse only the inference results for the distribution network 3, illustrated in Fig. 12d-f. In Fig. 12f, the mean power consumption levels are low during the 8 h (12AM-7AM), after which they increase fairly steadily to reach the peak levels around 9PM. The results are in agreement with the high power consumption persistence/low power variance values during the night time hours (12AM-4AM), as the mean power consumption levels are slowly decreasing during that time period. After 4AM until 7PM, the power consumption persistence is fairly low and this is in agreement with the fairly high, but unsteady power consumption variance and with the generally increasingly levels in the mean power consumption. After 9PM, the mean power consumption drops to the levels reached during the night, and the mean power persistence is increasing with relatively low mean variance values, as expected since the power expenditure is fairly constant during the night time hours. Moreover, the peaks in the power consumption vari-ance that occur around 7AM and 5PM are consistent with the troughs in the power consumption at the same hours, as this indicates a change in the mean power consumption levels.

Empirical data model comparison
For the empirical data, we compare the fit of the nonstationary hierarchical GP model against a 'standard' GP model. Our 'standard' GP model uses the trajectory segmentation technique and has a flexible mean in order to be competitive, but like a textbook standard GP model, the lengthscale and amplitude parameters are not inputdependent, but constant. In order to perform a sensible comparative performance analysis, and since standard GPs cannot accommodate large datasets, we modify the nonstationary hierarchical GP by modelling the lengthscale and amplitude as constant adjustable parameters. To paraphrase this: the lengthscale and amplitude of the higher-level GP are not modelled as outputs of lower-level GPs, but become scalar parameters, thus our model resembles a standard GP. To obtain reliable predictions for the mean function, we still model it by another GP with a periodic Exponential Sine, shown in Eq. 27 kernel, as the set of inducing points are set over the course of 1 day, compared to the observed data, which is comprised of power consumption readings over at least a year (the trajectory segmentation technique and the location of the inducing points are discussed in Sects. 3.4 and 3.5, respectively). However, in this case the GP model is equivalent to a single-layer standard GP, with a constant mean and the kernel given by the sum of a stationary Matérn Inference using the empirical dataset for the Sceaux household: a the mean power consumption persistence, b the mean power consumption variance power usage and c the mean of the power con-sumption mean as blue lines in a day (time is in hours). The blue regions (from dark to light) represent the 80%, 95%, and 99% credible intervals respectively. (Color figure online) 1/2 kernel and a periodic Exponential Sine kernel. 6 Thus, we preserve the original set of inducing points, the trajectory segmentation technique and variational inference framework allowing for the inference of large datasets and for making valid predictions at the test points. To compare the fit of the models, we split our data into a training set (about 80% of the original dataset size) and a test set (about 20% of the original dataset size). We train the models on the training set as described in Sect. 3.5, evaluate it on the test set, and then compare the lower bounds on the predictive log probability densities for each model. The lower bound on the 6 We prove this in the "Appendix", Section 3. predictive log probability density is computed as described in Section 3.4.
We present the results in Table 1. For each dataset, the model with the highest lower bound on the predictive log probability density is the hierarchical non-stationary GP model followed by the standard GP. Thus, while the analysis is based only on a comparison between the lower bounds on the predictive log probability densities of the two models, the results indicate that the hierarchical GP, where all the parameters are modelled by a GP is a better model to the data than the standard GP.

Conclusions
In the present paper we have presented a variational Bayesian inference method within a hierarchical non-stationary GP framework. We have combined the flexibility and robustness of hierarchical non-stationary GP models with the computationally efficient variational inference scheme proposed by Hensman et al. (2013). We have successfully applied our method to a variety of synthetic and empirical datasets, achieving reliable and intuitively plausible results. Our method has significant advantages over related methods from in the literature. The variational inference scheme is computationally more efficient than the HMC method of Heinonen et al. (2016), which in practice does not scale up to the data set sizes considered in our work. Unlike the maximum a posterior (MAP) method of Heinonen et al. (2016), which only produces point estimates, our method properly quantifies the posterior estimation uncertainty. Moreover, the scalability of our method is further improved by a trajectory segmentation technique.
The variational inference method presented in our paper has been adapted from Hensman et al. (2013), and we have extended this framework from a standard vanilla GP, as in Hensman et al. (2013), to a two-layer hierarchical GP. As we have illustrated in our paper, this leads to a much more flexible and powerful method for dealing with complex nonstationary processes.
However, unlike in an MCMC scenario, in which the MCMC chain asymptotically converges to the true posterior distribution, the variational inference framework consists in optimising the parameters of a distribution that is approximate to the true posterior distribution. One of the disadvantages of the variational inference is that it has been shown that it tends to under-estimate the posterior uncertainty of the parameters (Blei et al. 2017;Lawrence and Rattray 2010), Chapter 5. Moreover, the optimisation of the approximate distribution faces the standard challenges of iterative multimodal optimisation (e.g. entrapment in local optima).
Various diagnostics have been proposed in the literature to assess whether the variational distribution has converged.
The majority of them focus on monitoring the log likelihood of the lower bound over time (Chee and Toulis 2017). This approach has been used to assess convergence of the variational distribution in Paun et al. (2022), where this variational method has been applied to identify the key pathways of the wildebeest migration in the Serengeti national park. More specifically, training was halted if the log likelihood of the lower bound did not increase over 5 epochs (one epoch corresponded to one iteration through the entire dataset).
The quality of the variational representation can be quantified by using the Pareto smoothed importance sampling (PSIS) diagnostic (Yao et al. 2018), which was introduced by Vehtari et al. (2017) to smooth the importance weights when performing importance sampling using a Pareto distribution. An advantage of using this diagnostic is that it quantifies the uncertainty between the approximate and the true distribution by the estimatek. When this estimate is larger than a prespecified threshold it indicates that the approximate distribution is not close to the true posterior distribution (Yao et al. 2018).
As the collection of fine-scale data in a variety of application domains is increasing, fast and robust inference with sound uncertainty quantification in complex flexible models is increasingly becoming indispensable. We therefore see potential applications of our work in a wide range of disciplines, from modelling collective animal movement in diverse landscapes to modelling terrain surfaces with different characteristics such as smoothness or ruggedness or to modelling the volatile structure of interest rates (Swanson and Xiong 2018). Fig. 13 Empirical raw data: a the raw histograms for the power consumption observations (in kilowatts) in the Sceaux household. b, c the raw histograms for the power consumption (in kilowatts) for the zones 3, and 1 respectively of the Tetouan city, respectively

Derivation for a stationary GP with a flexible mean
As in Sect. 3.1, we assume a regression model in onedimension, y i = f (t i ) + i , where y i is the observation at a random time point t i and i ∼ N 0, ω 2 . We set a GP prior on the latent function f such that where t i and t j are random time points and μ(t) is the mean of the GP. We assume that the lengthscale and amplitude of the GP are constant, but the mean function of the GP is flexible and modelled by a separate GP, i.e.

Empirical raw data
In Fig. 13, we illustrate the raw histograms for the power consumption observations for the empirical datasets. Since the data is skewed, we have applied a log transformation when performing inference.