Correlated product of experts for sparse Gaussian process regression

Gaussian processes (GPs) are an important tool in machine learning and statistics. However, off-the-shelf GP inference procedures are limited to datasets with several thousand data points because of their cubic computational complexity. For this reason, many sparse GP techniques have been developed over the past years. In this paper, we focus on GP regression tasks and propose a new approach based on aggregating predictions from several local and correlated experts. Thereby, the degree of correlation between the experts can vary from independent up to fully correlated experts. The individual predictions of the experts are aggregated taking into account their correlation, resulting in consistent uncertainty estimates. Our method recovers independent Product of Experts, sparse GP and full GP in the limiting cases. The presented framework can deal with a general kernel function and multiple variables, and has a time and space complexity which is linear in the number of experts and data samples, which makes our approach highly scalable. We demonstrate superior performance, in a time vs. accuracy sense, of our proposed method against state-of-the-art GP approximations for synthetic as well as several real-world datasets, with deterministic and stochastic optimization. Supplementary Information The online version contains supplementary material available at 10.1007/s10994-022-06297-3.

Using the modified objectives L̃(θ) = L(θ) − λ(θ) and l̃_j(θ) = l_j(θ) − λ_j(θ) in the deterministic and stochastic case, respectively, generalizes the CPoE method such that for C → J we recover the mentioned global methods in Table A2. Thereby, we used D_j = K_{X_j X_j} − H_j K_{A_{ψ(j)} A_{ψ(j)}} H_j^T, which is the difference of the true and the locally approximated covariance. The setting of VFE [5] is particularly interesting, since in the global case it constitutes a direct posterior approximation derived via variational maximization of a lower bound of the log marginal likelihood. Moving a bit away from the true marginal likelihood of full GP has the effect that overfitting (w.r.t. full GP) cannot happen when optimizing the hyperparameters with the lower bound. This is particularly important when all inducing inputs are optimized, as usually recommended for sparse global methods; this is not the case for our model, since it allows a number of inducing points in the order of the number of data samples. In the adapted 'local VFE' CPoE model, using V_j = 0 and additionally minimizing λ_j = tr{D_j} has the effect that the model is locally variationally optimal. However, it would be interesting to directly derive a lower bound analogously to [5], so that the posterior of our CPoE model is rigorously connected to full GP. Since this is not a straightforward extension, we postpone this task to future work. Below, we present the connection to full GP for this adapted model in the joint prior sense, analogously to Prop. 6 for the local FITC model.

Table 1: Overview of notation.

γ ∈ (0, 1]: sparsity parameter
α ∈ (0, J]: approximation quality parameter
y_i ∈ R: individual data output
y_j ∈ R^B: data output of expert j
y_{k:j} ∈ R^{B(j−k+1)}: data output of experts k up to j
y_* ∈ R: pointwise noisy prediction output
y ∈ R^N: all data outputs
x_i ∈ R^D: individual data input
X_j ∈ R^{B×D}: data input of expert j
X_{k:j} ∈ R^{B(j−k+1)×D}: data input of experts k up to j
X ∈ R^{N×D}: all data inputs
x_* ∈ R^D: query input for prediction
f_j ∈ R^B: latent function outputs of expert j
f_{k:j} ∈ R^{B(j−k+1)}: latent function outputs of experts k up to j
f ∈ R^N: all latent function outputs
f_* ∈ R: pointwise (latent) prediction output
f(X_j) ∈ R^B: GP evaluation for input matrix X_j
a_j ∈ R^L: local inducing outputs of expert j
a_{k:j} ∈ R^{L(j−k+1)}: local inducing outputs of experts k up to j
a ∈ R^M: all local inducing outputs
A_j ∈ R^{L×D}: local inducing inputs of expert j
A_{k:j} ∈ R^{L(j−k+1)×D}: local inducing inputs of experts k up to j
A ∈ R^{M×D}: all local inducing inputs
I_j ∈ {1, . . . , C − 1}: number of predecessors of expert j
ϕ_i(j) ∈ {1, . . . , j − 1}: ith predecessor of expert j
π(j) ⊆ {1, . . . , j − 1}, |π(j)| = I_j: predecessor index set
π⁺(j) ⊆ {1, . . . , j}, |π⁺(j)| = I_j + 1: predecessor index set including j
ψ(j) ⊆ {1, . . . , max(j, C)}, |ψ(j)| = C: correlation index set
σ_n² ∈ R⁺: observation noise variance
θ ∈ R^{|θ|}: kernel hyperparameters including σ_n²
k_θ(x, x′) ∈ R: kernel evaluation for two query points
K_{AB} ∈ R^{M_a×M_b}: kernel matrix of two query matrices
p(z) ∈ R: evaluation of (true) probability density
q(z) ∈ R: evaluation of approximated probability density
S ∈ R^{M×M}: prior precision matrix
T ∈ R^{M×M}: projection precision matrix
J₂ ∈ N⁺: number of prediction experts
β̃*_j ∈ R⁺: unnormalized predictive weight of expert j at x_*
β*_j ∈ R⁺: normalized predictive weight of expert j at x_*
m*_j ∈ R: predictive mean of expert j at x_*
v*_j ∈ R⁺: predictive variance of expert j at x_*
D^{(C,C₂)} ∈ R⁺: difference in KL between two approximate models

Table A2: Generalizations of CPoE model.
Proposition 7 (Local VFE) Using a deterministic projection q(f_j|a_{ψ(j)}) = N(f_j | H_j a_{ψ(j)}, V_j) in the graphical model in Def. 4 and Prop. 1, that is, setting the covariance V_j = 0 in the projection step, recovers global VFE for C → J. Moreover, the difference in KL to full GP of the joint prior is also decreasing: for 1 ≤ C ≤ C₂ ≤ J, the difference in KL of the prior of the local VFE model, the difference in KL of the projection, and the overall prior approximation quality decrease analogously to Prop. 6. Compared to the FITC model, the difference consists of a trace term instead of the ratio of the log-determinants.

A.2 Solving Linear System & Partial Inversion
For solving the sparse linear system Σ^{-1}µ = η in Prop. 3, a sparse Cholesky decomposition is exploited. That is, M Σ^{-1} M^T = LL^T =: Y is computed, so that ν and the permuted solution µ̃ can be efficiently obtained by solving Lν = Mη and L^T µ̃ = ν, respectively, where M is a so-called fill-reducing permutation matrix chosen such that the Cholesky factor L is as sparse as possible, and thus µ = M^{-1}µ̃ = M^T µ̃. Note that M is computed only from the structure on the block level, which is only J-dimensional instead of JL. In addition to the mean µ, also some entries Σ_{ψ(j)} of the covariance matrix Σ have to be explicitly computed, which are needed for the local predictions used in Section 3.4.
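As a concrete illustration, the following minimal sketch performs this solve with CHOLMOD via scikit-sparse, the library mentioned in Section A.6; the small matrix and right-hand side are hypothetical stand-ins for Σ^{-1} and η, not taken from the actual package.

```python
import numpy as np
import scipy.sparse as sp
from sksparse.cholmod import cholesky  # scikit-sparse wrapper around CHOLMOD

# Hypothetical small SPD precision matrix standing in for Sigma^{-1},
# and a vector standing in for eta.
Sigma_inv = sp.csc_matrix(np.array([[4., 1., 0.],
                                    [1., 3., 1.],
                                    [0., 1., 2.]]))
eta = np.array([1., 0., 2.])

factor = cholesky(Sigma_inv)   # fill-reducing permutation chosen internally
mu = factor(eta)               # two triangular solves: L nu = M eta, L^T mu~ = nu
L, P = factor.L(), factor.P()  # permuted Cholesky factor and permutation vector

assert np.allclose(Sigma_inv @ mu, eta)
```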
For computing the local posteriors q_{c,γ}(a_{ψ(j)}|y) = N(µ_{ψ(j)}, Σ_{ψ(j)}), we need to compute the entries Σ_{ψ(j)} of the covariance matrix Σ which correspond to the non-zeros in the precision matrix Σ^{-1}. Efficiently computing these entries is not straightforward, since in the inverse the blocks are no longer independent. However, we can exploit the particular sparsity and block structure of our precision matrix and obtain an efficient implementation of this part, which is key to achieving competitive performance of our algorithm. Computing some entries of Z = Y^{-1} is also known as partial inversion. We adapted the approach in [38], where recursive equations over the J blocks for computing the full inverse Z are provided. Instead of computing the full inverse using this recursion, we exploit the block-sparsity structure of our posterior precision matrix in order to gain a significant speed-up. We only compute the entries of the inverse Z which are symbolically non-zero in L. In Algorithm 1 in the Appendix, we provide efficient pseudo-code using sparse block matrices in the block-sparse-row format. Alternatively, for computing the Cholesky factor of Σ^{-1} = S + H^T V^{-1} H, we could directly exploit that the prior precision S = F^T Q^{-1} F is already decomposed into an upper/lower-triangular form, since F is lower triangular. However, updating this Cholesky factor with H^T V^{-1} H needs quadratic time in the number of non-zeros for each expert.

Algorithm 1 Partial Sparse Block Inversion
Require: Cholesky factor L of size JB × JB in block-sparse-row (BSR) format with J × J total blocks, block size B and N non-zero blocks; data array d of size N × B × B, column-block-indices r of size N, row-block-pointer p of length J + 1, and lookup table M of dimension J × J.
Ensure: The lower part of the symmetric partial inversion Z is computed in BSR format with the same column-block-indices r and row-block-pointer p, and data array z of size N × B × B.
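Since Algorithm 1 is specified above only via its inputs and outputs, the following dense sketch illustrates the underlying Takahashi recursion Z_{ij} = δ_{ij}/d_i − Σ_{k>i} L_{ki} Z_{kj} adapted from [38]. Algorithm 1 runs the same backward recursion, but only over the symbolically non-zero blocks of L in BSR format; this is our own simplified illustration, not the block implementation itself.

```python
import numpy as np

def takahashi_inverse(A):
    """Inverse of an SPD matrix via the Takahashi recursion
    Z_ij = delta_ij / d_i - sum_{k>i} L_ki Z_kj  (for j >= i),
    based on the LDL^T form derived from the Cholesky factor.
    Dense illustration; restricting (i, j) to the sparsity pattern
    of L yields the partial inversion computed by Algorithm 1."""
    n = A.shape[0]
    C = np.linalg.cholesky(A)    # A = C C^T
    d = np.diag(C) ** 2          # D of the LDL^T form
    L = C / np.diag(C)           # unit lower-triangular factor
    Z = np.zeros_like(A)
    for i in range(n - 1, -1, -1):         # backward over rows
        for j in range(n - 1, i - 1, -1):  # upper triangle, right to left
            s = L[i + 1:, i] @ Z[i + 1:, j]
            Z[i, j] = (1.0 / d[i] if i == j else 0.0) - s
            Z[j, i] = Z[i, j]              # symmetry
    return Z

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 6))
A = G @ G.T + 6 * np.eye(6)
assert np.allclose(takahashi_inverse(A), np.linalg.inv(A))
```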

A.3 Hyperparameter Estimation
In Section 3, we introduced CPoE for fixed hyperparameters θ, where implicitly all distributions are conditioned on θ; however, we omitted the dependencies on θ in most cases for the sake of brevity. Similar to full GP, sparse GP or PoEs, the log marginal likelihood (LML) can be used as an objective function for optimizing the few hyperparameters θ.

A.3.1 Deterministic Optimization
The log marginal likelihood of our model formulated in Section 3.3 can be written as

L(θ) = log q(y|θ) = log N(y|0, P) = −1/2 ( y^T P^{-1} y + log|P| + N log 2π )

with P = HS^{-1}H^T + V. Since P is dense, we can apply the matrix inversion lemma (B2) and the determinant lemma (B3) to P and exploit |F| = 1, yielding

y^T P^{-1} y = y^T V^{-1} y − y^T V^{-1} H Σ H^T V^{-1} y and log|P| = log|Σ^{-1}| + log|Q| + log|V|,

so that all involved quantities Σ^{-1}, V and Q are sparse. For efficient parameter optimization, the derivative of the log marginal likelihood with respect to each parameter in θ is needed, for which the derivations are provided in Appendix C.3. Thereby, also some parts of the covariance matrix Σ are needed, which is explained in Section A.2. Alternatively to the marginal likelihood, we can maximize a lower bound of it, which generalizes our model so that we recover a range of well-known sparse global GP models for C → J, as discussed in Section A.1 [6, 11]. For moderate sample size N, deterministic optimization with the full batch y can be performed. That means the log marginal likelihood for the whole data is computed, for which the sparse system of equations with the sparse posterior precision as well as the partial inversion of the posterior covariance has to be solved. In particular, the functions for computing L(θ) and ∂L(θ)/∂θ for each θ and the full data y are repeatedly called by a numerical minimizer. Fig. A1 illustrates the performance of this deterministic batch hyperparameter optimization, depicting the convergence of the log marginal likelihood, the average KL divergence and the 95%-coverage (the latter two quantities are exactly defined in Appendix F) for different numbers of experts J compared to full GP. The N = 2048 data samples are generated with a D = 2-dimensional SE kernel, and the test KL and coverage mean values are reported for N_test = 1000 samples with 5 repetitions. We used γ = 1 and C ∈ {1, . . . , 7}. We observe that the log marginal likelihood and the KL improve for increasing C, and the deterministic parameter estimates converge to those of full GP with an increasing number of function calls. It is interesting to observe that, also for smaller C values, the coverage of our methods is consistent. In particular, the coverage is slightly too large, meaning our confidence information is conservative. This is due to the aggregation based on the covariance intersection method with normalized weights, which guarantees consistent second-order information.
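In code, this deterministic optimization is a standard scipy minimizer loop; the sketch below uses a toy quadratic as a stand-in for the actual objective, so that the driver runs. The callable neg_lml is a hypothetical placeholder for −L(θ) and its gradient from Appendix C.3, not part of the released package API.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(theta):
    """Stand-in for -L(theta) and its gradient -dL/dtheta (Appendix C.3);
    a toy quadratic here so that the driver actually runs."""
    value = 0.5 * np.sum((theta - 1.0) ** 2)
    grad = theta - 1.0
    return value, grad

theta0 = np.full(3, 0.1)                   # initial hyperparameters
res = minimize(neg_lml, theta0, jac=True, method="L-BFGS-B",
               bounds=[(1e-5, None)] * 3)  # keep variances positive
theta_hat = res.x
```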

A.3.2 Stochastic Optimization
The method presented in the previous section works fine for small datasets; however, in order to scale this parameter optimization to a larger number of samples N in competitive time, stochastic optimization techniques have to be exploited, similarly as done for the global sparse GP model (SVI [9]; REC [11]; IF [10]). In the approximation method REC [11], the recursive derivatives are propagated exactly, which would also be possible for our model; however, it turned out that in practice the differences in accuracy are very small when using the hybrid approach IF of [10] instead. Thereby, the independent factorization of the log marginal likelihood is used for the computations in the optimization part, whereas the exact posterior is used for inference and prediction. Adapted to our setting, the independently factorized log marginal likelihood log q̃(y|θ) can be approximated by a sum of local expert terms. The difference compared to the deterministic case in (A1) and to [10] for the global sparse model is the independent prior q(a_j) instead of q(a) and p(a), respectively. In the approximate case, we can write

L̃(θ) = \sum_{j=1}^J l_j(θ) with l_j(θ) = −1/2 ( y_j^T P_j^{-1} y_j + log|P_j| ),

which has the advantage that it decomposes into the J terms l_j in the sum, so that it can be used for stochastic optimization. This constitutes a very fast and accurate alternative for our method, as shown in Figure A2, and is exploited in Section 4 for large datasets.
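The following sketch shows a mini-batch loop over the expert terms l_j; adam_step is a from-scratch ADAM update in the spirit of Section A.6, and grad_neg_lj is a hypothetical placeholder for −∂l_j/∂θ (a toy gradient here so the loop runs).

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update; state carries the running moment estimates."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    theta = theta - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return theta, (m, v, t)

def grad_neg_lj(theta, j):
    """Toy stand-in for -dl_j/dtheta of the j-th expert term."""
    return theta - 1.0

J, rng = 16, np.random.default_rng(0)
theta = np.full(4, 0.1)
state = (np.zeros_like(theta), np.zeros_like(theta), 0)
for epoch in range(15):           # max. number of epochs as in the experiments
    for j in rng.permutation(J):  # one mini-batch = one expert j
        theta, state = adam_step(theta, grad_neg_lj(theta, j), state)
```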

A.3.3 Prior on Hyperparameters
Alternatively to the log marginal likelihood (LML) maximization presented above, the maximum a posteriori (MAP) estimator for θ can be used. It maximizes the log of the posterior distribution p(θ|y) ∝ p(y|θ) p(θ), where p(θ) is a suitable prior on the hyperparameters, yielding log p(θ|y) = log p(y|θ) + log p(θ) up to an additive constant.
In the following, we assume p(θ) = ∏_j p(θ_j) and a log-normal prior for each hyperparameter, p(θ_j) = logN(θ_j | ν_j, λ_j²), with means ν_j and variances λ_j². For the deterministic case, the MAP estimator can be straightforwardly computed by adding the log prior on θ to the batch log marginal likelihood, i.e. log p(θ|y) = L(θ) + log p(θ). Similarly, for the stochastic case, the stochastic MAP objective can be decomposed as

log p(θ|y) ≈ \sum_{j=1}^J ( l_j(θ) + (1/J) log p(θ) ),

where l_j(θ) is the jth term of the stochastic marginal likelihood, so that it can again be used for stochastic mini-batch optimization. An example using priors on the hyperparameters is presented in Section 4.4.
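For illustration, the log-normal prior term and its per-mini-batch contribution can be written as below; l_j, ν and λ are hypothetical placeholders for the expert objective and the prior parameters.

```python
import numpy as np

def log_normal_logpdf(theta, nu, lam):
    """Elementwise log of the log-normal density logN(theta | nu, lam^2)."""
    return (-np.log(theta * lam * np.sqrt(2.0 * np.pi))
            - (np.log(theta) - nu) ** 2 / (2.0 * lam ** 2))

def stochastic_map_term(theta, j, J, l_j, nu, lam):
    """j-th stochastic MAP term: l_j(theta) + (1/J) * log p(theta),
    so that one pass over all J experts adds the full log prior once."""
    return l_j(theta, j) + log_normal_logpdf(theta, nu, lam).sum() / J
```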
A.4 Complexity
Using only W ≤ J experts for prediction reduces the time complexity for testing. This might be interesting if we want to make fast predictions for many points N_t. For reasonable values of W, for instance W = 1, W = C or W = Z = log(N)C (used in the prediction aggregation), preliminary experiments show very comparable performance. Note that the consistency properties of the covariance intersection method are preserved as long as the weights are normalized over the used W experts. Table 1 compares the asymptotic complexities with other GP algorithms. It is interesting that for α = 1, our algorithm has the same asymptotic complexity for training as sparse global GP with M_g = B global inducing points, but we can have M_l = LJ = γBJ = γN total local inducing points. Thus, our approach allows many more total local inducing points M, in the order of N (e.g. M = 0.5N with C = 2), whereas for sparse global GP usually M_g ≪ N. This has the consequence that the local inducing points can cover the input space much better and therefore represent much more complicated functions. As a consequence, there is also no need to optimize the local inducing points, resulting in much fewer parameters to optimize. Consider the following example with N = 10'000 in D = 10 dimensions. Suppose a sparse global GP model with M_g = 500 global inducing points. A CPoE model with the same asymptotic complexity has a batch size B = M_g = 500 and α = 1. Therefore, we have J = N/B = 20 experts, and we choose C = 2 and γ = 1/2, such that we obtain L = γB = 250 local inducing points per expert and M = γN = 5'000 total inducing points. Further, the number of hyperparameters to optimize for a SE kernel is M_g D + |θ| = 5012 for global sparse GP, whereas for CPoE there are only |θ| = 12. For our method, the time and space complexity is linear in the number of samples N and the number of experts J, which makes our approach highly scalable. The approximation quality parameter α = Cγ enters cubically/quadratically in the time/space complexity. The optimal approximation quality (and thus equivalence to full GP) is achieved for α = J, which implies C = J and γ = 1. However, this is clearly not feasible for big datasets, and thus moderate values of C and γ have to be selected to trade off time and accuracy, as illustrated in Table A3 and Fig. A4 in the Appendix.

Table A3: KLs to full GP (above) and times (below) of our method CPoE for varying C and γ for the experiment in Section 3.6.3; compare also Fig. A4.
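The arithmetic of the example above can be spelled out as follows; this is pure bookkeeping mirroring the numbers in the text.

```python
# Parameter counts for the example above (SE kernel with D lengthscales).
D, N, Mg = 10, 10_000, 500
n_theta = D + 2                # lengthscales + signal + noise variance: 12
sgp_params = Mg * D + n_theta  # sparse global GP, optimized inducing inputs: 5012

B = Mg                         # CPoE batch size matching the complexity
J = N // B                     # 20 experts
C, gamma = 2, 0.5              # alpha = C * gamma = 1
L = int(gamma * B)             # 250 local inducing points per expert
M_total = int(gamma * N)       # 5000 total local inducing points (not optimized)
cpoe_params = n_theta          # only the 12 kernel hyperparameters
```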

A.5 Relation to Deep Structured Mixtures of GPs
There are several approaches exploiting local experts, as discussed in Section 2.2. In particular, the recent approach Deep Structured Mixtures of Gaussian Processes (DSMGP) [29] shares some similarities with our approach. However, our approach has a main advantage over DSMGP, namely the continuous predictive distribution, as illustrated in Figure A6. In the following, we briefly summarize their approach, discuss the similarities and differences to our approach, and compare the methods in two illustrative experiments. The GP approximation DSMGP exploits a sum-product network of local GPs, which corresponds to a mixture of independent base experts as in ordinary PoEs. This approach allows analytic posterior inference; however, in order to compute efficient predictions, the posterior of the DSMGP is projected to the closest GP, i.e. the GP with minimal KL divergence from the DSMGP. This has the effect that the aggregated predictive distribution might be discontinuous, as illustrated in Figure A6. This is a common disadvantage of local aggregation methods based on predictions from a single region. DSMGP uses independent base experts similarly to the local PoE methods discussed in Section 2.2; however, the predictions are averaged by a mixture over different partitions of the input space. Our approach, on the other hand, directly models correlations between the base experts, which recovers the exact dependencies between the different partitions, leading to smooth predictions. We compared both methods on two illustrative examples. In particular, we used the data from the main example in the GitHub repository³ of their approach, which is illustrated on the left in Figure A6. The N = 100 data samples are generated from a sine, the number of splits per product node is K = 4, the number of children per sum node is V = 3, and the minimum number of data points per expert is M = 20, corresponding to J = 5 base experts. We optimized the hyperparameters with full GP and fixed them for all approximation methods. The results for this kind of data are consistent with the findings in [29], where DSMGP improves over the other local independent methods (minVar, GPoE, BCM). However, the global sparse method and in particular our method CPoE with C = 2 significantly outperform DSMGP, as demonstrated by the KL to full GP. Similarly, we ran their algorithm on the example of Figure 2, where the generated data has a smaller lengthscale and bigger observation noise. The parameters for DSMGP were set to K = 4, V = 3, M = 10, which corresponds to J = 6 independent experts. This more complex setting illustrates the main limitation of DSMGP, the discontinuous predictions, as depicted in Figure A6 on the right. Moreover, this is also reflected in the accuracy in terms of KL to full GP. We ran some preliminary comparisons on the experiment depicted in Figure 9, with similar performance as obtained in the two examples shown in Figure A6. This comparison can be found in our GitHub repository.⁴ However, it is hard to compare the running times of the two algorithms, which are based on Python and Julia, respectively. A proper empirical comparison in a time vs. accuracy sense would therefore require reimplementing their algorithm in Python.

A.6 Implementation Details
All experiments were run on a standard laptop (Intel Core i7, 8 CPUs, 1.9 GHz). Our code is implemented in Python and is available on GitHub.⁵ It contains several examples⁶ ⁷ showing how to use it, together with the experiments from this paper. For solving the sparse linear system of equations, we used Cholmod [39] via the Python package scikit-sparse, which relies on sparse Cholesky decomposition. It would be advantageous to use/implement a sparse block Cholesky decomposition and solver which exploits our structure directly. This was indeed needed for computing some entries of the posterior covariance, since with the available implementations of partial matrix inversion we could not exploit the block sparsity and thus did not obtain competitive performance, as discussed in Section A.2. An efficient implementation of this part is presented in Algorithm 1. In our current implementation, the sizes of all partitions have to be equal, which is in theory not necessary, but it allows more efficient implementations, since the block structure can then be easily exploited in the computation of the sparse posterior precision. Using the KD-tree construction with J = 2^K, the sizes of the partitions differ by at most 1. Thus, if the partitions are not equal, the number of local inducing points is set to L = min_j B_j. Our implementation exploits the kernel and likelihood functions of GPy [40]. For the optimization of the hyperparameters, we used the L-BFGS-B algorithm from the Python package scipy in the deterministic full-batch case. For stochastic optimization, we used the stochastic optimizer ADAM [41] (implemented from scratch) with appropriate learning rates determined in preliminary experiments.
For the competitor methods, we used the implementations in GPy [40] for full GP and sparse global GP (the approach of [5]). For PoE, GPoE, BCM, RBCM and GRBCM, we implemented the corresponding aggregation algorithms in Python based on the GPy implementations of the independent experts, for the sake of comparison. For the stochastic version of SGP, the hybrid information filter approach in [10] and their implementation are used. We also ran the approaches REC [11] and SVI [9]; however, the former approach showed superior accuracy vs. time performance in preliminary experiments. For the sparse global GP model, there is the choice of optimized or fixed inducing points. For the same number of inducing points, the accuracy is obviously better with optimized inducing points; however, taking into account the time for optimizing them, we found in the experiments with batch optimization (i.e. also smaller datasets) that the fixed random-subset approach was superior. Therefore, we report the results for fixed inducing points (a random subset of the data) in the deterministic case and for optimized inducing points in the stochastic case. The reason is that the sparse global approximation with unknown inducing inputs has MD + |θ| (variational) parameters to optimize in the batch version. In the stochastic versions REC & IF there are as well MD + |θ| parameters, whereas SVI even has M + 0.5M² + MD + |θ| parameters, since the posterior mean and covariance have to be optimized.
On the other hand, full GP has only a few kernel hyperparameters |θ| to optimize. Our method CPoE (and also the independent PoEs) inherits this property, because there is no necessity to optimize the local inducing points, since their total number can be in the order of N. This is also true for the stochastic version of our algorithm. Assume for instance D = 8 and M = 100; with a SE kernel, full GP and CPoE have only |θ| = 10 parameters to optimize, whereas batch SGP, REC & IF have 810 and SVI even 5910. For fixed inducing points, SGP and IF also have only |θ| = 10 hyperparameters, which allows more inducing points, speeds up the optimization considerably and makes the accuracy vs. time comparison more competitive.
We used the KD-tree partition for our method, as discussed in Section 3.1, while in the PoE literature [12-16] often K-means is used for partitioning. However, for large J and N this is quite inefficient, and often the partition sizes of the experts differ significantly, which introduces an imbalance among the experts in the prediction aggregation as well as in the stochastic optimization. Therefore, we also used the KD-tree partition for these algorithms for the sake of comparison; a sketch of the construction is given below.
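A minimal sketch of such a KD-tree partition (recursive median splits along the widest dimension, giving J = 2^K parts whose sizes differ by at most one) could look as follows; this is our own illustration, not the packaged implementation.

```python
import numpy as np

def kd_partition(X, K):
    """Split the row indices of X into J = 2**K parts by recursive
    median splits along the dimension with the largest spread."""
    parts = [np.arange(X.shape[0])]
    for _ in range(K):
        new_parts = []
        for idx in parts:
            d = np.argmax(X[idx].max(axis=0) - X[idx].min(axis=0))
            order = idx[np.argsort(X[idx, d])]
            half = len(order) // 2
            new_parts += [order[:half], order[half:]]
        parts = new_parts
    return parts

X = np.random.default_rng(0).random((1000, 2))
partitions = kd_partition(X, K=5)  # J = 32 experts
```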
For assessing the quality of the different algorithms in the next sections, we report two quantities, the Kullback-Leibler (KL) divergence to full GP and the continuous ranked probability score (CRPS), both depending on the pointwise predictive distributions p(f*|y). The reported values always correspond to an average over N_test prediction points which are not contained in the training data; closed-form expressions for Gaussian predictives are sketched below.
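For univariate Gaussian predictive distributions, both quantities have standard closed forms; the following sketch states them explicitly (our own code, not the paper's evaluation script).

```python
import numpy as np
from scipy.stats import norm

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) between univariate Gaussians, e.g.
    full-GP predictive vs. approximate predictive at one test point."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def crps_gauss(m, v, y):
    """Closed-form CRPS of the Gaussian predictive N(m, v) at observation y."""
    s = np.sqrt(v)
    z = (y - m) / s
    return s * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z)
                - 1.0 / np.sqrt(np.pi))
```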

A.7 Experiments
In this section, we provide more details about the experiments in Section 4.

A.7.1 Synthetic Data
In this simulation study with synthetic GP data, we examine the accuracy vs. time performance of different GP algorithms for fixed hyperparameters. We generated N = 8192 data samples in D = 2 with 5 repetitions from the sum of two SE kernels with a shorter and a longer lengthscale (l_s = 0.125, v_s = 0.2 and l_l = 0.5, v_l = 1.1; see Fig. A5), such that both global and local patterns are present in the data. In Fig. 9, the results for the competitors with numbers of experts J = {1, 2, 4, . . . , 128} are shown in red, cyan and magenta. For our correlated PoEs, the results for the correlations C = {1, . . . , 12} are shown in green for J = 32 and γ = 0.5. In the first two plots, the superior performance of our method compared to the competitors in accuracy to full GP vs. time can be observed. Our method constitutes a fast and accurate method for a range of different approximation qualities. Moreover, in the third plot, one can observe that the confidence information is reliable already for small approximation orders, since it is based on the consistent covariance intersection method. This experiment is available in a notebook⁹ on GitHub.

A.7.2 Real World Data
In this section, we provide more details about the experiments with real-world data as summarized in Section 4. The datasets used in this paper, as summarized in Table 3a, are from public data repositories and can be downloaded as shown in a Jupyter notebook¹⁰ on GitHub. In particular, it shows where and how to download the raw datasets and which variables are used as inputs and targets, where we followed the remarks on the repositories or of other authors who previously worked with these datasets. Finally, the preprocessing of the data is shown, where we standardized all variables to mean zero and standard deviation one (for elecdemand, see details below). We use N = min(0.9 N_tot, 1000) data samples for training and the rest for testing, except for kin and elecdemand, where we ran experiments with N_test = 3000 and N_test = 15288, respectively, such that we could also run full GP on a standard laptop. For each dataset, we fixed the number J of experts (given in Table 3a) such that the partitions/mini-batches have a reasonable size (≈ 500).
For the deterministic SGP we used M = 100 and for the stochastic SGP M ∈ {500, 1000} inducing points (more results are provided in Appendix F).
For our method CPoE, we ran the algorithm with C ∈ {1, 2, 3, 4} for the small and C ∈ {1, 2, 3} for the large datasets, always with γ = 1. For the stochastic versions, we used learning rates δ = 0.03 for the dataset kin2 and δ = 0.01 for the remaining datasets, for all methods. The maximum number of epochs is set to 15, together with a relative stopping criterion of 10^{-2}. We use a SE kernel with a different lengthscale per dimension, initialize all hyperparameters to 1, and set the global inducing points to a random subset of the data. These experiments are provided on GitHub.¹¹ ¹²

A.7.3 Application
This section contains additional details on the application described in Section 4.4, where our method is applied to the elecdemand time series [37], which contains the half-hourly measured electricity demand for one year, together with the corresponding temperature and an indicator whether it is a working day.
In particular, the preprocessed dataset contains the standardized electricity demand (mean 0, sd 1) as the response variable y, the normalized time as the first variable X₁ ∈ [0, 1], and the standardized temperature and the working-day indicator as X₂ and X₃, respectively. The data is depicted in the first two plots in Fig. 10, where we shifted the first and third variable in the second plot for the sake of clarity. We removed the last day, resulting in 364 days = 52 weeks = 13 "months" consisting of 4 weeks each. In each of the 13 "months", we used the first 3 weeks for training and the last week for testing the out-of-sample accuracy. To make it possible to run full GP as a comparison, we only used every 6th sample (corresponding to a measurement every 3h) of the training weeks for the actual training and the remaining ones for testing the in-sample accuracy. This gives N = 2184, N_IN = 10920 and N_OUT = 4368 samples, as depicted in the first plot in Fig. 10. With our CPoE model, it is straightforward to handle time series with covariates, as opposed to other time series methods [33-36]. The kernel k_θ depends on several hyperparameters θ, for which we use the parametrization in [33]. We assume a log-normal prior on θ, as described in Section A.3.3, where the corresponding means and variances are taken from Table 1 in [33]. Similarly as in the previous section, we ran full GP, SGP, PoEs and CPoE and optimized the hyperparameters deterministically using the MAP as objective function, taking into account the priors. For SGP, we used M ∈ {100, 200} fixed inducing points; for PoEs and CPoE, we used J = 13 partitions, which are obtained by splitting the first variable into J blocks. For CPoE, we used C ∈ {1, 2, 3, 4} and γ = 1. The results are provided in Table 3b, which again shows very competitive performance, also for a general kernel with priors on the hyperparameters.

B.0.3 Block Inversion
Given an invertible, symmetric block matrix

M = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix},

the inverse can be computed as

M^{-1} = \begin{pmatrix} A^{-1} + A^{-1} B S^{-1} B^T A^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} B^T A^{-1} & S^{-1} \end{pmatrix},

with the Schur complement S = C − B^T A^{-1} B.

B.0.4 Block Determinant
Given an invertible, symmetric block matrix M = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}, the determinant can be computed as

|M| = |A| |C − B^T A^{-1} B|.

B.0.6 Gaussian Conditioning
Given a joint Gaussian p(x_A, x_B) = N\left( \begin{pmatrix} µ_A \\ µ_B \end{pmatrix}, \begin{pmatrix} Σ_{AA} & Σ_{AB} \\ Σ_{BA} & Σ_{BB} \end{pmatrix} \right), the conditional can be computed as follows:

p(x_A | x_B) = N( x_A | µ_A + Σ_{AB} Σ_{BB}^{-1} (x_B − µ_B), Σ_{AA} − Σ_{AB} Σ_{BB}^{-1} Σ_{BA} ).

B.0.8 Product of Gaussians
Assume J Gaussians p_j(x) = N(x|µ_j, Σ_j) and a_j ∈ R. Then the (weighted) product can be written as

\prod_{j=1}^J p_j(x)^{a_j} = z N(x|µ, Σ) with Σ^{-1} = \sum_{j=1}^J a_j Σ_j^{-1} and µ = Σ \sum_{j=1}^J a_j Σ_j^{-1} µ_j,

where z is a normalization constant.

B.0.9 Entropy of Gaussian
The entropy H of p(x) = N(x|µ, Σ) with |x| = B is defined as

H[p] = −∫ p(x) log p(x) dx = 1/2 log|2πeΣ| = B/2 (1 + log 2π) + 1/2 log|Σ|,

where we use log as the natural logarithm and thus the entropy is measured in nats (natural units).

B.0.13 General Difference in KL
Let 1 ≤ C < J and 0 < γ ≤ 1 be fixed. For any C₂ ∈ {C, . . . , J}, we define the difference in KL, denoted as D^{(C,C₂)}[x], between the true distribution of x and two different approximate distributions, i.e.

D^{(C,C₂)}[x] = KL( p(x) || q_C(x) ) − KL( p(x) || q_{C₂}(x) ),

using the definition of KL (B11). Similarly, we define the difference in KL of a conditional distribution x|y, denoted as D^{(C,C₂)}[x|y], to be

D^{(C,C₂)}[x|y] = KL( p(x|y) || q_C(x|y) ) − KL( p(x|y) || q_{C₂}(x|y) ),

which also follows from the definition of KL (B11).

C.1 Additional Results
Proposition 9 (Marginal Likelihood; Proof 13) The marginal likelihood is q_{C,γ}(y|θ) = q_{C,γ}(y) = ∫ q_{C,γ}(y, a) da = N(y|0, P) with P = HS^{-1}H^T + V ∈ R^{N×N}, where all dependencies of the matrices on θ are omitted.

Fig. C8: Covariance W and precision Z = W^{-1} of the joint prior of different GP models. Compare Figure C7 for the corresponding covariance and precision matrices of the CPoE model. Note that we used H = K_{XA} K_{AA}^{-1} and V the same as in the local CPoE.

Proposition 10 (Prior Approximation II; Proof 15) Alternatively to Proposition 2, the prior approximation q(a) = N(a|0, S^{-1}) can be equivalently written as a product of individual Gaussians over a_{π⁺(j)} with precisions S_{(j)} = F̃_j^T Q_j^{-1} F̃_j, where F̃_j = [−F_j, I] and π⁺(j) = π(j) ∪ {j}. Further, the prior precision matrix can also be written as S = \sum_{j=1}^J S̃_{(j)}, where S̃_{(j)} ∈ R^{M×M} is the augmented matrix consisting of S_{(j)} ∈ R^{LC×LC} at the entries [π⁺(j), π⁺(j)] and 0 otherwise.
Proposition 11 (Prior Approximation III; Proof 16) Alternatively to Prop. 2 and Prop. 10, the prior approximation q(a) can be equivalently written in terms of the augmented matrices K̃^{-1}_{A_ϕ A_ϕ} ∈ R^{M×M}, consisting of K^{-1}_{A_ϕ A_ϕ} at the entries [ϕ, ϕ] and 0 otherwise, with T = |ϕ|.
Proposition 12 (Exact Diagonal of Prior; Proof 14) The precision matrix S_C of the prior approximation q_C(a) is exact on the diagonal, that is, tr{S_C K_{AA}} = JL, where JL is the dimension of the matrices.
Proposition 14 (Decreasing Prior Entropy; Proof 10) For any predecessor structure π_C as in Def. 2, the entropy H of the approximate prior q_C(a) is decreasing in C, that is, H[q_{C₂}(a)] ≤ H[q_C(a)] for 1 ≤ C ≤ C₂ ≤ J. Similar results can be obtained for the joint prior q_C(a, f, y).
From the last proposition we know that increasing the degree of correlation C always adds more information to the prior. In particular, the prior of completely independent PoEs (i.e. C = 1) encodes the least information, since all correlations between the experts are missing, whereas the prior of full GP incorporates the most information, since all correlations are modeled.
Proposition 15 (Prior Quality II) The prior approximation quality improvement D^{(C,T)}[a] in Prop. 6 can be equivalently written as a sum of conditional mutual information terms, D^{(C,T)}[a] = \sum_{j=1}^J I(a_j, a_{ϕ(j)} | a_{π(j)}) (compare Proof 7).

C.2 Proofs
Proof 1 (Proof of Prop. 1; Joint Distribution) The matrices in the conditional distributions (6) and (7) in Prop. 1 can be obtained via Gaussian conditioning (B6) from the assumed joint densities.

Proof 2 (Proof used in Def. 4; Joint Distribution II) In the case γ = 1, thus a_j = f_j and a = f, the joint distribution can be written as q(f, a, y).

Proof 3 (Proof of Prop. 5; Equality to Full GP) Full GP: For γ = 1, the joint distribution of our model is formulated in Def. 4 and Proof 2. For C = J, the predecessor set π_J(j) corresponds to {1, . . . , j − 1}, and thus the conditional variables are f_{π(j)} = f_{1:j−1}. The posterior q_J(f|y) is proportional to the joint distribution q_J(f, y) (see Proof 12) and is thus equal to the posterior distribution of full GP (1). Also the hyperparameter optimization is the same, since the marginal likelihood q_J(y) can be derived from the joint q_C(f, y) (see Proof 13). Further, in the prediction step, for C = J we have J₂ = J − C + 1 = 1 predictive expert, which is based on the full region ψ(J) = {1, . . . , J}. Therefore, we conclude that the two models under consideration are the same.
Sparse global GP: Similarly, for C = J but γ < 1, the posterior corresponds to that of sparse GP in (2.1). The prediction also simplifies to one predictive expert based on the full region. Also the marginal likelihood is the same for C = J, and it could be adapted as illustrated in Section A.1.

Independent local GP: For C = 1, the joint distribution factorizes as q(f, y) = \prod_{j=1}^J p(y_j|f_j) p(f_j), which is equal to (4). Prediction and hyperparameter optimization are handled similarly as above.
Proof 4 (Proof of Prop. 2; Prior Approximation) Here we prove the first part for the prior over a; the second part is proved in Proof 5. Using Prop. 10 (with Proof 15), the prior q(a) can be equivalently written as a product of LC-dimensional Gaussians for a_{π⁺(j)}. Each such Gaussian can be augmented to an M-dimensional Gaussian for a, with the corresponding L×L blocks placed at the entries [π⁺(j), π⁺(j)]. Further, the matrix F̃_j ∈ R^{M×M} has one sparse row at j, where F_j^i ∈ R^{L×L} is the ith part of F_j ∈ R^{L×L(C−1)}, which corresponds to the contribution of the ith predecessor π_i(j). By using the property in (B9), the original product q(a) is then recovered, and F corresponds to the matrix depicted in Fig. 5. Note that S is positive definite, since Q^{-1} is positive definite because each Q_j^{-1} is positive definite, which concludes the proof.
Proof 5 ((Sub)proof of Prop. 2; Projection Approximation) The projection can be equivalently written in terms of a sparse matrix whose jth row is non-empty, with H_j^i ∈ R^{B×L} the ith entry of H_j, which corresponds to ψ_i(j). Further, I_j ∈ R^{BJ×BJ} is a zero matrix with I ∈ R^{B×B} at [j, j]. For the original product of the projections, using the product rule of Gaussians in (B9), we obtain the stated form. Since V is positive definite, this concludes the statement.
Proof 6 (Proof of Prop. 6; Decreasing Prior KL) We first show the decomposition of the difference in KL. Starting with the definition in Def. B14, we expand both KL terms and cancel the common parts, where we used the definitions in Def. B14. The non-negativity then follows from the subproofs below.

Proof 7 (Proof of Subproof I of Proof 6) We prove the non-negativity of D^{(C,T)}[a]. We abbreviate q₁(a) = q_C(a) and q₂(a) = q_{C+T}(a). We recall property (iii) in Def. 2; thus we have π₂(j) = π₁(j) ∪ ϕ(j), where ϕ(j) are the additional predecessors of expert j in model 2 compared to model 1. In the following, we abbreviate π₁(j) = π(j), yielding

D^{(C,T)}[a] = \sum_{j=1}^J I( a_j, a_{ϕ(j)} | a_{π(j)} ) ≥ 0,

where ã_j = a_j ∪ a_{ϕ(j)} ∪ a_{π(j)} and I( a_j, a_{ϕ(j)} | a_{π(j)} ) is the conditional mutual information, which is always non-negative [42, p. 30], and therefore concludes the first part of the proof.
Proof 12 (Proof of Prop. 3; Posterior Approximation) The posterior approximation is q(a|y) ∝ q(y|a) q(a), where the first equality comes from the definition of conditional probabilities, the proportionality holds because the marginal likelihood q(y) is independent of a, and the last equality exploits Proof 11.

Proof 14 (Proof of Prop. 12; Exact Diagonal of Prior) Using Prop. 11, the trace tr{S_C K_{AA}} can be expanded accordingly. By construction, the matrices K̃^{-1}_{A_ϕ A_ϕ} contain K^{-1}_{A_ϕ A_ϕ} at the entries [ϕ, ϕ]. Therefore, the resulting product when multiplying with K_{AA} is a matrix with identity I_T at the position [ϕ, ϕ], with T = |ϕ|, and 0 on the diagonal elsewhere. The quantity above is then equal to JL.

Proof 15 (Proof of Prop. 10; Prior Approximation II) The prior approximation is a product of conditional Gaussians, for which the quadratic term inside the exponential of each individual Gaussian can be written with S_{(j)} = F̃_j^T Q_j^{-1} F̃_j ∈ R^{LC×LC} and F̃_j = [−F_j, I] ∈ R^{L×LC}, which proves the first part. We can augment this Gaussian for a_{π⁺(j)} ∈ R^{LC} to a Gaussian over a ∈ R^M, where S̃_{(j)} ∈ R^{M×M} is the augmented matrix consisting of S_{(j)} at the entries [π⁺(j), π⁺(j)] and 0 otherwise. Using (B9), the original product q(a) is recovered, and thus S = \sum_{j=1}^J S̃_{(j)} is positive definite, which concludes the proof.
Proof 16 (Proof of Prop. 11; Prior Approximation III) The prior q(a) can be written as a ratio of Gaussians per expert. Similarly to Proof 15, we can augment the CL-dimensional and the (C−1)L-dimensional Gaussians in the numerator and denominator, respectively, to M-dimensional Gaussians with precision K̃^{-1}_{A_ϕ A_ϕ}, consisting of K^{-1}_{A_ϕ A_ϕ} at the entries [ϕ, ϕ] and 0 otherwise. The result then follows with (B9).

Proof 17 (Proof of Prop. 4; Prediction Aggregation) The predictive posterior distribution is defined as the weighted product of the local predictions. Since the local predictions p(f*_j|y) = N(m*_j, v*_j) are all univariate Gaussians, we obtain via the product rule of Gaussians in (B9) directly p(f*|y) = N(f*|m*, v*) with v* = ( \sum_j β*_j / v*_j )^{-1} and m* = v* \sum_j β*_j m*_j / v*_j. Using the usual likelihood p(y*|f*) = N(f*, σ_n²) yields with (B7) the final noisy prediction p(y*|y) = ∫ p(y*|f*) p(f*|y) df* = N(m*, v* + σ_n²). Note that, in Definition 5 of the weights, we introduced a scaling factor Z. This Z in the exponent of the normalization of the weights has a sharpening effect, so that the informative experts obtain even more weight compared to the non-informative experts for more data N and more correlations C. This is a heuristic, but it showed quite robust performance in experiments. Moreover, the consistency properties are more relevant than the particular weights.
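A compact sketch of this aggregation step, i.e. covariance intersection with normalized weights, is given below; the particular raw weights are a hypothetical example, not the exact weights of Definition 5.

```python
import numpy as np

def ci_aggregate(m, v, beta):
    """Covariance-intersection aggregation of expert predictions N(m[j], v[j])
    with weights beta; normalizing the weights preserves consistency."""
    beta = np.asarray(beta) / np.sum(beta)
    prec = np.sum(beta / v)              # aggregated precision 1/v*
    mean = np.sum(beta * m / v) / prec   # aggregated mean m*
    return mean, 1.0 / prec

m = np.array([0.9, 1.1, 1.0])    # expert means m*_j
v = np.array([0.20, 0.10, 0.40]) # expert variances v*_j
beta = 1.0 / v                   # e.g. precision-based raw weights
m_star, v_star = ci_aggregate(m, v, beta)
```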
Proof 18 (Proof of Prop. 8; Local Predictions) The predictive conditional p(f*_j|a_{ψ(j)}) can again be derived via (B6) from the assumed joint distribution. Moreover, the local posteriors q(a_{ψ(j)}|y) = N(µ_{ψ(j)}, Σ_{ψ(j)}) are obtained from the corresponding entries ψ(j) of the mean µ and covariance Σ (via partial inversion, Section A.2) in Prop. 3. Finally, the local predictions p(f*_j|y) in Prop. 8 can then be computed with Gaussian integration (B7), yielding q(f*_j|y) = ∫ p(f*_j|a_{ψ(j)}) p(a_{ψ(j)}|y) da_{ψ(j)}, which corresponds to the desired quantities.

Proof 19 (Proof for Figure C7; Joint Prior Covariance) For the joint prior in Fig. C7, we show that we recover the marginal and conditional distributions q_C(a), q_C(f|a) and p(y|f). For q_C(a), the marginalization corresponds to selecting the corresponding mean and covariance, i.e. N(a|0, Σ_{aa}) = N(a|0, S^{-1}).
For q_C(f|a), we use Eq. (B6), yielding q_C(f|a) = N(f | Ha, V).

C.3 Derivative of LML
The log marginal likelihood in Section 3.6.2 in Eq. (A1) is proportional to

−1/2 ( y^T V^{-1} y − y^T V^{-1} H Σ H^T V^{-1} y + log|Σ^{-1}| + log|Q| + log|V| ).

In the following, we provide the partial derivative with respect to θ for each additive term.
In the last expression, the whole posterior covariance seems to be needed; however, it turns out that only the entries which are non-zero in the precision are required. The right term in the last expression equals sum( Σ ⊙ ∂Σ^{-1}/∂θ ), where ⊙ denotes pointwise multiplication. Therefore, it is enough to compute only sum( Σ ⊙ ∂Σ^{-1}/∂θ ), where Σ is the partial inversion (for more details see Section A.2), which is sparse as well and already computed for the local predictions in Prop. 8.
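The identity used here is tr(AB) = sum(A ⊙ B) for symmetric A and B, which is what allows restricting the computation to the sparsity pattern; a two-line numerical check (our own illustration):

```python
import numpy as np

# tr(A @ B) equals sum(A * B) for symmetric A and B, so only the entries of
# Sigma on the sparsity pattern of dSigma^{-1}/dtheta are ever needed.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)); A = A @ A.T  # symmetric
B = rng.standard_normal((5, 5)); B = B + B.T  # symmetric
assert np.isclose(np.trace(A @ B), np.sum(A * B))
```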

Appendix D Sequential Algorithm
The probabilistic equations in Section 3 can be equivalently formulated as the linear Gaussian system

a_j = F_j a_{π(j)} + γ_j, f_j = H_j a_{ψ(j)} + ν_j, y_j = f_j + ε_j,

with γ_j ∼ N(0, Q_j), ν_j ∼ N(0, V_j) and ε_j ∼ N(0, σ_n² I). Instead of the inference procedure described in Prop. 3, the posterior could alternatively be computed with sequential algorithms. Assuming C = 2 and π(j) = {j − 1}, the Kalman filter and smoother (e.g. [43]) provide an equivalent solution to the posterior distribution in Prop. 3. For C > 2 and a general neighbourhood set, the Gaussian (loopy) belief propagation algorithm or Gaussian expectation propagation (e.g. [43]) might constitute an interesting approach for sequential/online and distributed learning procedures. Together with the competitive results in the application to time series with covariates, this makes the idea very promising for future work.
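For the chain case C = 2, a generic Kalman filter over the expert sequence looks as follows; this is a textbook sketch under the state-space form above, with the system matrices assumed given, rather than the paper's implementation.

```python
import numpy as np

def kalman_filter(mu0, P0, Fs, Qs, Hs, Rs, ys):
    """Filtering for a_j = F_j a_{j-1} + gamma_j, y_j = H_j a_j + e_j,
    with gamma_j ~ N(0, Q_j) and e_j ~ N(0, R_j), where in our setting
    R_j = V_j + sigma_n^2 I collects projection and observation noise."""
    mu, P, filtered = mu0, P0, []
    for F, Q, H, R, y in zip(Fs, Qs, Hs, Rs, ys):
        mu, P = F @ mu, F @ P @ F.T + Q     # predict
        S = H @ P @ H.T + R                 # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu + K @ (y - H @ mu)          # update mean
        P = P - K @ H @ P                   # update covariance
        filtered.append((mu.copy(), P.copy()))
    return filtered
```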

Appendix E More Details about GPR
In this section we provide more details for Section 2.
Suppose we are given a training set of N pairs of inputs x_i ∈ R^D and noisy scalar outputs y_i generated by adding independent Gaussian noise to a latent function f(x), that is, y_i = f(x_i) + ε_i with ε_i ∼ N(0, σ_n²). We denote by y = [y_1, . . . , y_N]^T the vector of observations and by X = [x_1^T, . . . , x_N^T]^T ∈ R^{N×D} the matrix of inputs. We can model f with a Gaussian process (GP), which defines a prior over functions and can be converted into a posterior over functions once we have observed some data (consider e.g. [1]). To describe a GP, we only need to specify a mean function m(x) and a covariance function k_θ(x, x′), where θ is a set of a few hyperparameters. Thereby, k_θ is a positive definite kernel function (see [1]), for instance the squared exponential (SE) kernel

k_θ(x, x′) = σ_0² exp( −1/2 \sum_{d=1}^D (x_d − x′_d)² / l_d² )

with individual lengthscales l_1², . . . , l_D² and {σ_0, l_1, . . . , l_D} ⊆ θ. For the sake of simplicity, we assume m(x) ≡ 0; however, it could be any function. Given the training latent function values f = f(X) and a test latent function value f* = f(x*) at a test point x* ∈ R^D, the joint distribution p(f, f*) is Gaussian N(0, K_{[X;x*][X;x*]}). Thereby, we use the notation [A_1; A_2] for the matrix resulting from stacking A_1 ∈ R^{N_1×D} and A_2 ∈ R^{N_2×D} above each other, and K_{AB} ∈ R^{M_1×M_2} denotes the kernel covariance matrix with entries [K_{AB}]_{ij} corresponding to the kernel evaluation k_θ(a_i, b_j) with the corresponding rows a_i, b_j for any A ∈ R^{M_1×D} and B ∈ R^{M_2×D}. Typically, in GP regression, the likelihood is Gaussian, that is, p(y|f) = N(y|f, σ_n² I), and with Bayes' theorem (B8) we obtain analytically the predictive posterior distribution p(f*|y) = N(f*|µ*, Σ*) with

µ* = K_{x*X} (K_{XX} + σ_n² I)^{-1} y and Σ* = K_{x*x*} − K_{x*X} (K_{XX} + σ_n² I)^{-1} K_{Xx*}.

Alternatively to the standard derivation shown above, the posterior distribution over the latent variables f given the data y can be explicitly formulated as

p(f|y) ∝ \prod_{j=1}^J p(y_j|f_j) p(f_j|f_{1:j−1}), (1)

where the data is split into J mini-batches of size B, i.e. D = {y_j, X_j}_{j=1}^J with inputs X_j ∈ R^{B×D}, outputs y_j ∈ R^B and the corresponding latent function values f_j = f(X_j) ∈ R^B. In (1) we used the notation f_{k:j} indicating [f_k, . . . , f_j], and the conditionals p(f_j|f_{1:j−1}) can be derived from the joint Gaussian p(f_j, f_{1:j−1}) = N(0, K_{[X_j;X_{1:j−1}][X_j;X_{1:j−1}]}) via Gaussian conditioning (B6). The corresponding graphical model of (1) is depicted in Figure 1(a)i). Given the posterior over f|y, the predictive posterior distribution from above is equivalently obtained as p(f*|y) = ∫ p(f*|f) p(f|y) df via Gaussian integration (B7), where p(f*|f) is derivable from the joint via (B6). The graphical model of the prediction procedure is depicted in Figure 1(b)i). We present this alternative two-stage procedure to highlight later connections of our model with full GP.
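As a self-contained reference, the following numpy sketch implements exactly these O(N³) formulas; it is our own minimal illustration, not the paper's implementation.

```python
import numpy as np

def se_kernel(A, B, l, s0):
    """SE kernel with per-dimension lengthscales l and signal std s0."""
    d2 = (((A[:, None, :] - B[None, :, :]) / l) ** 2).sum(-1)
    return s0 ** 2 * np.exp(-0.5 * d2)

def gp_predict(X, y, Xs, l, s0, sn):
    """Exact GP predictive mean and variance at test inputs Xs."""
    K = se_kernel(X, X, l, s0) + sn ** 2 * np.eye(len(X))
    Ks = se_kernel(Xs, X, l, s0)
    mu = Ks @ np.linalg.solve(K, y)
    var = (se_kernel(Xs, Xs, l, s0).diagonal()
           - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T)))
    return mu, var

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu, var = gp_predict(X, y, np.linspace(-3, 3, 100)[:, None],
                     l=np.array([1.0]), s0=1.0, sn=0.1)
```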

E.1 Global Sparse GPs
Sparse GP regression approximations based on global inducing points reduce the computational complexity by introducing M ≪ N inducing points a ∈ R^M that optimally summarize the dependency of the whole training data globally; compare the graphical model in Figure 1b). In order to find optimal inducing inputs A and hyperparameters θ, a sparse variant of the log marginal likelihood can be used, e.g. [5-7]. In particular, the authors in [5] proposed to maximize a variational lower bound to the true GP marginal likelihood, which has the effect that the sparse GP predictive distribution converges to the full GP predictive distribution as the number of inducing points increases. For larger datasets, stochastic optimization has been applied, e.g. [8-11], to obtain faster and more data-efficient optimization procedures. For recent reviews on the subject, consider e.g. [1, 3, 19]. A further class of approximations is based on aggregating local experts [15, 27]; we refer to [19] for a recent overview.
Simple baseline methods are the minimal variance (minVar) and the nearest expert (NE) aggregation, where only the prediction of the expert with minimal variance or of the nearest expert is used, respectively. Although both of these methods often show surprisingly good performance, they suffer from a major disadvantage, namely serious discontinuities at the boundaries between the experts (see for instance Fig. 2), and are thus often not useful in practice. This is also the main limitation of all local methods based only on the prediction of one expert (e.g. [23-25, 44]), and it was one of the reasons for introducing smooth PoEs with combined experts. Since in basically all cases minVar is better than NE (which is also consistent with the findings in [15]), we only compare our method to minVar and not NE, for the sake of simplicity.