Abstract
Gaussian process regression is a flexible regression scheme, but it suffers from high computational complexity owing to the inversion of a matrix whose size equals the number of training samples. Aggregation methods form one class of approximation techniques for reducing this complexity. In this paper, we propose a novel aggregation method, Nested Aggregation of Experts using Inducing Points (NAE-IP), which extends a conventional method and enables dimensionality reduction by making use of the idea of linear sketching. The proposed method offers several options for selecting inducing points. Some options introduce the test points of interest as inducing points, albeit at the cost of slightly higher computational complexity. The other options, which exploit less informative inducing points, can yield a significant reduction of the computational complexity. The proposed NAE-IP is theoretically guaranteed to be consistent under certain conditions. Results of our computational experiments using synthetic and real data show that the proposed method achieves lower prediction error, and even lower computing time, than conventional methods.
1 Introduction
Gaussian process regression (GPR or full GPR) (Rasmussen and Williams 2006) is a nonparametric regression model that assumes a Gaussian process prior on regression functions. Its applications include geostatistics (Cressie 1993; Stein 1999), data visualization (Lawrence 2005), reinforcement learning (Deisenroth et al. 2015), multitask learning (Ashton and Sollich 2012), and distributed learning (Tavassolipour et al. 2020), to mention a few. Despite its advantage of allowing nonlinear regression, its computational complexity and memory requirements can be a serious problem when the number N of training samples is large. Full GPR (by which we mean GPR with no approximation) involves inversion of an \(N\times N\) matrix, so that training takes \({\mathcal {O}}(N^3)\) time (via conventional methods such as Gauss–Jordan elimination or LU decomposition). This restricts the applicability of full GPR to problems with \(N\lesssim 10^4\).
In order to circumvent this limitation, various approximation methods have been proposed. These can be divided into two main categories, global and local approximations (Liu et al. 2020). Global approximations replace the global representation of the \(N\times N\) matrix with small-sized matrices, typically by using some training points or virtual points, called inducing points or pseudo data points (Snelson and Ghahramani 2005; Quiñonero-Candela and Rasmussen 2005; Wilson and Nickisch 2015; Bauer et al. 2016). This idea is categorically termed sparse GP approximation. Sparse GP using m inducing points reduces the time complexity to \({\mathcal {O}}(Nm^2)\). Locations of the inducing points can furthermore be optimized via stochastic variational inference (Hensman et al. 2013). However, sparse GP methods are not suitable when the underlying function has quickly varying features, because in such cases they require a large number of inducing points to achieve good performance, yielding high complexity (Bui and Turner 2014). Local approximations, on the other hand, split the training data into a number of subdatasets, assign an “expert” to each of them, and summarize the local predictions made by these experts to arrive at the final prediction. This procedure enables us to capture such quickly varying features. One of the state-of-the-art local approximations is the aggregation method, which includes product-of-experts (PoE) (Hinton 2002), generalized PoE (GPoE) (Cao and Fleet 2014), Bayesian committee machine (BCM) (Tresp 2000), robust BCM (RBCM) (Deisenroth and Ng 2015), generalized RBCM (GRBCM) (Liu et al. 2018), query-aware BCM (QBCM) (He et al. 2019), and nested pointwise aggregation of experts (NPAE) (Rullière et al. 2018). Different aggregation methods summarize the local predictions of the experts using different schemes.
The time complexity of the aggregation methods other than NPAE, with subdataset size \(n_0\), is reduced to \({\mathcal {O}}(Nn_0^2) + {\mathcal {O}}(C N n_0)\), where C is independent of N and varies depending on the method.
An important theoretical property for the aggregation methods is consistency, which means that the aggregated prediction converges to the value of the true underlying function when N approaches infinity. The aggregation methods without consistency do not necessarily yield good predictions even in largesample situations. NPAE and GRBCM are proven to have consistency under appropriate conditions (Bachoc et al. 2017, 2021; Liu et al. 2018). Furthermore, NPAE usually achieves better predictive performance than other methods by using richer information but at the same time requires higher computational complexity.
In this paper, we propose a novel aggregation method inspired by NPAE. We first generalize the prediction of NPAE by using the idea of low-dimensional projection known as sketching (Liberty 2013; Woodruff 2014) of the training samples, and then extend it to more informative versions by introducing inducing points. With its higher flexibility, this method is expected to achieve a better tradeoff between predictive performance and computational complexity. We name the proposed method Nested Aggregation of Experts using Inducing Points (NAE-IP). Gaussian process approximation via sketching has also been considered by Calandriello et al. (2019), but their construction is based on matrix sketching and falls within the category of sparse GP methods. The dimensionality reduction in NAE-IP is different from that of sparse GP in that NAE-IP exploits linear sketching of signals, extending NPAE. Furthermore, NAE-IP, similarly to NPAE, allows parallelization of part of the processing by employing a block-diagonal sketching matrix. NAE-IP is expected to be advantageous on two alternative fronts: one option prioritizes predictive performance at the cost of an increase in computational complexity, and another uses a less informative set of inducing points while allowing a reduction of the computational complexity. Furthermore, we prove that NAE-IP is consistent under certain conditions. Simulation results show that the proposed method achieves lower prediction errors than conventional aggregation methods, while requiring less computing time than the original NPAE.
In the rest of the paper, we use the following notations. Boldface indicates a vector or matrix. Superscripts \((\cdot )^{\mathrm {T}}\) and \((\cdot )^{-\mathrm {T}}\) denote the transpose and the inverse of the transpose, respectively. \(\varvec{0}\), \(\varvec{O}\), and \(\varvec{I}\) stand for the zero vector, zero matrix, and identity matrix, respectively. \(\Vert \cdot \Vert\) represents the \(\ell _2\)-norm. \({\mathcal {N}}(\varvec{m},\varvec{\Sigma })\) is the Gaussian distribution with mean \(\varvec{m}\) and covariance \(\varvec{\Sigma }\). \(\mathrm {Cov}[\varvec{x},\varvec{y}]\) means the covariance matrix of random vectors \(\varvec{x}\) and \(\varvec{y}\). \(\ker \varvec{A}:=\{\varvec{x}:\varvec{Ax}={\mathbf {0}}\}\) is the kernel (null space) of the matrix \(\varvec{A}\). \(\det {[\cdot ]}\) and \(\mathrm {Tr}[\cdot ]\) stand for the determinant and trace operators, respectively. \([\cdot ]_{ab}\), \([\cdot ]_a\), and \([\cdot ]_{[a][b]}\) denote the (a, b)th element of a matrix, the ath row of a matrix or ath element of a vector, and the (a, b)th block of a block matrix, respectively. \(\mathrm {diag}[\cdot ]\) is the diagonal matrix composed of the elements in the square brackets.
2 Gaussian process regression and aggregation methods
2.1 Full GPR
Consider full GPR on a region \({\mathcal {Q}}\subset {\mathbb {R}}^D\). Given a training dataset with N samples, \({\mathcal {D}}=\{(\varvec{x}_n,z_n)\in {\mathcal {Q}}\times {\mathbb {R}}\}_{n=1,\ldots ,N}\), the regression model is \(z_n = f(\varvec{x}_n) + \epsilon _n~(n=1,\ldots ,N)\),
where the regression function f is assumed to follow a Gaussian process (GP), and where the residual error \(\epsilon _n\) is assumed to be a white Gaussian noise with mean 0 and variance \(\sigma ^2\), that is, one has \([\epsilon _1,\ldots ,\epsilon _N]^{\mathrm {T}}\sim {\mathcal {N}}({\mathbf {0}},\sigma ^2\varvec{I})\). The mean function of the GP can be assumed to be 0 without loss of generality. The covariance function \(k(\cdot ,\cdot )\) of the GP represents properties of the regression function. Commonly used covariance functions are the squared exponential (SE) function:
and the Matérn(\(\nu +1/2\)) function:
where \(r=\sqrt{(\varvec{x}-\varvec{x}')^\mathrm {T}\varvec{L}^{-1}(\varvec{x}-\varvec{x}')}\) is the Mahalanobis distance between \(\varvec{x}\) and \(\varvec{x}'\) with \(\varvec{L}=\mathrm {diag}[\ell _1,\ldots ,\ell _D]\), where \(\nu \in {\mathbb {N}}^+\) is a model parameter for the Matérn function, and where \(\sigma _f^2>0\) and \(\ell _d>0~(d=1,\ldots ,D)\) are hyperparameters. The hyperparameters of these models are thus \(\varvec{\varTheta } = \{\sigma _f^2, \{\ell _d\}_{d=1,\ldots ,D}, \sigma ^2\}\), the values of which may be determined by maximizing the log marginal likelihood
where \(\varvec{X} = [\varvec{x}_{1},\cdots ,\varvec{x}_{N}]^{\mathrm {T}} \in {\mathbb {R}}^{N \times D}\), \(\varvec{z} = [z_{1},\cdots ,z_{N}]^{\mathrm {T}} \in {\mathbb {R}}^N\), and where the covariance matrix \(\varvec{K}(\varvec{X},\varvec{X}')\) is such that \([\varvec{K}(\varvec{X},\varvec{X}')]_{nn'}=k(([\varvec{X}]_n)^{\mathrm {T}},([\varvec{X}']_{n'})^{\mathrm {T}})\).
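The kernel and the log marginal likelihood above can be sketched in NumPy. The function names are ours; the SE kernel is assumed to take the common form \(k_{\mathrm {SE}}(\varvec{x},\varvec{x}')=\sigma _f^2\exp (-r^2/2)\) with the Mahalanobis distance r defined above, and the log marginal likelihood is the standard zero-mean Gaussian one:

```python
import numpy as np

def se_kernel(X, Xp, sigma_f2=1.0, ell=None):
    """SE covariance k(x,x') = sigma_f^2 * exp(-r^2 / 2), where
    r^2 = (x - x')^T L^{-1} (x - x') with L = diag(ell_1, ..., ell_D)."""
    if ell is None:
        ell = np.ones(X.shape[1])
    diff = X[:, None, :] - Xp[None, :, :]
    r2 = np.sum(diff ** 2 / ell, axis=-1)
    return sigma_f2 * np.exp(-0.5 * r2)

def log_marginal_likelihood(X, z, sigma_f2, ell, sigma2):
    """log p(z | X, Theta) for the zero-mean GP model z_n = f(x_n) + eps_n."""
    N = len(z)
    C = se_kernel(X, X, sigma_f2, ell) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(C)                   # C = L L^T, the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))
    return (-0.5 * z @ alpha
            - np.sum(np.log(np.diag(L)))        # = 0.5 * log det C
            - 0.5 * N * np.log(2.0 * np.pi))
```

The Cholesky route avoids forming the explicit inverse, but its cost is still cubic in N, which is exactly the bottleneck the aggregation methods below address.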
Assume that we wish to estimate the values of f at \(N_T\) test points \(\{\varvec{x}_{t}^*\}_{t=1,\ldots ,N_T}\). All the test points and the corresponding outputs are summarized as \(\varvec{X}^*= [\varvec{x}_{1}^*,\cdots ,\varvec{x}_{N_T}^*]^{\mathrm {T}} \in {\mathbb {R}}^{N_T \times D}\) and \(\varvec{z}^*= [z_1^*, \ldots , z_{N_T}^*]^{\mathrm {T}} \in {\mathbb {R}}^{N_T}\), respectively. The values of the regression function corresponding to \(\varvec{X}\) and \(\varvec{X}^*\) are summarized as \(\varvec{f} = [f(\varvec{x}_{1}),\cdots ,f(\varvec{x}_{N})]^{\mathrm {T}} \in {\mathbb {R}}^N\) and \(\varvec{f}^*= [f(\varvec{x}_{1}^*),\cdots ,f(\varvec{x}_{N_T}^*)]^{\mathrm {T}} \in {\mathbb {R}}^{N_T}\), respectively. On the assumption that the prior of f is GP, the joint distribution of \(\varvec{z}\) and \(\varvec{f}^*\) is given by
The predictive distribution of \(\varvec{f}^*\) given \({\mathcal {D}}\) is obtained as \(p(\varvec{f}^{*} \mid \varvec{X}^{*}, {\mathcal {D}}) = {\mathcal {N}}\left( \varvec{\mu }_{\mathrm {full}}(\varvec{X}^*), \varvec{\Sigma }_{\mathrm {full}}(\varvec{X}^{*})\right)\), where
The prediction of \(\varvec{z}^*\) is similarly obtained as \(p(\varvec{z}^{*} \mid \varvec{X}^{*}, {\mathcal {D}}) = {\mathcal {N}}\left( \varvec{\mu }_{\mathrm {full}}(\varvec{X}^*), \varvec{\Sigma }_{\mathrm {full}}(\varvec{X}^{*}) + \sigma ^2\varvec{I}\right)\). The matrix inversion in Eqs. (6) and (7) has \({\mathcal {O}}(N^3)\) time complexity and \({\mathcal {O}}(N^2)\) memory consumption, so that various approximations have been proposed to circumvent the complexity.
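The full-GPR predictive mean and covariance of \(\varvec{f}^*\) (the pair referred to as Eqs. (6) and (7), whose displays are not rendered in this version) are standard Gaussian conditioning; a minimal sketch with our own function name:

```python
import numpy as np

def full_gpr_predict(X, z, Xs, kernel, sigma2):
    """Full-GPR predictive distribution of f* at test inputs Xs:
       mu  = K(X*,X) (K(X,X) + sigma^2 I)^{-1} z
       Sig = K(X*,X*) - K(X*,X) (K(X,X) + sigma^2 I)^{-1} K(X,X*)."""
    C = kernel(X, X) + sigma2 * np.eye(len(X))   # the N x N matrix to invert
    Ksx = kernel(Xs, X)                          # K(X*, X)
    mu = Ksx @ np.linalg.solve(C, z)
    Sig = kernel(Xs, Xs) - Ksx @ np.linalg.solve(C, Ksx.T)
    return mu, Sig
```

With a dense 1-D training set and small noise, the posterior mean closely tracks a smooth underlying function, which is the behavior the approximations below try to preserve at lower cost.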
2.2 Aggregation methods
2.2.1 Problem settings and training
In this subsection, we introduce the settings common to the aggregation methods. The whole training dataset is first divided into p subsets, \({\mathcal {D}}_i = (\varvec{X}_i,\varvec{z}_i) ~(i=1,\ldots ,p)\), where each subset has \(n^{(i)}\) data points, namely, \(\varvec{X}_i \in {\mathbb {R}}^{n^{(i)} \times D}\) and \(\varvec{z}_i \in {\mathbb {R}}^{n^{(i)}}\). Submodels that make predictions using the subdatasets are referred to as “experts”. Each expert \({\mathcal {M}}_i\) makes its own prediction by using its own subdataset \({\mathcal {D}}_i\). The local prediction \(p_i(\varvec{z}^{*} \mid \varvec{x}^{*}, {\mathcal {D}}_i) = {\mathcal {N}}\left( \mu _i(\varvec{x}^*), \sigma ^2_i(\varvec{x}^*)\right)\) at a test point \(\varvec{x}^* :=\varvec{x}^*_t~(t=1,\ldots ,N_T)\) is obtained by applying full GPR to the subdataset \({\mathcal {D}}_i\), as
respectively, where \(\varvec{K}_{*i} = \varvec{K}(\varvec{x}^*,\varvec{X}_i)\), \(\varvec{K}_{i *} = \varvec{K}_{*i}^{\mathrm {T}}\), and \(\varvec{K}_{ij} = \varvec{K}(\varvec{X}_i,\varvec{X}_j)\) for \(i,j = 1,\ldots ,p\). Aggregation methods described in the subsequent sections integrate the experts’ predictions and yield the final prediction in different manners.
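The local mean and variance of an expert (whose display equations are not rendered in this version) are just full GPR restricted to \({\mathcal {D}}_i\); a sketch using the \(\varvec{K}_{*i}\), \(\varvec{K}_{ii}\) notation above, with the function name ours:

```python
import numpy as np

def expert_predict(Xi, zi, xs, kernel, sigma2):
    """Local prediction of expert M_i at a single test point x*:
       mu_i(x*) = K_{*i} (K_{ii} + sigma^2 I)^{-1} z_i,
       s2_i(x*) = k(x*,x*) + sigma^2 - K_{*i} (K_{ii} + sigma^2 I)^{-1} K_{i*},
    i.e. the predictive distribution of z* given D_i alone."""
    Ci = kernel(Xi, Xi) + sigma2 * np.eye(len(Xi))   # K_{ii} + sigma^2 I
    Ksi = kernel(xs, Xi)                             # K_{*i}, shape (1, n_i)
    sol = np.linalg.solve(Ci, Ksi.T)                 # (K_{ii}+s2 I)^{-1} K_{i*}
    mu_i = (Ksi @ np.linalg.solve(Ci, zi)).item()
    s2_i = (kernel(xs, xs) + sigma2 - Ksi @ sol).item()
    return mu_i, s2_i
```

Each expert only factorizes and solves an \(n^{(i)}\times n^{(i)}\) system, which is the source of the complexity reduction quoted for the aggregation methods.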
For learning hyperparameters \(\varvec{\varTheta }\), it is reasonable under these settings to introduce a factorized training process (Deisenroth and Ng 2015). In the process, the exact marginal likelihood (Eq. (4)) is approximated by assuming independence of the marginal likelihoods of the experts, i.e.,
where the experts share the same hyperparameters \(\varvec{\varTheta }\). The computational complexity for the training process can be reduced compared with full GPR thanks to the independence assumption.
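The factorized training objective replaces the exact marginal likelihood by a product over experts, so its log is a sum of per-expert Gaussian log-likelihoods; a sketch (function name ours):

```python
import numpy as np

def factorized_lml(subsets, kernel, sigma2):
    """Sum of the experts' independent log marginal likelihoods,
    approximating log p(z | X, Theta) under the independence assumption;
    each term only needs an n^(i) x n^(i) Cholesky factorization."""
    total = 0.0
    for Xi, zi in subsets:
        n = len(zi)
        Ci = kernel(Xi, Xi) + sigma2 * np.eye(n)
        L = np.linalg.cholesky(Ci)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, zi))
        total += (-0.5 * zi @ alpha
                  - np.sum(np.log(np.diag(L)))
                  - 0.5 * n * np.log(2.0 * np.pi))
    return total
```

With a single subset this reduces exactly to the full log marginal likelihood; splitting the data changes the value but keeps the per-iteration cost of hyperparameter optimization low.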
2.2.2 Predictions that ignore some covariance of experts
PoE (Hinton 2002), GPoE (Cao and Fleet 2014), BCM (Tresp 2000), and RBCM (Deisenroth and Ng 2015) are aggregation methods that ignore the covariance between experts. The original PoE and GPoE assume independence of the experts \(\{{\mathcal {M}}_i\}_{i=1,\ldots ,p}\). BCM and RBCM assume conditional independence of the experts given the value \(f(\varvec{x}^*)\) of the regression function at a test point \(\varvec{x}^*\). The aggregated prediction \(p\left( \varvec{z}^{*} \mid \varvec{x}^{*}, \{\mu _i(\varvec{x}^*),\sigma _i^2(\varvec{x}^*)\}_{i=1,\ldots ,p}\right)\) with mean \(\mu _{\mathrm {poe/bcm}}(\varvec{x}^*)\) and variance \(\sigma ^2_{\mathrm {poe/bcm}}(\varvec{x}^*)\) can be collectively formulated as
where \(\sigma ^2_{**} = k(\varvec{x}^*,\varvec{x}^*) + \sigma ^2\), and where \(\beta _{i1}\) and \(\beta _{i2}\) are the weights assigned to expert \({\mathcal {M}}_i\). The weight choices recommended in the respective papers, together with their constraints, are summarized in Table 1.
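Since the collective display equation and Table 1 are not rendered in this version, the sketch below shows one plausible reading of the generic combination rule, with the \(\beta\) weights left as free parameters (the precise recommended values are in Table 1 of the paper):

```python
import numpy as np

def aggregate_poe_bcm(mu, s2, beta1, beta2, s2_prior):
    """Combine p local predictions (mu_i, s2_i) with weights beta_{i1}, beta_{i2}:
       1 / s2_agg = sum_i beta_{i1} / s2_i + (1 - sum_i beta_{i2}) / s2_prior,
       mu_agg     = s2_agg * sum_i beta_{i1} * mu_i / s2_i,
    where s2_prior plays the role of sigma_**^2 = k(x*,x*) + sigma^2.
    The (R)BCM family keeps the prior-correction term; the (G)PoE family
    cancels it through the beta_{i2}."""
    mu, s2 = np.asarray(mu, float), np.asarray(s2, float)
    prec = np.sum(beta1 / s2) + (1.0 - np.sum(beta2)) / s2_prior
    s2_agg = 1.0 / prec
    mu_agg = s2_agg * np.sum(beta1 * mu / s2)
    return mu_agg, s2_agg
```

For example, GPoE with weights summing to one and equal local variances returns the plain average of the local means, while BCM-type weights subtract one prior precision per extra expert.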
Two extensions of RBCM, called GRBCM (Liu et al. 2018) and QBCM (He et al. 2019), have recently been proposed. These methods assume the existence of an informative “global expert” \({\mathcal {M}}_g :={\mathcal {M}}_1\), and that every expert can access, in addition to the subdataset assigned to it, the subdataset \({\mathcal {D}}_g = {\mathcal {D}}_1\) assigned to the global expert. These methods thus take account of the covariances between the global expert and the other experts, but ignore the covariance between non-global experts, assuming the conditional independence \({\mathcal {D}}_i \perp {\mathcal {D}}_j \mid z^*,{\mathcal {D}}_g\) for \(i,j=2,\ldots ,p\) and \(i\ne j\). Each expert \({\mathcal {M}}_i~(i=2,\ldots ,p)\) possesses the subdataset \({\mathcal {D}}_{+i}={\mathcal {D}}_g\cup {\mathcal {D}}_i\) and makes its own prediction with mean \(\mu _{+i}(\varvec{x}^*)\) and variance \(\sigma ^2_{+i}(\varvec{x}^*)\). The global expert also makes a prediction with mean \(\mu _g(\varvec{x}^*)\) and variance \(\sigma _g^2(\varvec{x}^*)\) by using only the global subdataset \({\mathcal {D}}_g\). The aggregated predictions are given by the following mean \(\mu _{\mathrm {grbcm/qbcm}}(\varvec{x}^*)\) and variance \(\sigma ^2_{\mathrm {grbcm/qbcm}}(\varvec{x}^*)\),
where the experts’ weights \(\{\beta _i\}_{i=2,\ldots ,p}\) are chosen in the same manner as RBCM. For GRBCM, the global subdataset \({\mathcal {D}}_g\) is randomly selected from the entire training samples, and for QBCM, \({\mathcal {D}}_g\) is selected as the subdataset with its centroid closest to the test point.
2.2.3 NPAE: prediction that uses covariance between all experts
NPAE (Rullière et al. 2018) is another aggregation method for GPR; it yields a “consistent” prediction by taking account of the covariance between experts, at the cost of computational complexity. Consistency is discussed in the next subsection. In NPAE, the aggregated prediction is obtained as follows:
where \(\varvec{\mu }_*= [\mu _1(\varvec{x}^*), \ldots , \mu _p(\varvec{x}^*)]^{\mathrm {T}} \in {\mathbb {R}}^p\), \(\varvec{k}_{{\mathcal {A}}*} = \mathrm {Cov}[\varvec{\mu }_*,\varvec{z}^*] \in {\mathbb {R}}^p\), and \(\varvec{K}_{{\mathcal {A}}*} = \mathrm {Cov}[\varvec{\mu }_*,\varvec{\mu }_*] \in {\mathbb {R}}^{p \times p}\). This formulation means that NPAE uses the covariance between all experts, that is, uses richer information than those aggregation methods described in Sect. 2.2.2.
Note that the original NPAE is restricted to test-point-wise processing and requires construction of a \(p \times p\) matrix and its inversion \(\varvec{K}_{{\mathcal {A}}*}^{-1}\) for each test point, so that its computational complexity is higher than that of other aggregation methods. Rullière et al. (2018) have also proposed an additional complexity reduction for NPAE by considering a hierarchical organization of the experts, in which case the resulting prediction becomes different from Eqs. (15) and (16).
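The displays of Eqs. (15) and (16) are not rendered in this version; the formulas below follow the definitions of \(\varvec{\mu }_*\), \(\varvec{k}_{{\mathcal {A}}*}\), and \(\varvec{K}_{{\mathcal {A}}*}\) given after them. A pointwise sketch (function name ours), where each expert's linear predictor is \(\varvec{w}_i = (\varvec{K}_{ii}+\sigma ^2\varvec{I})^{-1}\varvec{K}_{i*}\):

```python
import numpy as np

def npae_predict(subsets, xs, kernel, sigma2):
    """NPAE at a single test point x*:
       mu = k_A^T K_A^{-1} mu_*,  s2 = k(x*,x*) + sigma^2 - k_A^T K_A^{-1} k_A,
    with [k_A]_i = Cov[mu_i(x*), z*] and [K_A]_ij = Cov[mu_i(x*), mu_j(x*)]."""
    p = len(subsets)
    w, mu_loc = [], np.empty(p)
    for i, (Xi, zi) in enumerate(subsets):
        Ci = kernel(Xi, Xi) + sigma2 * np.eye(len(Xi))
        wi = np.linalg.solve(Ci, kernel(Xi, xs))[:, 0]   # (K_ii+s2 I)^{-1} K_{i*}
        w.append(wi)
        mu_loc[i] = wi @ zi                              # local mean mu_i(x*)
    kA = np.empty(p)
    KA = np.empty((p, p))
    for i, (Xi, _) in enumerate(subsets):
        kA[i] = w[i] @ kernel(Xi, xs)[:, 0]              # Cov[mu_i, z*]
        for j, (Xj, _) in enumerate(subsets):
            Cij = kernel(Xi, Xj)                         # Cov[z_i, z_j] ...
            if i == j:
                Cij = Cij + sigma2 * np.eye(len(Xi))     # ... plus noise block
            KA[i, j] = w[i] @ Cij @ w[j]
    sol = np.linalg.solve(KA, kA)
    mu = sol @ mu_loc
    s2 = kernel(xs, xs).item() + sigma2 - sol @ kA
    return mu, s2
```

With a single expert holding all the data, the aggregate collapses to the full-GPR prediction, which is a useful sanity check of the covariance bookkeeping.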
2.2.4 Consistency
Consistency is one of the important properties of aggregation methods; it means that the aggregated prediction converges to the value of the true underlying function as the number N of training points approaches infinity. It should be noted that the definition of consistency in this paper is such that an aggregation method for a finite number of test points is said to be consistent if the aggregated predictions provided by the method converge in probability to the values of the true underlying function at those test points, as \(N\rightarrow \infty\). In particular, this definition is different from, and much weaker than, consistency in functional spaces (van der Vaart and van Zanten 2011): consistency of a method in the above definition does not necessarily imply that the posterior on the functional space provided by the method converges to the Dirac measure at the true underlying function in the limit \(N\rightarrow \infty\).
NPAE in Sect. 2.2.3 is proven to be consistent in the noiseless case (\(\sigma ^2=0\)) (Bachoc et al. 2017) and in the noisy case (\(\sigma ^2\ne 0\)) (Bachoc et al. 2021). For the latter case, NPAE is consistent when the placement of the input points is not too irregular on \({\mathcal {Q}}\) or when the training data is divided by typical clustering algorithms, e.g., k-means. Consistency under noisy observations is also discussed in Liu et al. (2018), who conclude that GRBCM is consistent as long as the input points in the global subdataset are randomly selected on \({\mathcal {Q}}\). Bachoc et al. (2017, 2021) have also proven that, under some assumptions on the kernel, there are cases where consistency of PoE, GPoE, BCM, and RBCM fails, depending on the distribution of the input points.
3 NAE using inducing points
3.1 Reformulation of NPAE via sketching
In this subsection, we represent the predictions of NPAE (Eqs. (15) and (16)) in an alternative formulation, with the aim of extending it to a generalized method. As mentioned in Sect. 1, the high computational complexity of full GPR arises primarily from the necessity of inverting the Gram matrix \((\varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I})\), whose size equals the number N of training samples. Consequently, all the existing approximation schemes involve some idea for reducing the size of the matrix to be inverted; accordingly, when evaluating the conditional mean in these schemes, one projects \(\varvec{z}\) onto a low-dimensional subspace determined by the reduced-size matrix. In this paper, rather than considering a reduced-size matrix to be inverted, we focus on the latter projection procedure. More specifically, we consider a linear sketch \(\varvec{u}=\varvec{A}\varvec{z}\in {\mathbb {R}}^{N_u}\) of \(\varvec{z}\), where \(N_u\) is the dimension of the linear sketch \(\varvec{u}\) and \(\varvec{A}\in {\mathbb {R}}^{N_u\times N}\) is a sketching matrix, and study the problem of estimating the function values at the test points not on the basis of \(\varvec{z}\) but on the basis of its sketch \(\varvec{u}\). As detailed in the following, this approach has the advantages that it provides a novel interpretation of NPAE as well as its extensions, and that it allows us to provide a full characterization of the optimal sketching matrix.
In what follows we assume, without loss of generality, that the rows of the sketching matrix \(\varvec{A}\) are linearly independent, as adding linearly dependent rows does not add any useful information of \(\varvec{z}\) to its linear sketch \(\varvec{u}\). The joint probability of \(\{\varvec{u},\varvec{z}^*\}\) is
The conditional distribution of \(\varvec{z}^*\) given \(\varvec{u}\) is calculated as
where
The matrix to be inverted in the above formulae is of size \(N_u\times N_u\), implying that the time complexity can be significantly reduced by taking \(N_u\ll N\). It should be noted that this reduction is different from that of sparse GP methods and that in Calandriello et al. (2019), where the matrix to be inverted is \(\varvec{K}(\varvec{X},\varvec{X}_u)\varvec{K}(\varvec{X}_u,\varvec{X}_u)^{-1}\varvec{K}(\varvec{X}_u,\varvec{X})\) (\(\varvec{X}_u\in {\mathbb {R}}^{N_u\times D}\) is the set of inducing points), which has size \(N\times N\), and the reduction is granted via the Woodbury matrix identity.
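The sketched prediction (the mean/covariance pair referred to as Eqs. (19) and (20), whose displays are not rendered here) is Gaussian conditioning of \(\varvec{z}^*\) on \(\varvec{u}=\varvec{Az}\); only \(\mathrm {Cov}[\varvec{u},\varvec{u}]\) of size \(N_u\times N_u\) is inverted. A sketch with our own function name:

```python
import numpy as np

def sketched_predict(A, X, z, Xs, kernel, sigma2):
    """Condition z* on the linear sketch u = A z instead of on z itself:
       mu  = K(X*,X) A^T (A C A^T)^{-1} A z,           C = K(X,X)+sigma^2 I
       Sig = K(X*,X*)+sigma^2 I - K(X*,X) A^T (A C A^T)^{-1} A K(X,X*)."""
    u = A @ z
    C = kernel(X, X) + sigma2 * np.eye(len(X))
    Cu = A @ C @ A.T                      # Cov[u, u], size N_u x N_u only
    Ksu = kernel(Xs, X) @ A.T             # Cov[z*, u] (noise independent of u)
    sol = np.linalg.solve(Cu, Ksu.T)      # Cu^{-1} Cov[u, z*]
    mu = Ksu @ np.linalg.solve(Cu, u)
    Sig = kernel(Xs, Xs) + sigma2 * np.eye(len(Xs)) - Ksu @ sol
    return mu, Sig
```

Taking \(\varvec{A}=\varvec{I}\) recovers full GPR exactly, while any proper sketch can only enlarge the posterior covariance, since \(\varvec{u}\) carries no more information than \(\varvec{z}\).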
For sketching matrix \(\varvec{A}\) with a general structure, the following proposition holds.
Proposition 1
Assume row independence of the sketching matrix \(\varvec{A}\). The conditional distribution of \(\varvec{z}^*\) given the linear sketch \(\varvec{u}=\varvec{A}\varvec{z}\) depends on \(\varvec{A}\) only through its kernel \(\ker \varvec{A}\).
Proof
Under the row independence, the size of the sketching matrix \(\varvec{A}\) is \(N_u\times N\) with \(N_u=N-\mathop {\mathrm {dim}}\ker \varvec{A}\). For two matrices \(\varvec{A},\varvec{B}\in {\mathbb {R}}^{N_u\times N}\), \(\ker \varvec{A}=\ker \varvec{B}\) holds if and only if \(\varvec{A}\) and \(\varvec{B}\) are row equivalent, that is, there exists an invertible matrix \(\varvec{T}\in {\mathbb {R}}^{N_u\times N_u}\) satisfying \(\varvec{B}=\varvec{TA}\). The conditional distribution of \(\varvec{z}^*\) given a linear sketch \(\varvec{u}'=\varvec{B}\varvec{z}\) with \(\varvec{B}=\varvec{TA}\) is the same as that given the linear sketch \(\varvec{u}=\varvec{Az}\), as can be confirmed by the fact that replacing \(\varvec{A}\) with \(\varvec{B}=\varvec{TA}\) in Eqs. (19) and (20) with \(\varvec{z}\) fixed keeps \(\varvec{\mu }_{{\mathcal {A}}}(\varvec{X}^*)\) and \(\varvec{\Sigma }_{\mathcal {A}}(\varvec{X}^*)\) invariant. \(\square\)
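Proposition 1 admits a quick numerical confirmation (sizes and names below are arbitrary): row-equivalent sketching matrices \(\varvec{B}=\varvec{TA}\) give the same conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nu, Nt = 12, 3, 2
X = rng.standard_normal((N, 2))
Xs = rng.standard_normal((Nt, 2))
kern = lambda P, Q: np.exp(-0.5 * np.sum((P[:, None] - Q[None]) ** 2, -1))
C = kern(X, X) + 0.1 * np.eye(N)          # K(X,X) + sigma^2 I
Ksx = kern(Xs, X)
z = rng.standard_normal(N)

def cond_mean(A):
    # mu_A(X*) = K(X*,X) A^T (A C A^T)^{-1} A z
    return Ksx @ A.T @ np.linalg.solve(A @ C @ A.T, A @ z)

A = rng.standard_normal((Nu, N))
T = rng.standard_normal((Nu, Nu)) + 5.0 * np.eye(Nu)   # invertible w.h.p.
# ker(TA) = ker(A), so both sketches carry the same information about z:
assert np.allclose(cond_mean(A), cond_mean(T @ A))
```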
One expects that the conditional mean with sketching given in Eq. (19) would give a good approximation of the conditional mean in full GPR. Goodness of this approximation may be measured via the mean squared error \({\mathcal {E}}=\mathrm {E}[\Vert \varvec{\mu }_{\mathcal {A}}(\varvec{X}^*)-\varvec{\mu }_{\mathrm {full}}(\varvec{X}^*)\Vert ^2]\) between the conditional means with and without sketching. It is evaluated as
The next proposition provides a full characterization of the optimal sketching matrix in the sense of minimizing \({\mathcal {E}}\).
Proposition 2
For a given dimension \(N_u\) of the linear sketch \(\varvec{u}=\varvec{Az}\in {\mathbb {R}}^{N_u}\) of \(\varvec{z}\), the optimal sketching matrix \(\varvec{A}\in {\mathbb {R}}^{N_u\times N}\) in the sense of minimizing the mean squared error \({\mathcal {E}}\) is such that the \(N_u\) row vectors of \(\varvec{A}\) span the subspace spanned by the eigenvectors of \((\varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I})^{-1} \varvec{K}(\varvec{X},\varvec{X}^*)\varvec{K}(\varvec{X}^*,\varvec{X})\) corresponding to its \(N_u\) largest eigenvalues.
Proof
Since the first term on the righthand side of Eq. (21) is independent of \(\varvec{A}\), the optimal sketching matrix \(\varvec{A}\) minimizing the mean squared error \({\mathcal {E}}\) is the matrix that maximizes
The matrix \(\varvec{C} = \varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I}\) is symmetric and positive definite, so that it is diagonalized by an orthogonal matrix \(\varvec{V}\) as \(\varvec{C} = \varvec{V}^\mathrm {T}\varvec{\varLambda V}\), where \(\varvec{\varLambda }\) is a diagonal matrix whose diagonal elements are the eigenvalues of \(\varvec{C}\). Letting \(\varvec{C}^{1/2}=\varvec{V}^\mathrm {T}\varvec{\varLambda }^{1/2} \varvec{V}\) and \(\varvec{A}'=\varvec{TAC}^{1/2}\), where \(\varvec{T}\) is an invertible matrix corresponding to the Gram–Schmidt orthogonalization applied to the row vectors of \(\varvec{AC}^{1/2}\) so that \(\varvec{A}'(\varvec{A}')^\mathrm {T}=\varvec{I}\) holds, one has
The cost function \(J(\varvec{A})\) can then be written as
Therefore, the optimal sketching matrix is such that the \(N_u\) row vectors of \(\varvec{A}'=\varvec{TAC}^{1/2}\) span the subspace spanned by the eigenvectors of \(\varvec{C}^{-1/2}\varvec{K}(\varvec{X},\varvec{X}^*)\varvec{K}(\varvec{X}^*,\varvec{X}) \varvec{C}^{-1/2}\) corresponding to its \(N_u\) largest eigenvalues. This coincides with the statement of the proposition. \(\square\)
Let \(\lambda _1\ge \lambda _2\ge \ldots \ge \lambda _N\ge 0\) be the eigenvalues of \((\varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I})^{-1} \varvec{K}(\varvec{X},\varvec{X}^*)\varvec{K}(\varvec{X}^*,\varvec{X})\). Then, the mean squared error with the optimal sketching matrix is given by \({\mathcal {E}}=\sum _{i=N_u+1}^{N}\lambda _i\).
Since \(\mathrm {rank}\,\varvec{K}(\varvec{X},\varvec{X}^*)=\mathrm {rank}\,\varvec{K}(\varvec{X}^*,\varvec{X})\le \min \{N,N_T\}\), one has \(\lambda _i=0\) for \(i>\min \{N,N_T\}\). Therefore, in order to make the mean squared error \({\mathcal {E}}\) smaller, it would make no sense to take \(N_u>N_T\) if there is no restriction on the choice of the sketching matrix \(\varvec{A}\), because \(N_u=N_T\) already allows us to make \({\mathcal {E}}=0\) with the optimal choice of \(\varvec{A}\).
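Proposition 2 and the remark above can be checked numerically: with \(N_u=N_T\) and rows spanning the top eigenspace of \((\varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I})^{-1}\varvec{K}(\varvec{X},\varvec{X}^*)\varvec{K}(\varvec{X}^*,\varvec{X})\), the sketched mean reproduces the full-GPR mean exactly (\({\mathcal {E}}=0\)). A sketch under our own naming:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nt, D = 20, 3, 2
X = rng.standard_normal((N, D))
Xs = rng.standard_normal((Nt, D))
kern = lambda P, Q: np.exp(-0.5 * np.sum((P[:, None] - Q[None]) ** 2, -1))
C = kern(X, X) + 0.1 * np.eye(N)
Ksx = kern(Xs, X)

# Eigenvectors of C^{-1} K(X,X*) K(X*,X); its rank is <= N_T, so the
# sketch dimension N_u = N_T already captures all nonzero eigenvalues.
M = np.linalg.solve(C, Ksx.T @ Ksx)
eigval, eigvec = np.linalg.eig(M)
top = np.argsort(-eigval.real)[:Nt]
A = eigvec[:, top].real.T                 # rows span the optimal subspace

z = rng.standard_normal(N)
mu_full = Ksx @ np.linalg.solve(C, z)
mu_opt = Ksx @ A.T @ np.linalg.solve(A @ C @ A.T, A @ z)
assert np.allclose(mu_full, mu_opt, atol=1e-6)
```

As the surrounding text notes, building this optimal sketch needs a large eigenproblem involving \(\varvec{C}\), so it mainly serves as a benchmark rather than a practical scheme.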
The approach of optimizing the sketching matrix with a general structure, however, would require inversion of \(\varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I}\) and/or solving a (generalized) eigenvalue problem with a large fullrank matrix, so that its computational complexity should be high.
We next consider blockstructured sketching, in which one assumes \(\varvec{A}\) to have the following block structure:
where \(\varvec{A}_i\in {\mathbb {R}}^{n_u^{(i)}\times n^{(i)}}\) with \(\sum _{i=1}^pn_u^{(i)}=N_u\) and \(\sum _{i=1}^pn^{(i)}=N\). This blockstructured sketching allows us to perform a certain fraction of the calculations in a distributed manner, with p computing agents (i.e., experts). The prediction in Eq. (18) exactly coincides with that of NPAE when \(n_u^{(i)}=1\) and \(\varvec{A}_i\) is chosen as
for all experts. In this case, \(\varvec{A}\varvec{K}(\varvec{X},\varvec{X}^*)\) and \(\varvec{A}\left( \varvec{K}(\varvec{X},\varvec{X})+\sigma ^2\varvec{I}\right) \varvec{A}^{\mathrm {T}}\) in Eq. (17) reduce to \(\varvec{k}_{{\mathcal {A}}*}\) and \(\varvec{K}_{{\mathcal {A}}*}\), respectively. Thanks to this formulation, we can regard the choice of \(\varvec{A}_i\) in NPAE as a dimensionality reduction from the size \(n^{(i)}\) of the subdataset \({\mathcal {D}}_i\) to \(n_u^{(i)}=1\).
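The correspondence can be verified numerically: a block-diagonal \(\varvec{A}\) whose ith block is \(\varvec{K}(\varvec{x}^*,\varvec{X}_i)(\varvec{K}_{ii}+\sigma ^2\varvec{I})^{-1}\) makes the generic sketched mean coincide with the NPAE aggregate assembled expert by expert (notation and sizes below are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n0, p = 6, 2
Xi = [rng.standard_normal((n0, 2)) for _ in range(p)]
zi = [rng.standard_normal(n0) for _ in range(p)]
xs = rng.standard_normal((1, 2))
kern = lambda P, Q: np.exp(-0.5 * np.sum((P[:, None] - Q[None]) ** 2, -1))
s2 = 0.1

# Block-diagonal sketching matrix: A_i = K(x*, X_i)(K_ii + sigma^2 I)^{-1}
A = np.zeros((p, p * n0))
for i in range(p):
    Ci = kern(Xi[i], Xi[i]) + s2 * np.eye(n0)
    A[i, i * n0:(i + 1) * n0] = np.linalg.solve(Ci, kern(Xi[i], xs))[:, 0]

# Route 1: generic sketched conditional mean with the full N x N covariance
X = np.vstack(Xi)
z = np.concatenate(zi)
C = kern(X, X) + s2 * np.eye(p * n0)
mu_sketch = (kern(xs, X) @ A.T @ np.linalg.solve(A @ C @ A.T, A @ z)).item()

# Route 2: NPAE assembled from per-expert pieces mu_i, k_A, K_A
mu_loc = np.array([A[i, i * n0:(i + 1) * n0] @ zi[i] for i in range(p)])
kA = np.array([A[i, i * n0:(i + 1) * n0] @ kern(Xi[i], xs)[:, 0] for i in range(p)])
KA = np.array([[A[i, i * n0:(i + 1) * n0]
                @ (kern(Xi[i], Xi[j]) + (s2 * np.eye(n0) if i == j else 0.0))
                @ A[j, j * n0:(j + 1) * n0] for j in range(p)] for i in range(p)])
mu_npae = kA @ np.linalg.solve(KA, mu_loc)
assert np.isclose(mu_sketch, mu_npae)
```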
3.2 NAE-IP
The choice of the matrix \(\varvec{A}_i\) in Eqs. (19) and (20) is not limited to that of NPAE (Eq. (22)). Furthermore, \(\varvec{A}_i\) does not even have to depend on \(\varvec{X}^*\). We therefore propose a novel aggregation method on the basis of Eq. (18) and name it Nested Aggregation of Experts using Inducing Points (NAE-IP); it is not restricted to pointwise processing, that is, it allows simultaneous prediction at multiple test points. In the proposed method, we adopt the following choice of \(\varvec{A}_i\):
where \(\varvec{K}_{\eta _i i} = \varvec{K}(\bar{\varvec{X}}_i,\varvec{X}_i) \in {\mathbb {R}}^{n_u^{(i)} \times n^{(i)}}\) and \(\varvec{K}_{i \eta _i} = \varvec{K}_{\eta _i i}^\mathrm {T}\) for a collection \(\bar{\varvec{X}}_i \in {\mathbb {R}}^{n_u^{(i)}\times D}\) of \(n_u^{(i)}\) inducing points. It should be noticed that Eq. (23) is the same as Eq. (22) except that the test points \(\varvec{X}^*\) in the latter are replaced by the collection \(\bar{\varvec{X}}_i\) of inducing points. In other words, we consider a projection from the size \(n^{(i)}\) of the subdataset \({\mathcal {D}}_i\) to the number \(n_u^{(i)}\) of inducing points.
We show the prediction scheme using Eq. (23) in a way that follows NPAE. Assume that each expert \({\mathcal {M}}_i\) has a set of inducing points \(\bar{\varvec{X}}_i\) in addition to its own subdataset \({\mathcal {D}}_i\). There is no constraint on the choice of the inducing points but their total number \(N_u\) is assumed to be less than N for achieving dimensionality reduction. First, each expert defines an estimator \(\bar{\varvec{\mu }}_i\) on the basis of its observation \(\varvec{z}_i\) as
Second, the estimators are concatenated to form a random vector \(\bar{\varvec{\mu }} = [\bar{\varvec{\mu }}_1^{\mathrm {T}}, \ldots , \bar{\varvec{\mu }}_p^{\mathrm {T}}]^{\mathrm {T}} \in {\mathbb {R}}^{N_u}\). The covariances involving \(\bar{\varvec{\mu }}\) and \(\varvec{z}^*\) are calculated as
Finally, the predictive mean \(\bar{\varvec{\mu }}_{{\mathcal {A}}}\) and covariance \(\bar{\varvec{\Sigma }}_{{\mathcal {A}}}\) of NAE-IP are derived as
These formulae correspond to \(\varvec{\mu }_{{\mathcal {A}}}(\varvec{X}^*)\) and \(\varvec{\Sigma }_{{\mathcal {A}}}(\varvec{X}^*)\) in Eq. (18), respectively, when the choice of Eq. (23) for \(\varvec{A}_i\) is employed.
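The three steps above can be sketched end to end. The displays of Eqs. (24)–(28) are not rendered in this version; the code follows the definitions in the text: each expert compresses \(\varvec{z}_i\) through \(\varvec{A}_i=\varvec{K}_{\eta _i i}(\varvec{K}_{ii}+\sigma ^2\varvec{I})^{-1}\), and the concatenated \(\bar{\varvec{\mu }}\) is aggregated with its exact covariances (the \(+\sigma ^2\varvec{I}\) term in the covariance reflects that \(\varvec{z}^*\), not \(\varvec{f}^*\), is predicted; function name ours):

```python
import numpy as np

def naeip_predict(subsets, inducing, Xs, kernel, sigma2):
    """NAE-IP sketch. subsets: list of (X_i, z_i); inducing: list of Xbar_i.
    Step 1: mu_bar_i = K(Xbar_i, X_i)(K_ii + sigma^2 I)^{-1} z_i.
    Step 2: blockwise covariances bar k_A = Cov[bar mu, z*],
            bar K_A = Cov[bar mu, bar mu].
    Step 3: mu = bar k_A^T bar K_A^{-1} bar mu, and the matching covariance."""
    blocks = []
    for (Xi, zi), Xbar in zip(subsets, inducing):
        Ci = kernel(Xi, Xi) + sigma2 * np.eye(len(Xi))
        Ai = np.linalg.solve(Ci, kernel(Xi, Xbar)).T     # A_i, (n_u^i, n^i)
        blocks.append((Xi, Ai, Ai @ zi))
    mub = np.concatenate([b[2] for b in blocks])          # bar mu
    kbar = np.vstack([Ai @ kernel(Xi, Xs) for Xi, Ai, _ in blocks])
    Kbar = np.block([[Ai @ (kernel(Xi, Xj)
                            + (sigma2 * np.eye(len(Xi)) if i == j else 0.0)) @ Aj.T
                      for j, (Xj, Aj, _) in enumerate(blocks)]
                     for i, (Xi, Ai, _) in enumerate(blocks)])
    sol = np.linalg.solve(Kbar, kbar)                     # bar K_A^{-1} bar k_A
    mu = sol.T @ mub
    Sig = kernel(Xs, Xs) + sigma2 * np.eye(len(Xs)) - kbar.T @ sol
    return mu, Sig
```

With a single expert whose inducing points are the test points themselves (the BT option below), the prediction coincides with full GPR on that subdataset, which is a convenient correctness check.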
The following proposition holds and is used for proving the consistency of NAE-IP discussed later.
Proposition 3
\(\bar{\varvec{\mu }}_{{\mathcal {A}}}(\varvec{X}^*)\) in Eq. (27) is the best linear unbiased estimator of \(\varvec{f}^*\) on the basis of \(\bar{\varvec{\mu }}\), where the coefficient matrix \(\varvec{\phi } = [\varvec{\phi }_1^\mathrm {T}, \ldots , \varvec{\phi }_p^\mathrm {T}]^\mathrm {T}\) of \(\bar{\varvec{\mu }}_{{\mathcal {A}}}(\varvec{X}^*)=\varvec{\phi }^\mathrm {T} \bar{\varvec{\mu }} = \sum _{i=1}^p \varvec{\phi }_i^\mathrm {T} \bar{\varvec{\mu }}_i\) is given by \(\bar{\varvec{K}}_{{\mathcal {A}}}^{-1}\bar{\varvec{k}}_{{\mathcal {A}}}\). The mean squared error \(v(\varvec{X}^*) = \mathrm {E}\left[ \Vert \varvec{f}^*-\bar{\varvec{\mu }}_{{\mathcal {A}}}(\varvec{X}^*)\Vert ^2\right]\) of the estimator \(\bar{\varvec{\mu }}_{{\mathcal {A}}}(\varvec{X}^*)\) of \(\varvec{f}^*\) is given by \(\mathrm {Tr}\left[ \varvec{K}(\varvec{X}^*,\varvec{X}^*) - \bar{\varvec{k}}_{{\mathcal {A}}}^{\mathrm {T}} \bar{\varvec{K}}_{{\mathcal {A}}}^{-1} \bar{\varvec{k}}_{{\mathcal {A}}}\right]\).
Proof
Using Eqs. (27) and (28), the mean squared error of an estimator \(\varvec{\phi }^\mathrm {T}\bar{\varvec{\mu }}\) of \(\varvec{f}^*\), with \(\bar{\varvec{\mu }}\) defined above, is written as
The value of \(\hat{\varvec{\phi }}\) minimizing it is found by differentiation: \(-2\bar{\varvec{k}}_{{\mathcal {A}}}^\mathrm {T} + 2\hat{\varvec{\phi }}^\mathrm {T}\bar{\varvec{K}}_{{\mathcal {A}}} = {\mathbf {O}}\), which leads to \(\hat{\varvec{\phi }} = \bar{\varvec{K}}_{{\mathcal {A}}}^{-1}\bar{\varvec{k}}_{{\mathcal {A}}}\) and \(\bar{\varvec{\mu }}_{{\mathcal {A}}}(\varvec{X}^*)=\hat{\varvec{\phi }}^\mathrm {T}\bar{\varvec{\mu }}\). Then, \(v(\varvec{X}^*) = \mathrm {Tr}\left[ \varvec{K}(\varvec{X}^*,\varvec{X}^*) - 2\hat{\varvec{\phi }}^\mathrm {T}\bar{\varvec{k}}_{{\mathcal {A}}} + \hat{\varvec{\phi }}^\mathrm {T}\bar{\varvec{K}}_{{\mathcal {A}}}\hat{\varvec{\phi }}\right]\) and the statement follows. \(\square\)
One may perform prediction on the \(N_T\) test points in a pointwise manner, repeating prediction on a single point \(N_T\) times, or all at once, predicting for the \(N_T\) test points simultaneously. In view of the computational complexity to be discussed later, we consider a more general framework in which the \(N_T\) test points are partitioned into S subsets \(\varvec{X}_1^*,\ldots ,\varvec{X}_S^*\) with \(\bigcup _{s=1}^S\varvec{X}_s^*=\varvec{X}^*\) and \(\varvec{X}_s^*\cap \varvec{X}_{s'}^*=\emptyset\) for \(s\not =s'\), and the prediction is performed on each of these subsets separately. Assume now that the prediction is to be made on the target subset \(\varvec{X}_s^*\) of \(n_t^{(s)}\) test points. Then, there are five possible options for the inducing points \(\bar{\varvec{X}}_i\) of expert i in NAE-IP:

1.
\(\bar{\varvec{X}}_i=\varvec{X}^*_s\): Use the test points themselves as the inducing points. In this case \(n_u^{(i)}=n_t^{(s)}\). [Blockwise Test points (BT)]

2.
\(\bar{\varvec{X}}_i = \{\varvec{x}\in \varvec{X}^*_e \mid \varvec{X}^*_e \subset \varvec{X}^*, \varvec{X}^*_e\ne \varvec{X}^*_s\}\): Use a part of the test points, \(\varvec{X}^*_e\), which is not equal to the target subset \(\varvec{X}^*_s\). We can set \(n_u^{(i)}\) arbitrarily while satisfying \(n_u^{(i)}<N_T\). [Blockwise Test points and Other Test points (BT+OT), Arbitrary Test points (AT)]

3.
\(\bar{\varvec{X}}_i = \{\varvec{x}\in (\varvec{X}_o\cup \varvec{X}^*_s) \mid \varvec{X}_o\cap \varvec{X}^*=\emptyset , \varvec{X}_o\ne \emptyset \}\): Use both the target subset of test points, \(\varvec{X}^*_s\), and non-test points. We can set \(n_u^{(i)}\) arbitrarily while satisfying \(n_u^{(i)}> n_t^{(s)}\). [Blockwise Test points and Non-Test points (BT+NT)]

4.
\(\bar{\varvec{X}}_i =\{\varvec{x}\in \varvec{X}_o \mid \varvec{X}_o\cap \varvec{X}^*=\emptyset , \varvec{X}_o\ne \emptyset \}\): Use only non-test points as the inducing points. We can set \(n_u^{(i)}\) arbitrarily. [Non-Test points (NT)]

5.
\(\bar{\varvec{X}}_i =\{\varvec{x}\in (\varvec{X}_o\cup \varvec{X}^*_e) \mid \varvec{X}_o\cap \varvec{X}^*=\emptyset , \varvec{X}^*_e \subset \varvec{X}^*, \varvec{X}^*_e\ne \varvec{X}^*_s, \varvec{X}_o\ne \emptyset \}\): Use both a part of the test points, \(\varvec{X}^*_e\), which is not equal to the target subset \(\varvec{X}^*_s\), and non-test points. We can set \(n_u^{(i)}\) arbitrarily.
Option 1 is an extension of the original NPAE (Rullière et al. 2018) to multiple dimensions. As option 2, we can consider two natural choices, one that completely includes the test points themselves (BT+OT) and another that partially or never includes them (AT). BT+OT and BT+NT use higher-dimensional sketching at each expert by incorporating auxiliary points as its inducing points. The extension of the sketching dimensions employed in these options is expected to improve prediction accuracy, at the expense of increased computational complexity. The idea behind BT, BT+OT, and BT+NT is known as transduction (Quiñonero-Candela and Rasmussen 2005), which uses the test points of interest for prediction. Transduction could be beneficial because the test points should carry some information about the corresponding outputs. A drawback of these options is that the covariance matrix \(\bar{\varvec{K}}_{{\mathcal {A}}}\) depends on all or some test points in the target subset \(\varvec{X}_s^*\), so that one has to construct it, as well as to perform matrix inversion, for every target subset. On the other hand, AT and NT require the construction of \(\bar{\varvec{K}}_{{\mathcal {A}}}\) only once for all the target subsets of test points, as long as the inducing points are fixed. This brings about a significant reduction of the complexity. Option 5 might not yield a better prediction than BT+OT or BT+NT. Therefore we focus on BT, BT+OT, AT, BT+NT, and NT in the rest of this paper and expect an improvement of the predictive performance by using the extended dimensions \(n_u^{(i)}\ge n_t^{(s)}\).
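As a toy illustration (our own sketch, not the authors' code; the point sets, sizes, and random choices are hypothetical), the options can be written as selection rules drawing an expert's inducing points from the target subset \(\varvec{X}^*_s\), the other test points, and auxiliary non-test points:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D points; in the paper these are rows of D-column matrices.
X_test = np.linspace(-5.0, 5.0, 10)     # all N_T test points X*
target = np.arange(3)                   # indices of the target subset X*_s
X_aux = rng.uniform(-4.0, 4.0, 8)       # non-test points X_o, disjoint from X*

def inducing_points(option, n_u):
    """Select one expert's inducing points under the named option."""
    X_s = X_test[target]                      # target subset X*_s
    others = np.delete(X_test, target)        # X* without X*_s
    if option == "BT":         # option 1: the target test points themselves
        return X_s
    if option == "BT+OT":      # option 2, fully including X*_s
        return np.concatenate([X_s, others[:n_u - len(X_s)]])
    if option == "AT":         # option 2, arbitrary test points (n_u < N_T)
        return rng.choice(X_test, size=n_u, replace=False)
    if option == "BT+NT":      # option 3: X*_s plus non-test points
        return np.concatenate([X_s, X_aux[:n_u - len(X_s)]])
    if option == "NT":         # option 4: non-test points only
        return X_aux[:n_u]
    raise ValueError(option)

for opt in ["BT", "BT+OT", "AT", "BT+NT", "NT"]:
    n_u = len(target) if opt == "BT" else 5
    print(opt, inducing_points(opt, n_u))
```

Option 5 would combine the AT and NT branches; only BT ties the dimension to \(n_t^{(s)}\), while the other options leave \(n_u^{(i)}\) free within the stated constraints.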
3.3 Summary of proposed algorithm
In this subsection, we summarize the procedure of the proposed NAEIP. Definitions of symbols used for NAEIP are summarized in Table 2. We write the covariance matrix as \(\varvec{K}_{\varvec{\theta }}(\cdot ,\cdot )\) in order to make explicit its dependence on the hyperparameters \(\varvec{\theta }\) of the covariance function.
The whole training dataset is divided into p subsets using a clustering algorithm or at random. Each subdataset \((\varvec{X}_i,\varvec{z}_i)\) is assigned to an expert. To learn the hyperparameters, we adopt the factorized training process (Deisenroth and Ng 2015). We first specify one of the options BT, BT+OT, BT+NT, AT, and NT and construct the inducing points \(\{\bar{\varvec{X}}_i\}_{i=1}^p\) by Algorithm 1. We then perform NAEIP as shown in Algorithm 2.
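To make the flow concrete, the following is a minimal end-to-end sketch of NAEIP prediction with the BT option on a 1-D synthetic task (our own toy sizes, a random split instead of k-means, an SE kernel, and fixed hyperparameters; the factorized training step is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)

def k_se(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential covariance, as in Eq. (2) up to notation."""
    return sf2 * np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

# Toy 1-D data in the style of Sect. 4.2 (np.sinc is the normalized sinc).
N, p, sigma2 = 300, 3, 0.04
X = rng.uniform(-4.0, 4.0, N)
z = np.sinc(X) + rng.normal(0.0, np.sqrt(sigma2), N)
Xs = np.array([0.3, 1.7])              # target subset X*_s; BT sets X_bar_i = X*_s
m = len(Xs)

# Step 1: divide the training data among p experts (k-means in the paper).
parts = np.array_split(rng.permutation(N), p)

# Step 2: each expert maps its observations to an estimate at its inducing
# points via W_i = K(X_bar_i, X_i)(K(X_i, X_i) + sigma^2 I)^{-1}  (Eq. (24)).
experts, mu = [], []
for idx in parts:
    Xi, zi = X[idx], z[idx]
    Wi = k_se(Xs, Xi) @ np.linalg.inv(k_se(Xi, Xi) + sigma2 * np.eye(len(idx)))
    experts.append((Wi, Xi))
    mu.append(Wi @ zi)

# Step 3: aggregate with the covariances between experts' estimates
# (Eqs. (26)-(27)); the noise contributes only to the diagonal blocks.
K_A = np.zeros((p * m, p * m))
k_A = np.zeros((p * m, m))
for i, (Wi, Xi) in enumerate(experts):
    k_A[i*m:(i+1)*m] = Wi @ k_se(Xi, Xs)
    for j, (Wj, Xj) in enumerate(experts):
        Cij = k_se(Xi, Xj) + (sigma2 * np.eye(len(Xi)) if i == j else 0.0)
        K_A[i*m:(i+1)*m, j*m:(j+1)*m] = Wi @ Cij @ Wj.T

mu_agg = k_A.T @ np.linalg.solve(K_A, np.concatenate(mu))
print(mu_agg)   # aggregated prediction at Xs
```

With the AT or NT option, only the construction of `Xs`-dependent quantities changes: \(\bar{\varvec{K}}_{{\mathcal {A}}}\) would be built once from fixed inducing points and reused across target subsets.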
3.4 Consistency of NAEIP
We study consistency of NAEIP in the noisy case by extending the proof of consistency of NPAE in Bachoc et al. (2021). The following assumption is necessary only for NAEIP.
Assumption 4
For a test point \(\varvec{x}^*\in {\mathcal {Q}}\), estimation of \(f(\varvec{x}^*)\) is done by including the test point \(\varvec{x}^*\) as an inducing point of all experts.
For \(N\in {\mathbb {N}}\), let \(p_N\) be the number of experts, which may depend on N, and let \(\varvec{X}_1,\ldots ,\varvec{X}_{p_N}\) be the subdatasets, where \(\varvec{X}_i\) (\(i=1,\ldots ,p_N\)), being a subset of \(\varvec{X}\), is the subdataset assigned to expert i. We also require the following assumption on the subdatasets.
Assumption 5
There exists a sequence \(\{j_N(\varvec{x}^*)\}_{N\in {\mathbb {N}}}\) of indices \(j_N(\varvec{x}^*) \in \{1,\ldots ,p_N\}\) depending on a given test point \(\varvec{x}^*\) such that, for any \(\rho >0\), the number of the input points in \(\varvec{X}_{j_{N}(\varvec{x}^*)}\) lying within the \(\rho\)ball \(B_\rho (\varvec{x}^*)=\{\varvec{x}\in {\mathcal {Q}}:\Vert \varvec{x}\varvec{x}^*\Vert <\rho \}\) centered at \(\varvec{x}^*\) goes to infinity as \(N\rightarrow \infty\).
Under these assumptions, Proposition 6 below establishes consistency of NAEIP at a fixed test point \(\varvec{x}^*\in {\mathcal {Q}}\).
Proposition 6
Let \({\mathcal {Q}}\) be a compact nonempty subset of \({\mathbb {R}}^D\). Let f be a Gaussian process on \({\mathcal {Q}}\) with mean zero and continuous covariance function k. Let \(\{\varvec{x}_{N n}\}_{1\le n\le N, N\in {\mathbb {N}}}\) be a triangular array of input points, all of which lie in \({\mathcal {Q}}\). For \(N\in {\mathbb {N}}\), let \(\varvec{X} = [\varvec{x}_{N1}, \ldots , \varvec{x}_{NN}]^{\mathrm {T}}\), and let \(\bar{\varvec{\mu }}_1,\ldots , \bar{\varvec{\mu }}_{p_N}\) be the collection of \(p_N\) experts’ estimates defined in Eq. (24) on the basis of respective subdatasets \((\varvec{X}_1,\varvec{z}_1), \ldots , (\varvec{X}_{p_N},\varvec{z}_{p_N})\) of training points. Assume that each row of \(\varvec{X}\) is a row of at least one \(\varvec{X}_i\). For a test point \(\varvec{x}^*\in {\mathcal {Q}}\), assume further that \(\varvec{X}_1,\ldots ,\varvec{X}_{p_N}\) satisfy Assumption 5. For such a test point \(\varvec{x}^*\), under Assumption 4, we have \(\lim _{N\rightarrow \infty } \mathrm {E}\left[ \left( f(\varvec{x}^*) - {\bar{\mu }}_{{\mathcal {A}}}(\varvec{x}^*)\right) ^2\right] = 0,\)
where \({\bar{\mu }}_{{\mathcal {A}}}(\varvec{x}^*)\) is as in Eq. (27).
Proof
By Assumption 4, expert \(j_N(\varvec{x}^*)\) has the test point \(\varvec{x}^*\) as its inducing point, that is, \(\varvec{x}^*\) is a component of \(\bar{\varvec{X}}_{j_{N}(\varvec{x}^*)}\). Let \(a_j(\varvec{x}^*)\) be the index of the test point \(\varvec{x}^*\) in \(\bar{\varvec{X}}_{j_{N}(\varvec{x}^*)}\).
With these notations, since \({\bar{\mu }}_{{\mathcal {A}}}(\varvec{x}^*)\) is the linear combination of the elements of \(\bar{\varvec{\mu }}\) with minimal squared prediction error from Proposition 3, its squared prediction error is not larger than that of any single element of \(\bar{\varvec{\mu }}\). We hence have \(\mathrm {E}\left[ \left( f(\varvec{x}^*) - {\bar{\mu }}_{{\mathcal {A}}}(\varvec{x}^*)\right) ^2\right] \le \mathrm {E}\left[ \left( f(\varvec{x}^*) - [\bar{\varvec{\mu }}_{j_{N}(\varvec{x}^*)}(\varvec{x}^*)]_{a_j(\varvec{x}^*)}\right) ^2\right].\)
From Assumption 5, for any fixed \(\rho >0\), the number \(\iota\) of input points lying within \(B_\rho (\varvec{x}^*)\) goes to infinity as \(N\rightarrow \infty\). Let \(\varvec{x}_{j_{N}^{(1)}(\varvec{x}^*)}, \ldots , \varvec{x}_{j_{N}^{(\iota )}(\varvec{x}^*)}\) and \(z_{j_{N}^{(1)}(\varvec{x}^*)}, \ldots , z_{j_{N}^{(\iota )}(\varvec{x}^*)}\) be such input points and the corresponding observations, respectively. Since \([\bar{\varvec{\mu }}_{j_{N}(\varvec{x}^*)}(\varvec{x}^*)]_{a_j(\varvec{x}^*)} = \varvec{K}_{*j_{N}(\varvec{x}^*)}\left( \varvec{K}_{j_{N}(\varvec{x}^*) j_{N}(\varvec{x}^*)}+\sigma ^2\varvec{I}\right) ^{-1}\varvec{z}_{j_{N}(\varvec{x}^*)}\) is also the linear combination of the elements of \(\varvec{z}_{j_{N}(\varvec{x}^*)}\) with minimal squared prediction error, we have, similarly as above, \(\mathrm {E}\left[ \left( f(\varvec{x}^*) - [\bar{\varvec{\mu }}_{j_{N}(\varvec{x}^*)}(\varvec{x}^*)]_{a_j(\varvec{x}^*)}\right) ^2\right] \le \mathrm {E}\left[ \Bigl ( f(\varvec{x}^*) - \frac{1}{\iota }\sum _{l=1}^{\iota } z_{j_{N}^{(l)}(\varvec{x}^*)}\Bigr ) ^2\right].\)
From the independence of the noise process and the Cauchy-Schwarz inequality, the right-hand side can further be bounded as \(\mathrm {E}\left[ \Bigl ( f(\varvec{x}^*) - \frac{1}{\iota }\sum _{l=1}^{\iota } z_{j_{N}^{(l)}(\varvec{x}^*)}\Bigr ) ^2\right] \le \frac{1}{\iota }\sum _{l=1}^{\iota } \mathrm {E}\left[ \left( f(\varvec{x}^*) - f(\varvec{x}_{j_{N}^{(l)}(\varvec{x}^*)})\right) ^2\right] + \frac{\sigma ^2}{\iota }.\) (32)
The second term on the rightmost side of Eq. (32) converges to zero as \(\iota \rightarrow \infty\) because \(\sigma ^2\) is finite. From the continuity of k, as in Bachoc et al. (2021, Appendix E), one has \(\limsup _{\iota \rightarrow \infty } \frac{1}{\iota }\sum _{l=1}^{\iota } \mathrm {E}\left[ \left( f(\varvec{x}^*) - f(\varvec{x}_{j_{N}^{(l)}(\varvec{x}^*)})\right) ^2\right] \le \sup _{\varvec{x}\in B_\rho (\varvec{x}^*)} \left( k(\varvec{x}^*,\varvec{x}^*) - 2k(\varvec{x}^*,\varvec{x}) + k(\varvec{x},\varvec{x})\right) \rightarrow 0\) as \(\rho \rightarrow 0\).
Since the limit superior of this nonnegative sequence is 0, the sequence itself converges to 0. We therefore obtain the statement of the proposition. \(\square\)
Note that Proposition 6 proves convergence of \({\bar{\mu }}_{{\mathcal {A}}}(\varvec{x}^*)\) to \(f(\varvec{x}^*)\) in the mean square sense, which in turn implies convergence in probability, hence establishing the desired consistency.
Options BT, BT+OT, and BT+NT satisfy Assumption 4 by their definitions. Options AT and NT can also include the test point as an inducing point of all experts under additional assumptions.
Corollary 7
Under the conditions of Proposition 6, NAEIP-BT, BT+OT, and BT+NT have consistency.
Corollary 8
Under the conditions of Proposition 6 and the assumption that \(n_u^{(i)}\rightarrow N_T~(i=1,\ldots ,p_N)\) as \(N\rightarrow \infty\), NAEIP-AT has consistency.
Corollary 9
Under the conditions of Proposition 6 and the assumption that \(n_u^{(i)}\rightarrow \infty ~(i=1,\ldots ,p_N)\) as \(N\rightarrow \infty\), NAEIP-NT has consistency.
Proof
In NAEIP-NT, let \(\{\varvec{x}_{\eta _i u^{(i)}}\}_{1\le u^{(i)} \le n_u^{(i)},1\le i\le p}\) be an array of inducing points. For each \(\varvec{x}^*\in {\mathcal {Q}}\) and for \(i=1,\ldots ,p\), there exists at least one inducing point such that \(\lim _{n_u^{(i)}\rightarrow \infty } \min _{u^{(i)}} \Vert \varvec{x}_{\eta _i u^{(i)}}-\varvec{x}^*\Vert =0\) and then the estimation of \(f(\varvec{x}^*)\) satisfies Assumption 4. \(\square\)
Assumption 5 holds in typical division of the training data into subdatasets (Bachoc et al. 2021). When we adopt clustering algorithms such as k-means for the division, the condition holds under the assumption \(\min _{i=1,\ldots ,p_N} n^{(i)}\rightarrow \infty\) as \(N\rightarrow \infty\). In the case where the input points are distributed with strictly positive density on \({\mathcal {Q}}\), it also holds under the assumption that the number \(p_N\) of experts is o(N) as \(N\rightarrow \infty\).
3.5 Time complexity
The time complexity is one of the main concerns of approximate Gaussian process regression. We show the complexity of the conventional aggregation methods and of the methods proposed in this paper in Table 3 for sufficiently large N, where, for simplicity, we consider equal dimensions \(n_u^{(i)} = n_u\) and equally divided subdatasets \(n^{(i)} = N/p\) among the experts, and equally divided partitions \(n_t^{(s)}=N_T/S=n_t\) of the test points. GRBCM and QBCM require slightly higher complexity than PoE, GPoE, BCM, and RBCM because each expert uses the modified subdataset \({\mathcal {D}}_{+i}\). As mentioned in Sect. 2.2.3, the complexity of NPAE is higher than that of these methods but remains lower than that of full GPR when \(N_T<N\). Rullière et al. (2018) reported the complexity as \({\mathcal {O}}\left( \frac{N^3}{p^2}\right) + {\mathcal {O}}\left( N_T N^2\right)\), which is the same as that of NAEIP-BT, but the frequency of memory access can be reduced by a factor of \(n_t\) in the proposed methods. NAEIP-BT+OT and NAEIP-BT+NT require higher complexity than the original NPAE because they use \(n_u\) dimensions. On the other hand, the complexity of NAEIP-AT and NAEIP-NT can be lower than that of not only the original NPAE but also the other methods, depending on the choice of \(n_u\).
In the following, we briefly describe a proof sketch of the complexity of NAEIP. The main factors that affect the complexity are the calculation of the inverse \(\left( \varvec{K}_{ii}+\sigma ^2\varvec{I}\right) ^{-1}\) in Eq. (24), which is performed by every expert, and the construction of \(\bar{\varvec{K}}_{{\mathcal {A}}}\) in Eq. (26). The former takes \({\mathcal {O}}((n^{(i)})^3)\) for each of the p experts, resulting in \({\mathcal {O}}(p(n^{(i)})^3)={\mathcal {O}}(N^3/p^2)\). Next, each block of \(\bar{\varvec{K}}_{{\mathcal {A}}}\) includes the product of an \(n_u\times n^{(i)}\) matrix and an \(n^{(i)}\times n^{(i)}\) matrix, and there are \(p^2\) blocks in \(\bar{\varvec{K}}_{{\mathcal {A}}}\), which amounts to \({\mathcal {O}}(p^2n_u (n^{(i)})^2) = {\mathcal {O}}(n_uN^2)\) computation. The construction of \(\bar{\varvec{K}}_{{\mathcal {A}}}\) is repeated \(S=N_T/n_t\) times for BT, BT+OT, and BT+NT, resulting in \({\mathcal {O}}(N_T N^2)\) for BT and \({\mathcal {O}}(N_T n_uN^2/n_t)\) for the others, whereas it is performed only once for AT and NT, resulting in \({\mathcal {O}}(n_uN^2)\).
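The tally above can be reproduced with a short script (our own sketch with hypothetical names; constants and lower-order terms are dropped, as in the \({\mathcal {O}}(\cdot )\) analysis):

```python
def naeip_cost(option, N, N_T, p, n_u, n_t):
    """Dominant operation counts from the proof sketch in Sect. 3.5."""
    n_i = N // p                      # equal subdataset size n^(i) = N/p
    c_inv = p * n_i**3                # p inversions (K_ii + sigma^2 I)^{-1}
    c_KA = p**2 * n_u * n_i**2        # one construction of K_bar_A
    S = N_T // n_t                    # number of target subsets
    if option in ("BT", "BT+OT", "BT+NT"):
        return c_inv + S * c_KA       # K_bar_A rebuilt for every target subset
    return c_inv + c_KA               # AT, NT: K_bar_A built only once

# With n_u = n_t, BT reproduces O(N^3/p^2) + O(N_T N^2):
N, N_T, p = 10_000, 100, 20
print(naeip_cost("BT", N, N_T, p, n_u=20, n_t=20))   # 2.5e9 + 1.0e10
print(naeip_cost("NT", N, N_T, p, n_u=30, n_t=20))   # 2.5e9 + 3.0e9
```

The last two lines illustrate the discussion in the text: even with a larger sketching dimension \(n_u\), NT avoids the factor S and can be markedly cheaper than BT.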
4 Numerical experiments
4.1 Datasets and settings
We evaluated the predictive performance and the computing time of NAEIP in comparison with conventional methods. All the results were obtained by using the GPML MATLAB Code^{Footnote 3} (Rasmussen and Williams 2006). We measured total CPU time on a Linux computer with two CPUs (Intel Xeon Gold 5222, 4 cores, 3.8 GHz base clock) and 768 GB RAM. The datasets used in the numerical experiments are summarized below.

1D Synthetic data: Synthetic data generated by \(z_n = \mathrm {sinc}(x_n) + \epsilon _n,~n=1,\ldots ,N\), where the N training points are drawn uniformly from the interval \([-4,4]\), the \(N_T\) test points are drawn uniformly from \([-5,5]\), and \([\epsilon _1,\ldots ,\epsilon _N]^{\mathrm {T}}\sim {\mathcal {N}}({\mathbf {0}},0.04\varvec{I})\).

8D KIN8NM dataset^{Footnote 4} (Vanschoren et al. 2013): The data related to the forward dynamics of an 8-link robot arm. There are 8,192 samples in total. We randomly split them into 7,373 samples for training and 819 samples for testing.

21D SARCOS dataset^{Footnote 5} (Rasmussen and Williams 2006): The data related to the inverse dynamics problem of robot arms. There are 44,484 training samples and 4,449 test samples.

26D POL dataset^{Footnote 6}: Pole telecom dataset. There are 10,000 training samples and 5,000 test samples.
For each experimental condition, we performed 10 trials or more, each consisting of training and prediction procedures. We have used two performance measures, mean squared error (MSE):
and mean standardized log loss (MSLL):
where \({\hat{\mu }}(x_{t,i}), ~{\hat{\sigma }}^2(x_{t,i}), ~z(x_{t,i})\) are the predictive mean, the predictive variance, and the true value at the test point \(x_{t,i}\), respectively. MSLL is the mean of the pointwise negative log losses of the Gaussian models with mean \({\hat{\mu }}(x_{t,i})\) and variance \({\hat{\sigma }}^2(x_{t,i})\) given data \(\{z(x_{t,i})\}\), and takes into account the uncertainty of the predictions via the posterior variances. Lower MSE and MSLL indicate better prediction.
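In code, the two measures read as follows (a sketch with our own function names; the MSLL here is the plain mean negative Gaussian log loss described above, without the trivial-model standardization that some authors additionally apply):

```python
import numpy as np

def mse(mu_hat, z):
    """Mean squared error over the test points."""
    return np.mean((z - mu_hat) ** 2)

def msll(mu_hat, var_hat, z):
    """Mean of the pointwise negative Gaussian log losses."""
    return np.mean(0.5 * np.log(2 * np.pi * var_hat)
                   + (z - mu_hat) ** 2 / (2 * var_hat))

# Tiny example with made-up predictions and targets:
z = np.array([0.0, 1.0, -0.5])
mu_hat = np.array([0.1, 0.9, -0.4])
var_hat = np.array([0.04, 0.04, 0.04])
print(mse(mu_hat, z), msll(mu_hat, var_hat, z))
```

Unlike MSE, MSLL penalizes both overconfident predictions (small `var_hat` with large residuals) and needlessly wide predictive variances.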
For the proposed methods, we assumed equal dimensions \(n_u^{(i)}=n_u\) among all experts and almost equally divided partitions \(n_t^{(s)}=n_t\) of the test points. Other test points for NAEIP-BT+OT and arbitrary test points for NAEIP-AT were chosen randomly from the remaining test points and from the entire set of test points, respectively.
4.2 Synthetic data
For the synthetic data, the training data were divided into subdatasets by k-means. The SE function (Eq. (2)) was employed as the covariance function of the GP. The non-test points of each expert in NAEIP-BT+NT and NAEIP-NT were generated from the multivariate Gaussian distribution with the same mean and covariance as those of the subdataset assigned to that expert.
First, we investigate the influence of the dimension \(n_u\) and of the number \(n_t\) of test points processed at once on NAEIP's performance. Figures 1, 2 and 3 show the performance measures and the computing time versus \(n_t\) when \(N=10^4, N_T=100\), and \(p=20\). We set the dimension \(n_u\) to \(1.2\times n_t, ~1.5\times n_t, ~2\times n_t\), and \(4\times n_t\). When \(n_t\le 20\), larger \(n_t\) and \(n_u\) yielded better predictive performance, and the performance became stable in most cases. Higher dimensions required more computing time, and the computing time of BT, BT+OT, and BT+NT decreased as \(n_t\) increased. On the other hand, the computing time of AT and NT remained small regardless of \(n_t\). Note that, depending on the value of p or \(N_T\), a larger \(n_t\) (\(\le n_u\) in this paper) does not always yield shorter computing time, because the complexity required for evaluating the inverse \(\varvec{K}_{{\mathcal {A}}*}^{-1}\) is \({\mathcal {O}}(n_u^3 p^3)\), which can dominate the other factors.
Second, we compare the predictive performance and computing time of NAEIP with those of the conventional aggregation methods, PoE, GPoE, BCM, RBCM, GRBCM, QBCM, and the original NPAE, in 30 trials. We evaluated the cases \(N = 10^4, 5\times 10^4,\) and \(10^5\) with \(N_T=N\times 10^{-2}\) and \(p=N/500\), and chose \((n_t,n_u)=(50,75)\) for NAEIP except for NAEIP-BT. Figure 4a–c shows the performance measures versus N. Figure 4d summarizes the results of statistical significance testing for a difference between the best-performing method and each of the other 11 methods. We first checked the normality of the data via the one-sample two-sided Kolmogorov-Smirnov test (\(P<0.05\)), and then employed the paired t-test with Bonferroni multiple testing correction (\(P<0.05/11\)) for the statistical significance testing. The results for \(N=5\times 10^4\) and \(10^5\) reveal that the MSE of the proposed NAEIP was equal to or better than that of the other methods and that its MSLL was the lowest among the methods. This indicates that NAEIP can obtain not only better predictive means but also smaller predictive variances. Moreover, NAEIP took less time than the original NPAE, which might be ascribed to differences in memory access patterns. In particular, the computing time of NAEIP-BT, NAEIP-AT, and NAEIP-NT was shorter than that of QBCM. Figure 5 shows examples of the predictive means and 95% confidence intervals in the case of \(N=10^4\), \(n_t=20\), and \(n_u=30\) except for NAEIP-BT. Note that we omitted the results of PoE, GPoE, BCM, and RBCM from Figs. 4 and 5 because the differences between these and the other methods were significant and they showed computing time comparable with that of GRBCM.
4.3 Real data
For the SARCOS and POL datasets, the training data were divided by constrained k-means (Bradley et al. 2000), which can avoid generating weak submodels by setting the minimum size of clusters. We set the minimum size to 300 for the SARCOS dataset and 200 for the POL dataset. For the KIN8NM dataset, the training data were divided by k-means. The Matérn-5/2 function (Eq. (3)) was employed as the covariance function of the GP. For NAEIP-BT+NT and NAEIP-NT, we optimized each expert's non-test points by the method of Hensman et al. (2013) under the fully independent training conditional assumption (Snelson and Ghahramani 2005; Quiñonero-Candela and Rasmussen 2005). We used this assumption only in the optimization of the inducing points, not in the predictions. The minibatch size and the number of epochs were set to 100 and 10, respectively. It should be noted that the computational complexity of the optimization is \({\mathcal {O}}((n_u^{(i)})^3)\), so that it can be ignored as long as \(n_u^{(i)}\) is set smaller than \(n^{(i)}\).
We compare the predictive performance of NAEIP with that of the conventional methods in 10 trials using the real datasets. We set \((p,n_t)=(8,20)\) for the KIN8NM dataset, (72, 20) for the SARCOS dataset, and (25, 50) for the POL dataset. The dimension \(n_u\) for NAEIP except for NAEIP-BT was set to \(n_u = 1.5 \times n_t\). Table 4 summarizes the performance measures of the aggregation methods and the results of statistical significance testing for a difference between the best-performing method and each of the other 11 methods (Wilcoxon signed-rank test with Bonferroni multiple testing correction, \(P<0.05/11\)). For KIN8NM and POL, the extension of the sketching dimensions in NAEIP-BT+NT or NAEIP-BT+OT improved the performance compared with that of NAEIP-BT, and these methods achieved better performance than the other methods. On the other hand, for SARCOS, the performance of the conventional methods was the best. The worse performance of NAEIP-AT and NAEIP-NT might reflect the lack of consistency of these methods under this setting of the dimension \(n_u\).
5 Conclusion
We have introduced the idea of linear sketching into approximate Gaussian process regression and have proposed NAEIP (Nested Aggregation of Experts using Inducing Points) with five options for the choice of the inducing points. The proposed method inherits consistency under conditions on the number of inducing points that depend on the chosen option. We conducted numerical experiments with synthetic and real datasets. The experimental results show that NAEIP with the options that include test points as inducing points achieves lower prediction error than the conventional methods. Moreover, the computing time of NAEIP was shorter than that of the approximation methods QBCM and the original NPAE. Future work includes the optimization of a block-structured sketching matrix that projects observations onto a low-dimensional subspace.
Availability of data and material
All the real data used in this work are available on the web at https://www.openml.org/d/189, http://www.gaussianprocess.org/gpml/data/, and https://cims.nyu.edu/~andrewgw/pattern/.
Notes
In what follows we consider time complexity under the assumption that arithmetic with matrix elements has complexity \({\mathcal {O}}(1)\).
Many stationary kernels including the Matérn kernel satisfy the assumption, but the SE kernel does not.
References
Ashton, S.R.F., & Sollich, P. (2012). Learning curves for multitask Gaussian process regression. In Advances in Neural Information Processing Systems 25, pp 1393–1428.
Bachoc, F., Durrande, N., Rullière, D., & Chevalier, C. (2017). Some properties of nested Kriging predictors. arXiv preprint arXiv:1707.05708v1.
Bachoc, F., Durrande, N., Rullière, D., & Chevalier, C. (2021). Properties and comparison of some Kriging submodel aggregation. arXiv preprint arXiv:1707.05708v2.
Bauer, M., van der Wilk, M., & Rasmussen, C.E. (2016). Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, pp. 1533–1541.
Bradley, P.S., Bennett, K.P., & Demiriz, A. (2000). Constrained k-means clustering. Tech. rep., MSR-TR-2000-65, Microsoft Research, Redmond, WA.
Bui, T.D., & Turner, R.E. (2014). Tree-structured Gaussian process approximations. In Advances in Neural Information Processing Systems 27, pp. 2213–2221.
Calandriello, D., Carratino, L., Lazaric, A., Valko, M., & Rosasco, L. (2019). Gaussian process optimization with adaptive sketching: Scalable and no regret. Proceedings of the ThirtySecond Conference on Learning Theory, PMLR, 99, 533–557.
Cao, Y., & Fleet, D.J. (2014). Generalized product of experts for automatic and principled fusion of Gaussian process predictions. arXiv preprint arXiv:1410.7827.
Cressie, N. A. C. (1993). Statistics for spatial data (Revised ed.). New York: Wiley. https://doi.org/10.1002/9781119115151.
Deisenroth, M.P., & Ng, J.W. (2015). Distributed Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, pp. 1481–1490.
Deisenroth, M. P., Fox, D., & Rasmussen, C. E. (2015). Gaussian processes for dataefficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2), 408–423. https://doi.org/10.1109/TPAMI.2013.218
He, J., Qi, J., & Ramamohanarao, K. (2019). Query-aware Bayesian committee machine for scalable Gaussian process regression. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 208–216.
Hensman, J., Fusi, N., & Lawrence, N.D. (2013). Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pp. 282–290.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800. https://doi.org/10.1162/089976602760128018
Lawrence, N. (2005). Probabilistic nonlinear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783–1816.
Liberty, E. (2013). Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 581–588.
Liu, H., Cai, J., Wang, Y., & Ong, Y.S. (2018). Generalized robust Bayesian committee machine for large-scale Gaussian process regression. In Proceedings of the 35th International Conference on Machine Learning, PMLR, pp. 3131–3140.
Liu, H., Ong, Y. S., Shen, X., & Cai, J. (2020). When Gaussian process meets big data: A review of scalable GPs. IEEE Transactions on Neural Networks and Learning Systems, 31(11), 4405–4423. https://doi.org/10.1109/TNNLS.2019.2957109
Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939–1959.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Process for Machine Learning. Cambridge: MIT Press.
Rullière, D., Durrande, N., Bachoc, F., & Chevalier, C. (2018). Nested Kriging predictions for datasets with a large number of observations. Statistics and Computing, 28, 849–867.
Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pp. 1257–1264.
Stein, M. L. (1999). Interpolation of spatial data: Some theory for kriging. New York: Springer. https://doi.org/10.1007/9781461214946.
Tavassolipour, M., Motahari, S. A., & Shalmani, M. T. M. (2020). Learning of Gaussian processes in distributed and communication limited systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8), 1928–1941. https://doi.org/10.1109/TPAMI.2019.2906207
Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12(11), 2719–2741. https://doi.org/10.1162/089976600300014908
van der Vaart, A., & van Zanten, H. (2011). Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12, 2095–2119.
Vanschoren, J., van Rijn, J. N., Bischl, B., & Torgo, L. (2013). OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2), 49–60. https://doi.org/10.1145/2641190.2641198
Wilson, A., & Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, PMLR, pp. 1775–1784.
Woodruff, D. P. (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2), 1–157. https://doi.org/10.1561/0400000060
Funding
No funding was received for conducting this study.
Author information
Authors and Affiliations
Contributions
Conceptualization: Ayano Nakai-Kasai and Toshiyuki Tanaka; Methodology: Ayano Nakai-Kasai and Toshiyuki Tanaka; Formal analysis and investigation: Ayano Nakai-Kasai and Toshiyuki Tanaka; Writing - original draft preparation: Ayano Nakai-Kasai and Toshiyuki Tanaka; Writing - review and editing: Ayano Nakai-Kasai and Toshiyuki Tanaka; Funding acquisition: None; Resources: Ayano Nakai-Kasai and Toshiyuki Tanaka; Supervision: Ayano Nakai-Kasai and Toshiyuki Tanaka.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Code availability
The codes for NAEIP are available at the GitHub repository https://github.com/anakaik/NAEIP. Factorized training and predictions of the conventional methods are implemented following Liu et al. (2018) at https://github.com/LiuHaiTao01/GRBCM.
Editors: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.
Nakai-Kasai, A., Tanaka, T. Nested aggregation of experts using inducing points for approximated Gaussian process regression. Mach Learn 111, 1671–1694 (2022). https://doi.org/10.1007/s10994021061018