A similarity-based Bayesian mixture-of-experts model

We present a new nonparametric mixture-of-experts model for multivariate regression problems, inspired by the probabilistic k-nearest neighbors algorithm. Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point, yielding predictive distributions represented by Gaussian mixtures. Posterior inference is performed on the parameters of the mixture components as well as the distance metric using a mean-field variational Bayes algorithm accompanied by a stochastic gradient-based optimization procedure. The proposed method is especially advantageous in settings where inputs are of relatively high dimension in comparison to the data size, where input–output relationships are complex, and where predictive distributions may be skewed or multimodal. Computational studies on five datasets, of which two are synthetically generated, illustrate clear advantages of our mixture-of-experts method for high-dimensional inputs, outperforming competitor models both in terms of validation metrics and visual inspection.


Introduction
Large parts of contemporary research in machine learning presume the availability of massive datasets and focus on pure prediction. In problems where data is complex as well as scarce, however, and where it is of high importance to quantify uncertainties of predictions, rigorous probabilistic modeling is essential for achieving good generalization performance. The Bayesian formalism provides a natural setting for such modeling: by specifying a model p({y_n}_n, y_*) jointly over observed data {y_n}_n ⊂ Y and future data y_* ∈ Y, denoting by Y the space of outputs, obtaining the predictive distribution p(y_* | {y_n}_n) reduces to an integration problem. When using parametric models, prior assumptions on the parameters act as regularizers preventing overfitting, but the requirement of an explicit parameterization may be problematic when outputs depend on inputs through highly nonlinear relationships. Nonparametric alternatives, on the other hand, are associated with other issues: simple models such as Gaussian processes struggle with multivariate, skewed and multimodal outputs, and more sophisticated nonparametric Bayesian regression models based on Dirichlet processes require ad hoc constructions to incorporate input dependency and specially developed Markov chain Monte Carlo algorithms for posterior inference. To address these challenges, we propose in this paper a novel nonparametric mixture-of-experts model designed to be able to model any predictive density while being robust against small datasets and overfitting, equipped with a mean-field variational Bayes algorithm for posterior inference.
For modeling irregular densities, Gaussian mixtures are a popular choice, as this class of distributions is dense in the set of all probability distributions (Nguyen and McLachlan 2019) and can therefore approximate any target density arbitrarily well. Parameter inference for finite Gaussian mixtures is particularly tractable with algorithms such as expectation–maximization (EM), mean-field variational Bayes and Gibbs sampling (K Murphy 2012). In a supervised learning context, if a discriminative approach is preferred, an extension of mixture models are mixture-of-experts models, first introduced by Jacobs et al. (1991), which allow the mixture components (the experts) as well as the mixture weights (the gate) to depend on inputs x ∈ X from an input space X. A simple example is the model p(y | x, θ) = Σ_c softmax(Φx)_c N(y | µ_c, Σ_c), where θ = ({(µ_c, Σ_c)}_c, Φ), Φ is some transformation matrix and z is a categorical latent variable with probability masses given by the softmax of Φx. Note that here, only the gate is input-dependent, which is sometimes known as a gating network mixture-of-experts model (K Murphy and TB Murphy 2020); as a more complex example, we could also let the experts be input-dependent by, e.g., letting the µ_c and Σ_c be defined through parameterized transformations of x, known as a full mixture-of-experts model. More advanced designs have been investigated in the literature: for example, Jordan and Jacobs (1994) and Bishop and Svensén (2003) use a hierarchical gating network for increased model complexity, and Xu et al. (1995), Ueda and Ghahramani (2002), Baldacchino et al. (2016) and Ingrassia et al. (2012) use a gate which itself is Gaussian or mixture-Gaussian. While some of these (Jordan and Jacobs 1994, Xu et al. 1995) use EM for maximum-likelihood parameter inference, others (Baldacchino et al. 2016, Bishop and Svensén 2003, Ueda and Ghahramani 2002, Waterhouse et al. 1996) have employed variational Bayesian methods to approximate a posterior over the parameters, thereby reducing the imminent risk of overfitting and collapsing variances associated with maximum likelihood (Bishop 2006, Waterhouse et al. 1996). For an extensive review of various forms of mixture-of-experts models and methods for parameter inference, see Yuksel et al. (2012).
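As a concrete illustration of the gating network mixture-of-experts model above, the following sketch (our own helper names, using NumPy and SciPy, not code from any of the cited works) evaluates the density p(y | x, θ) = Σ_c softmax(Φx)_c N(y | µ_c, Σ_c):

```python
import numpy as np
from scipy.stats import multivariate_normal

def softmax(a):
    # numerically stable softmax of a 1-D array
    e = np.exp(a - np.max(a))
    return e / e.sum()

def gating_moe_density(y, x, Phi, mus, Sigmas):
    # gate: input-dependent mixture weights from the softmax of Phi @ x
    weights = softmax(Phi @ x)
    # experts: fixed Gaussian components N(mu_c, Sigma_c)
    return sum(w * multivariate_normal.pdf(y, mean=mu, cov=Sig)
               for w, mu, Sig in zip(weights, mus, Sigmas))
```

Since the weights sum to one, the result is a proper mixture density for any Φ; a full mixture-of-experts model would additionally make each (µ_c, Σ_c) a function of x.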
Common to the above models is that they are parametric, assuming conditional independence across the data pairs {(x_n, y_n)}_n ⊂ X × Y given some finite-dimensional parameter θ. For exchangeable data, such a model is guaranteed to exist without loss of generality in the infinite-data limit by de Finetti's theorem (Schervish 2012). However, due to the form of the posterior predictive distribution of the output y_* ∈ Y corresponding to a new input x_* ∈ X, written as

p(y_* | x_*, {(x_n, y_n)}_n) = ∫ p(y_* | x_*, θ) p(θ | {(x_n, y_n)}_n) dθ,

all knowledge learned from the training data must be encapsulated in the posterior p(θ | {(x_n, y_n)}_n). While this does not present any practical limitations for most types of datasets, problems typically arise when the relationships between inputs and outputs are complex, when the dimension dim X of the inputs is medium-high or high, and when data is scarce; in general, the former two situations force θ to also be of relatively high dimension which, in combination with the latter, renders inference especially hard. Nonparametric models, in contrast, characterized by their growth in complexity along with increasing data availability and often being based on comparing similarities between data points rather than learning a mapping from input to output, do not suffer from the same problem. Such similarity-based models are most easily understood from a predictive perspective: when inferring y_* given x_* and training data {(x_n, y_n)}_n, we first determine similarities between x_* and each x_n and base our prediction on these similarities and the y_n. The simplest example is the k-nearest neighbors algorithm, which has been extended to a probabilistic framework by Holmes and Adams (2002), who replace the pure averaging over the k nearest neighbors by a softmax weighting and sample from the posterior of k and the softmax sharpness parameter using Markov chain Monte Carlo (MCMC) methods.
The approach has since been revisited and refined by Cucala et al. (2009) and, subsequently, Friel and Pettitt (2011). For regression problems, we have similar models such as kernel machines (Bishop 2006) and the more probabilistically rigorous Gaussian processes (Rasmussen and Williams 2006). Although several extensions of Gaussian processes (Bonilla et al. 2008, Rasmussen and Ghahramani 2002) have been proposed to address, for example, their limitation to univariate outputs and Gaussian likelihoods, no combined method yet exists for handling arbitrary predictive distributions.
Other nonparametric Bayesian methods for conditional density estimation involve Dirichlet process mixture models (DPMMs) (Antoniak 1974, Ferguson 1973, Sethuraman 1994). Here, the Dirichlet process is used as a nonparametric prior in an infinite mixture model

y_n | θ_n ∼ p(y | θ_n), θ_n | G ∼ G, G ∼ DP(α, G_0),

where DP(α, G_0) is the Dirichlet process with concentration parameter α and base measure G_0 and p(y | θ) refers to some likelihood model. Several methods exist for using the DPMM in a regression setting, which have been reviewed by Müller et al. (2015). In particular, while simpler approaches are based on estimating the regression means and residuals separately, in a fully nonparametric regression method, the dependency on an input x may be introduced by replacing G with an x-indexed analogue G_x and instead placing the nonparametric prior over {G_x}_{x∈X}. Known as the dependent Dirichlet process (DDP) (MacEachern 1999), this model has been investigated with various choices of priors for {G_x}_{x∈X}: for example, De Iorio et al. (2004) proposed the ANOVA-DDP model in which locations in the stick-breaking representation of the Dirichlet process (Sethuraman 1994) are linear in the inputs, which was further developed into the linear DDP model (De Iorio et al. 2009); Dunson and Park (2008) proposed a kernelized stick-breaking process in which the breaking probabilities are input-dependent and similarity-based; Jara and Hanson (2011) instead proposed to replace the usual beta priors in the stick-breaking representation of the Dirichlet process by logistic-transformed Gaussian processes, deviating from the DPMM. Common drawbacks of these models, however, are that they require the modeling of an input-to-parameter process G_x, which is in general multivariate, that the construction of a prior over {G_x}_{x∈X} essentially amounts to yet another multivariate regression problem, and that the available posterior MCMC inference algorithms exploit the respective specific forms of the priors. For instance, if using a normal likelihood p(y | θ) = N(y | µ, Σ) with θ = (µ, Σ), it is not clear how one would go about specifying the (dim Y + (dim Y)²)-dimensional process G_x and its prior in such a way that the input-to-parameter relation is adequately modeled and posterior inference is feasible. As an alternative perspective, known as the conditional DPMM (Cruz-Marcelo et al. 2013, Müller et al. 1996), one may model the data generatively and fit a DPMM to the set {(x_n, y_n)}_n ⊂ X ⊕ Y of concatenated input-output pairs, from which conditional distributions are easily obtained due to the mixture-Gaussian form of the likelihood. While MCMC methods, which are known to scale poorly with model size and data dimensionality, are used for posterior inference in all previous cases, an alternative variational Bayes algorithm (Blei and Jordan 2006) is possible in this model. On the other hand, the generative approach of the conditional DPMM relies on modeling both the input and output spaces using a mixture model, which may be unsuitable if the dimensionality of the inputs is large compared to that of the outputs. Moreover, the conditional DPMM relies on an approximate independence assumption across the data (Müller et al. 2015), and the model is therefore not similarity-based in the sense that x_* is compared with each x_n when predicting y_*.
In this paper, we present a new similarity-based mixture-of-experts model combining the advantages of similarity-based nonparametric models with the flexibility of Gaussian mixtures, all while avoiding generative modeling and MCMC posterior inference. Specifically, the model uses multivariate Gaussian experts and a gating network comprising two layers of transitions when making predictions: a first layer where similarities between x_* and each x_n are used to compute transition probabilities, and a second layer where the conditional probabilities of belonging to each expert c given each y_n are computed. As such, in contrast to the aforementioned DPMM-based nonparametric Bayesian regression methods, in which input-to-parameter relations must be explicitly modeled by G_x and its prior, all input-output mappings are handled through the observed data pairs {(x_n, y_n)}_n. The model entails a predictive likelihood in the form of a multivariate Gaussian mixture, capable of modeling marginals as well as dependencies across the components of the output variable. Parameters are inferred using a mean-field variational Bayes algorithm, where local approximations of the likelihood are introduced to make variational posteriors analytically tractable and where the corresponding variational parameters may be updated using EM iterations. A gradient-based way of optimizing the similarity metric of the first transition is presented, which uses the reparameterization trick (Kingma and Welling 2014) to make stochastic gradients available through Monte Carlo sampling. The result is an algorithm generally suitable for datasets for which input-output relationships are intricate, the number of observations is relatively small, the output variables are multivariate, the predictive distributions may be multimodal as well as skewed, and for which it is desirable to estimate the predictive distributions as exactly as possible. The proposed method was tested on two artificially generated datasets, a dataset from medical physics containing dose statistics from radiation therapy treatments, the California housing dataset (Pace and Barry 1997) comprising geographic location and features of housing districts in California, and a dataset (Vågen et al. 2010) of soil functional properties and infrared spectroscopy measurements in Africa. In terms of both visual inspection and validation metrics, our model outperformed a conditional DPMM on all but the radiation therapy dataset, where it gave similar results, and performed consistently better than a Gaussian process baseline model. In particular, the experiments serve to illustrate the advantages of the proposed mixture-of-experts model on data with relatively high-dimensional inputs and irregularly shaped output distributions.
The remainder of this paper is organized as follows: the assumptions of the model are accounted for in Section 2, details of the mean-field variational Bayes algorithm are given in Section 3, results from the computational study are presented in Section 4 and discussed in Section 5, and derivations of some key facts and other algorithm details are given in Appendices A, B, C, D and E.

Model
Let {(x_n, y_n)}_{n=1}^N ⊂ X × Y be exchangeable pairs of random variables consisting of inputs x_n and outputs y_n, where X and Y are some appropriate vector spaces. Given observations of these data pairs, which are referred to as the training data, our main task is to be able to, for each new input x_* ∈ X, obtain the predictive distribution p(y_* | x_*, {(x_n, y_n)}_n) of the corresponding output y_* ∈ Y. All random variables are modeled on a common probability space with probability measure P, using p for densities or masses induced by P. We will frequently use p({z_i}_{i=1}^I) as shorthand for the joint density p(z_1, z_2, ..., z_I) and E_{p(z)} for the marginalization over a random variable z ∼ p(z).
Before introducing our model, it will be instructive to first formulate a conventional kernel regression model using latent variables.

Example 1 (Nadaraya–Watson estimator). Consider a model where the prediction of y_* given x_* may be written as a noisy linear combination of the training outputs,

y_* = Σ_n k(x_*, x_n) y_n / Σ_{n'} k(x_*, x_{n'}) + ε_*,   (1)

where ε_* ∼ N(0, σ²) is the regression error and k : X² → R is some positive definite kernel; this is sometimes known as the Nadaraya–Watson estimator (Bishop 2006). For simplicity, we will use the radial basis function kernel k(x, x') = exp(−‖x − x'‖²/(2ℓ²)) with standard Euclidean norm ‖·‖ and lengthscale parameter ℓ. We can now reformulate this using latent variables. In particular, let u_* be a random variable supported on {1, ..., N} such that

p(x_* | u_* = n) = N(x_* | x_n, ℓ² I),

I being the identity matrix, which may be interpreted as u_* "choosing" one of the observed data points n and setting up a normal distribution centered around x_n. If, conditional upon the choice u_* = n, the prediction of y_* is y_n with some normal uncertainty of variance σ², that is, y_* | u_* = n ∼ N(y_n, σ²), then, from the radial basis form of k, it is easy to see that this is equivalent to the conventional kernel regression formulation (1).
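The estimator of Example 1 is straightforward to implement directly; the sketch below computes the noise-free kernel-weighted prediction (the mean of the Nadaraya–Watson estimator) with an RBF kernel. Function names are illustrative:

```python
import numpy as np

def rbf_kernel(x, xp, lengthscale):
    # radial basis function kernel with Euclidean norm
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * lengthscale ** 2))

def nadaraya_watson(x_star, X, Y, lengthscale):
    # kernel-weighted average of training outputs
    w = np.array([rbf_kernel(x_star, x, lengthscale) for x in X])
    w = w / w.sum()
    return w @ Y
```

In the latent-variable reformulation, the normalized weights w are exactly the posterior probabilities p(u_* = n | x_*) under a uniform prior on u_*, which is what makes the two views equivalent.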

Main setup
To assemble our model, we will start from a predictive perspective. We use C normal distributions {N(µ_c, Σ_c)}_{c=1}^C as experts, where the means µ_c and covariances Σ_c are parameters not depending on the inputs. The gate then consists of two layers of transitions, one between the new input x_* and each training input x_n, and one between each training output y_n and each expert c, which are represented by, respectively, latent variables u_* and z_* supported on {1, ..., N} and {1, ..., C}. In particular, we let

p(u_* = n | x_*, θ) ∝ exp(−(x_* − x_n)^T Λ (x_* − x_n)/2),

where Λ is a precision matrix (as a generalization from ℓ^{−2} I in Example 1), and θ = (Λ, {(µ_c, Σ_c)}_c) for the collection of all parameters. Furthermore, we let

p(z_* = c | u_* = n, y_n, θ) = N(y_n | µ_c, Σ_c) / Σ_{c'} N(y_n | µ_{c'}, Σ_{c'})

for all n and c. The complete-data predictive likelihood is then written as

p(y_*, u_* = n, z_* = c | x_*, {(x_n, y_n)}_n, θ) = p(u_* = n | x_*, θ) p(z_* = c | u_* = n, y_n, θ) N(y_* | µ_c, Σ_c),

leading to the observed-data predictive likelihood

p(y_* | x_*, {(x_n, y_n)}_n, θ) = Σ_n Σ_c ( exp(−(x_* − x_n)^T Λ (x_* − x_n)/2) / Σ_{n'} exp(−(x_* − x_{n'})^T Λ (x_* − x_{n'})/2) · N(y_n | µ_c, Σ_c) / Σ_{c'} N(y_n | µ_{c'}, Σ_{c'}) ) N(y_* | µ_c, Σ_c),   (2)

which is a mixture of Gaussians. The first and second fractions within the parentheses of the above display are recognized as probabilities associated with the first and second transitions, respectively. The precision matrix Λ naturally induces a Mahalanobis-form distance metric d_X(x, x') = (x − x')^T Λ (x − x') on X, whereby the first transitions are obtained as a softmax transformation of the vector (−d_X(x_*, x_n)/2)_n. Note, in particular, that this is a gating network mixture-of-experts model, as the mean-covariance pairs (µ_c, Σ_c) are not input-dependent. This is a deliberate choice in light of our preference to avoid stipulating an input-to-parameter map, which is in general a high-dimensional and potentially complex transformation, as discussed in Section 1; instead, we let all input-output transitions occur through the observed data pairs {(x_n, y_n)}_n.
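The two-layer gate can be made concrete in a few lines: given a precision matrix Λ, expert parameters {(µ_c, Σ_c)}_c and the training pairs, the sketch below evaluates the resulting Gaussian-mixture predictive density at a query point. This is our illustrative reading of the model, not the paper's implementation (all names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def predictive_density(y_star, x_star, X, Y, Lam, mus, Sigmas):
    # first transition: softmax of -d_X(x_*, x_n)/2 over training inputs,
    # with Mahalanobis distance d_X induced by the precision matrix Lam
    d = np.array([(x_star - x) @ Lam @ (x_star - x) for x in X])
    a = -0.5 * d
    w1 = np.exp(a - a.max())
    w1 /= w1.sum()
    # second transition: responsibility of expert c for each training output y_n
    dens = np.array([[mvn.pdf(y, mean=mu, cov=S) for mu, S in zip(mus, Sigmas)]
                     for y in Y])                      # shape (N, C)
    w2 = dens / dens.sum(axis=1, keepdims=True)
    # collapse over n: mixture weight per expert, then mix the expert densities
    pi = w1 @ w2                                       # shape (C,)
    comp = np.array([mvn.pdf(y_star, mean=mu, cov=S) for mu, S in zip(mus, Sigmas)])
    return pi @ comp
```

Collapsing the double sum over (n, c) into the per-expert weights pi makes explicit that, for fixed θ, the predictive distribution is a C-component Gaussian mixture.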
Having established the predictive pipeline, we now face a similar problem as Holmes and Adams (2002) originally did in the probabilistic k-nearest neighbors context of constructing a joint complete-data likelihood p({(y_n, u_n, z_n)}_n | {x_n}_n, θ) given the training inputs. It is intuitively reasonable to require that the full conditionals p(y_n, u_n, z_n | x_n, {(x_{n'}, y_{n'})}_{n'≠n}, θ) have forms analogous to the predictive likelihood (2). As pointed out by Cucala et al. (2009), however, directly defining the joint likelihood as the product of all full conditionals as in Holmes and Adams (2002) will lead to an improperly normalized density. Noting that the asymmetry of the neighborhood relationships is the main cause of difficulties in specifying a well-defined model, Cucala et al. (2009) instead use a symmetrized version and view the training data as a Markov random field (K Murphy 2012), defining a joint distribution up to a normalizing constant corresponding to a Boltzmann-type model. The idea is reused in Friel and Pettitt (2011), who replace the arguably superficial symmetrized neighborhood relationships with distance metrics. Both papers present a pseudolikelihood approximation (Besag 1974) of the joint distribution as a possible approach, although focus is reserved for other methods of handling the intractable normalizing constant.
Historically, there has been considerable debate regarding the use of a proper joint likelihood versus a conditionally specified model, for which a joint likelihood may not even exist (Besag and Kooperberg 1995, Cucala et al. 2009). As argued by Besag (1974), the appeal of the latter choice is mainly due to the increased interpretability of specifying relationships directly through full conditionals, whereby the pseudolikelihood is always available as a viable surrogate for the joint likelihood. Thus, despite a certain lack of statistical coherence, this methodology allows the translation of any predictive pipeline, such as that of the k-nearest neighbors algorithm, into a probabilistic framework. Subsequent investigations of the model by Holmes and Adams (2002) gave different motivations for using this approach: Manocha and Girolami (2007) noticed the resemblance of parameter inference in such a model to leave-one-out cross-validation, whereas Yoon and Friel (2015) explained that the Boltzmann-type model by Cucala et al. (2009) would lead to difficulties due to the intractable normalizing constant. The interpretation of using such a pseudolikelihood is that dependencies between the data points beyond first order are ignored (Friel and Pettitt 2011), which may be a reasonable simplification in many applications.
Given this, we settle on the pseudolikelihood approach and proceed to define the full conditionals of our model as p(y_n, u_n, z_n | x_n, {(x_{n'}, y_{n'})}_{n'≠n}, θ), recognized as the analogue of (2), where the latent variables u_n, z_n are now supported on {1, ..., N} \ {n} and {1, ..., C}, respectively. The joint complete-data pseudolikelihood is then written as

p_pseudo({(y_n, u_n, z_n)}_n | {x_n}_n, θ) = Π_n p(y_n, u_n, z_n | x_n, {(x_{n'}, y_{n'})}_{n'≠n}, θ);

while it is important to remember that this is not a proper joint likelihood, we will drop the subscript in the following for convenience of notation. To complete our Bayesian model, we assume that the precision matrix Λ and the mean-covariance pairs (µ_c, Σ_c) are a priori independent with Wishart and normal-inverse-Wishart distributions, respectively. This is written as

Λ ∼ W(Λ_0, η_0), (µ_c, Σ_c) ∼ NIW(µ_0, κ_0, Σ_0, ν_0),

where Λ_0, η_0, µ_0, κ_0, Σ_0 and ν_0 are hyperparameters. In order to ensure that the distributions are well-defined, we must require that Λ_0 and Σ_0 be positive definite, that η_0 > dim X − 1 and ν_0 > dim Y − 1, and that κ_0 > 0 (Anderson 1984). The discussion of general-purpose default methods to set hyperparameters is deferred to Appendix E; in particular, the number C of experts should generally be high to allow flexibility in the set of possible predictive distributions.
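For concreteness, a single joint draw from the independent Wishart and normal-inverse-Wishart priors can be sketched with SciPy. Identifying SciPy's `scale` arguments with Λ_0 and Σ_0 is our assumption about the intended parameterization, and the helper name is ours:

```python
import numpy as np
from scipy.stats import invwishart, wishart

def sample_prior(Lam0, eta0, mu0, kappa0, Sigma0, nu0, seed=None):
    # Lam ~ W(Lam0, eta0): precision matrix of the first gate transition
    rs = np.random.RandomState(seed)
    Lam = wishart.rvs(df=eta0, scale=Lam0, random_state=rs)
    # (mu, Sigma) ~ NIW(mu0, kappa0, Sigma0, nu0) for one expert:
    # Sigma ~ IW(Sigma0, nu0), then mu | Sigma ~ N(mu0, Sigma / kappa0)
    Sigma = invwishart.rvs(df=nu0, scale=Sigma0, random_state=rs)
    mu = rs.multivariate_normal(mu0, Sigma / kappa0)
    return Lam, mu, Sigma
```

The degrees-of-freedom constraints η_0 > dim X − 1 and ν_0 > dim Y − 1 from the text are exactly what SciPy requires for these draws to be well-defined.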

Mean-field variational Bayes
Motivated by the Gaussian mixture form of the full conditionals, we will base our parameter inference on a mean-field variational Bayes algorithm. In general, given some observed variables y and hidden variables z, the main idea of variational inference is to approximate an intractable posterior p(z | y) by a variational posterior q(z) from some family Q by solving

q^* = argmax_{q ∈ Q} ELBO(q),

where the evidence lower bound (ELBO) is defined as

ELBO(q) = E_{q(z)}(log p(y, z) − log q(z)),

its name originating from the fact that the bound log p(y) ≥ ELBO(q) always holds. By mean-field variational inference, we mean using as Q the family of all distributions on the fully factorized form q(z) = Π_i q(z_i), where we have block-decomposed z as z = (z_i)_i. In this case, one can show that the optimal distribution q^*(z_i) for each component i has log-density equal up to an additive constant to the log-joint with {z_{i'}}_{i'≠i} marginalized out; that is,

log q^*(z_i) = E_{q({z_{i'}}_{i'≠i})}(log p(y, z)) + const.

This is used to construct a coordinate ascent algorithm, where a sequence {q_l}_{l≥0} of distributions is devised to approximate q^* increasingly better and updates are performed according to the above by, at each l, iterating through the components of z and taking expectations always with respect to the latest versions of the q(z_i). For a detailed review of variational inference methods, see Blei et al. (2017).
In our case, counting both the latent variables u_n, z_n and the parameters θ as hidden variables, we propose the factorization

q({(u_n, z_n)}_n, θ) = Π_n q(u_n, z_n) · q(Λ) Π_c q(µ_c, Σ_c).

An issue with the terms containing logarithms of summed normal densities in (4) is that they do not lead to tractable variational posteriors. Xu et al. (1995) proposed a generative model, which was also used by Baldacchino et al. (2016), where each p(x_n) was on the form of a Gaussian mixture, leading to analytically solvable variational posteriors at the cost of introducing additional parameters to the model and thus also accompanying hyperparameters. Trying to avoid this, we settle for a slightly different approach and instead approximate the log-sum-exp function LSE by the linearization

LSE(ξ) ≈ LSE(η) + softmax(η)^T (ξ − η),

where ξ = (ξ_i)_i and η = (η_i)_i. In particular, we let s = (s_nc)_{n,c} and t = (t_{nn'})_{n,n'≠n} be the softmax-transformed locations of the linearizations applied, for each n, to the second- and first-transition log-sum-exp terms, respectively. Here, s_n = (s_nc)_c and t_n = (t_{nn'})_{n'≠n}, and ≈_c denotes approximate equality up to a constant not depending on θ. We regard s and t as variational parameters; the presentation of how they may be optimized is deferred to Section 3.2. It will be shown that, provided that the values of s and t are appropriate, our model will become conditionally conjugate under the assumptions above, resulting in variational posteriors from the same parametric families as our priors.
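Because LSE is convex and its gradient at η is softmax(η), the linearization is in fact a global lower bound touching LSE at η, which is what makes it safe as a local variational approximation. The following illustrative check (not part of the paper's code) confirms this numerically:

```python
import numpy as np

def lse(a):
    # numerically stable log-sum-exp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def lse_linearization(xi, eta):
    # first-order expansion of LSE at eta; grad LSE(eta) = softmax(eta),
    # and convexity makes this tangent plane a global lower bound
    s = np.exp(eta - eta.max())
    s /= s.sum()
    return lse(eta) + s @ (xi - eta)
```

Equality holds exactly at ξ = η, so optimizing the linearization point (the role of s and t above) tightens the bound around the current iterate.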
Letting {q_l}_{l≥0} be a sequence of variational posterior approximations, we use the assumed factorization to obtain update equations for the factors q_l(u_n, z_n), q_l(µ_c, Σ_c) and q_l(Λ) in turn. These will be referred to as the E-step, the M-step I and the M-step II, respectively, due to the strong resemblance with the standard EM algorithm for fitting Gaussian mixtures (K Murphy 2012) and the expectation-conditional maximization algorithm (Meng and Rubin 1993). Let also s^l, t^l be the values of the variational parameters s and t at iteration l, and let r^l_{nc} and ω^l_{c,nn'} denote the associated responsibilities for each c, n and n'. We can then summarize the details of the updates in Propositions 1, 2 and 3 below, for which derivations are given in Appendix A. The overall algorithm, including the steps described in Section 3.2, is outlined in Algorithm 1.
Proposition 1 (Variational posterior, E-step). For all l, we have the decomposition of q_l(u_n, z_n) and updates given by the expressions derived in Appendix A, where ψ is the digamma function.

Remark. The values of ω^l_{c,nn'} may be found using the fact that Σ_c Σ_{n'≠n} ω^l_{c,nn'} = 1.

Proposition 2 (Variational posterior, M-step I). For all l, we have the decomposition of q_l(µ_c, Σ_c) and updates given by the expressions derived in Appendix A. Moreover, each Σ^l_c is positive definite if r^l_{nc} > 0 for all n.

Proposition 3 (Variational posterior, M-step II). For all l, we have the decomposition of q_l(Λ), with updates given by the expressions derived in Appendix A.

Linear programming for M-step I
Proposition 2 provides a sufficient condition for all scale matrices Σ^l_c to be positive definite, which is required for the inverse-Wishart distribution to be well-defined. We shall combine this with the local variational approximation method outlined in Bishop (2006) and Watanabe et al. (2011) to set up an optimization problem to be solved at each iteration for updating each row s_n of the second-transition variational parameters s.
Recall, in particular, that the purpose of introducing s and the associated linearization (6) is to render variational posteriors analytically tractable through a mean-field approximation. We would like to optimize the accuracy of the approximation by maximizing the linearized complete-data pseudolikelihood p({(y_n, u_n, z_n)}_n | {x_n}_n, θ), which depends on s. Although the exact solution is trivial with θ fixed, since θ is uncertain, we instead resort to marginalizing out θ according to the latest variational posterior q_l(θ) and maximize the corresponding log-pseudoevidence

log E_{q_l(θ)}(p({(y_n, u_n, z_n)}_n | {x_n}_n, θ)).

The form of this objective motivates the use of an EM algorithm. These EM updates will be interleaved with the variational Bayes updates, similarly to the scheme in Bishop and Svensén (2003), using the previous iterate as initialization and performing a single iteration at each l.
In addition, we will constrain each update so that the M-step is taken such that the condition r^{l+1}_{nc} > 0 from Proposition 2 holds. Removing terms not depending on s and noting that the problem separates in such a way that each s_n can be optimized separately, we can devise an update equation where s^{l+1} is set to the solution of the linear program (8). A detailed motivation for (8) is given in Appendix B. Note that the constraints of (8) are always feasible.

Stochastic gradient method for M-step II
For the first transition variational parameters t, note that unlike Proposition 2, Proposition 3 does not come with a guarantee that the double sum added to Λ_0^{−1} is positive definite; thus, the Wishart distribution constituting the variational posterior of Λ may not be well-defined. A solution is to use the unconstrained analogue (9) of the optimization problem (8), where, if the resulting inverse scale matrix Λ^{−(l+1)} is not positive definite, its eigenvalues will be thresholded to be at least some small positive number. This is, however, somewhat unsatisfactory in that the eigenvalue thresholding step would become another layer of indirection besides the introduction of t. An alternative method would be to instead specify beforehand, motivated by Proposition 3, that q_l(Λ) = W(Λ | Λ^l, η_0) and find the variational parameter Λ^l by maximizing ELBO(q_l) without the linearization of the log-sum-exp expression associated with the first transition, according to the generic approach described in Blei et al. (2017). While it is always possible to approximate analytically intractable expectations by Monte Carlo methods, since the ELBO is an expectation taken with respect to q_l itself, one will in general need additional manipulations to obtain the corresponding estimate of the gradient. Various methods for such situations exist, examples including the score function gradient, which works for very general cases and can be used in conjunction with variance reduction techniques such as Rao-Blackwellization and control variates (Ranganath et al. 2014), and the reparameterization trick (Kingma and Welling 2014). Since q_l(Λ) depends on Λ^l continuously, we will use the reparameterization trick, which is understood to be the preferred method in such cases due to its lower variance compared to that of the score function gradient, even when using variance reduction (Kucukelbir et al. 2017). With the Cholesky factorization Λ^l = L^l L^{lT}, we will reparameterize according to the Bartlett decomposition (Anderson 1984), which states that Λ ∼ W(Λ^l, η_0) may be written as Λ = (L^l A)(L^l A)^T, where A = (A_{ii'})_{i,i'=1}^{dim X} is a lower triangular random matrix with all independent entries such that A_{ii}² ∼ χ²_{η_0+1−i} and A_{ii'} ∼ N(0, 1) for i > i'. Extracting all terms depending on Λ^l in the ELBO (see Appendix C) and using the Bartlett decomposition, we get the following result:

Proposition 4 (Stochastic gate optimization). If q_l(Λ) = W(Λ | Λ^l, η_0), where Λ^l = L^l L^{lT} is the Cholesky factorization of Λ^l, updates are given by optimizing the extracted ELBO terms over L.

We will refer to the matrix in the parentheses between L^T and L as the center matrix. Note that since p(A) does not depend on L, a standard Monte Carlo approximation allows both the function value and its gradient to be estimated. This information may then be used in any gradient-based optimization algorithm to find the update L^{l+1} and thus Λ^{l+1}, serving as an alternative to Proposition 3. In this case, it is straightforward to derive an analogue to Proposition 1 by replacing the terms corresponding to the first transitions with their equivalents from the Monte Carlo estimation. Optionally, as evaluation of the involved quantities is relatively computationally expensive, we may reduce the variance of the standard Monte Carlo estimator by replacing it with an unbiased estimator with control variates introduced; for details, see Appendix D.1. Moreover, the center matrix, containing a double sum of outer vector products, may be evaluated in the alternative form displayed in Appendix D.2.
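The Bartlett-decomposition reparameterization can be sketched as a sampler in which all randomness is isolated in the triangular factor A, so that a draw is a deterministic, differentiable function of the Cholesky factor L; this is exactly what makes reparameterized gradients with respect to L possible. An illustrative sketch with our own naming:

```python
import numpy as np

def sample_wishart_bartlett(L, eta, rng):
    # draw Lam = (L A)(L A)^T ~ W(L L^T, eta) via the Bartlett decomposition
    p = L.shape[0]
    A = np.zeros((p, p))
    for i in range(p):
        # diagonal: A_ii^2 ~ chi2 with eta + 1 - (i + 1) = eta - i degrees
        # of freedom (0-based i); requires eta > p - 1
        A[i, i] = np.sqrt(rng.chisquare(eta - i))
        # strict lower triangle: independent standard normals
        A[i, :i] = rng.standard_normal(i)
    LA = L @ A
    return LA @ LA.T
```

Averaging many such draws recovers the Wishart mean η L Lᵀ, which can serve as a sanity check of the parameterization; in an autodiff framework the same construction yields gradients with respect to L because only A is random.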
Algorithm 1: Mean-field variational Bayes for the proposed model

for l = 0, 1, ..., l_max do
    Optimize variational lower bound parameters s^l according to (8) and, if not using stochastic gate optimization, t^l according to (9)
    E-step: update q_l(u_n, z_n) according to Proposition 1
    M-step I: update q_l(µ_c, Σ_c) according to Proposition 2
    M-step II: update q_l(Λ) according to Proposition 3 or 4
end
q(θ) ← q_{l_max}(θ)

Posterior predictive distribution
Having obtained the variational posterior approximation q(θ), prediction of the corresponding output y_* given an out-of-sample input x_* is determined by the posterior predictive distribution

p(y_* | x_*, {(x_n, y_n)}_n) ≈ E_{q(θ)}(p(y_* | x_*, {(x_n, y_n)}_n, θ)).

As the marginalization integrals with respect to q(Λ) and q(µ_c, Σ_c) are intractable, we will resort to Monte Carlo approximations. Note, in particular, that one may sample from q(Λ) and each q(µ_c, Σ_c) independently due to the assumed factorization. Letting {Λ^{(k)}}_{k=1}^{K_e} and {(µ_c^{(k)}, Σ_c^{(k)})}_{k=1}^{K_e} for each c be independent samples from q(Λ) and each of the q(µ_c, Σ_c), respectively, we can write

p(y_* | x_*, {(x_n, y_n)}_n) ≈ (1/K_e) Σ_{k=1}^{K_e} p(y_* | x_*, {(x_n, y_n)}_n, θ^{(k)}), where θ^{(k)} = (Λ^{(k)}, {(µ_c^{(k)}, Σ_c^{(k)})}_c).

This results in a K_e C-component Gaussian mixture as an approximate posterior predictive distribution.
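The Monte Carlo approximation simply averages the C-component mixture densities induced by independent posterior draws. In this illustrative sketch (names ours), the per-draw mixture weights are assumed to have been computed already from the two gate transitions using the sampled Λ:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def mc_posterior_predictive(y_star, param_samples):
    # param_samples: list of (pi, mus, Sigmas) triples, one per posterior
    # draw; pi holds the C mixture weights for that draw. Averaging the
    # K_e mixture densities yields a (K_e * C)-component Gaussian mixture.
    total = 0.0
    for pi, mus, Sigmas in param_samples:
        total += sum(w * mvn.pdf(y_star, mean=mu, cov=S)
                     for w, mu, S in zip(pi, mus, Sigmas))
    return total / len(param_samples)
```

Because each summand is itself a normalized mixture density, the average is again a normalized density, with equal weight 1/K_e on each draw's mixture.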

Computational study
To illustrate the advantages of the proposed model, we evaluate its performance when applied to five datasets. The first two are artificially generated, intended to demonstrate the ability of the mixture-of-experts method to adapt to complex dependencies between the inputs and their corresponding predictive distributions and to capture characteristics such as skewness and multimodality. The third is a dataset of dose statistics of radiation therapy treatment plans for postoperative prostate cancer, which is small, noisy and relatively high-dimensional, intended to showcase the stability of the estimated uncertainties even in such cases. The fourth dataset is the California housing dataset (Pace and Barry 1997), consisting of features for California housing districts along with geographic location; here, focusing on multimodal outputs, the task is to predict the latitude-longitude pairs for each district given its other features. The fifth dataset consists of infrared spectroscopy measurements for prediction of soil functional properties at different locations in Africa, aimed at illustrating the case of probabilistically predicting more than two output variables.
In the following computational study, attention was restricted to the stochastic gradient method described by Proposition 4 in Section 3.2.2 for fitting the variational parameters associated with the gate, as the projection method described in Section 3.2.2 was found to be insufficiently accurate. The method was implemented in Python-in particular, the stochastic gradient optimization used the automatic differentiation features and the Adam optimizer (Kingma and Ba 2020) in TensorFlow 2.3.1. The hyperparameters of the priors were set and the variational parameters were initialized according to the default implementation outlined in Appendix E-in particular, an agglomerative clustering algorithm (Pedregosa et al. 2011) run on the standardized outputs {y_n}_n was used to initialize the expert means. For all experiments, the maximum number of iterations in the mean-field variational Bayes algorithm was set to 20, each outer iteration comprising 50 stochastic gradient sub-iterations for the gate optimization. An early-stop mechanism was employed, terminating the algorithm if the inner optimization leaves the gate scale matrix non-significantly changed for three consecutive iterations-here, a non-significant change is defined as a run for which the inner iteration number and objective value have positive estimated correlation and for which non-correlation cannot be rejected at significance level 0.01. All of the reported execution times are with respect to a computational service setup with a 12-core Intel Cascade Lake CPU platform, 85 GB of RAM and an NVIDIA A100 GPU with 40 GB of memory.
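The early-stop criterion above can be sketched as a simple correlation test on the inner-iteration objective trace; the function name and interface are ours:

```python
import numpy as np
from scipy.stats import pearsonr

def non_significant_change(objective_values, alpha=0.01):
    """Early-stop test sketch for one inner run: the change is deemed
    non-significant if the objective is positively correlated with the
    iteration index but the hypothesis of zero correlation cannot be
    rejected at level alpha (i.e. the upward trend is not significant)."""
    iters = np.arange(len(objective_values))
    r, p = pearsonr(iters, objective_values)
    return bool(r > 0 and p > alpha)
```

The outer loop would terminate once this returns True for three consecutive outer iterations.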
A comparison was made between the results obtained from the mixture-of-experts model and those from a conditional Gaussian DPMM model (Cruz-Marcelo et al. 2013), with posterior inference performed using a variational Bayes algorithm (Blei and Jordan 2006; Pedregosa et al. 2011), as well as a Gaussian process baseline model (GPy 2012; Rasmussen and Williams 2006). For the DPMM, we used a Gaussian likelihood p(y | θ) = N(y | µ, Σ) with θ = (µ, Σ), θ | G ∼ G and G ∼ DP(α, G_0), where α = 1 and G_0 was set to a conjugate normal-inverse-Wishart distribution, with hyperparameters selected analogously to the priors in our mixture-of-experts model. Using the stick-breaking representation of the Dirichlet process and a mean-field variational Bayes algorithm, the posterior predictive distribution may be Monte Carlo-approximated by a truncated Gaussian mixture (Blei and Jordan 2006). In particular, regarding each input-output concatenation (x_n, y_n) in our dataset {(x_n, y_n)}_n as a data point in the DPMM, the posterior predictive distribution p(x*, y* | {(x_n, y_n)}_n) of a new pair (x*, y*) is mixture-Gaussian, and the associated conditional distribution p(y* | x*, {(x_n, y_n)}_n) is thus again mixture-Gaussian (Cruz-Marcelo et al.
2013). On the other hand, the Gaussian process baseline model used a radial basis function covariance k : X² → R with automatic relevance detection (Neal 1994), written as

k(x, x′) = σ² exp(−(1/2) Σ_i w_i (x_i − x′_i)²),

where the hyperparameters {w_i}_i and σ² were fitted by evidence maximization. For the second and third datasets, where the output is multidimensional, Gaussian processes were fitted for the marginal distributions of the components separately, assuming independence between the components-this is done purely for simplicity, as the Gaussian process model mainly serves as a baseline method in our experiments. Apart from the KL divergence, the Hellinger distance d_H(p, q) = ‖√p − √q‖_{L²}/√2 and the total variation distance d_TV(p, q) = ‖p − q‖_{L¹}/2 were used as evaluation metrics for the test cases using synthetic data. The results of the comparison are shown in Table 1.
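For densities evaluated on a one-dimensional grid, the two distances above can be computed directly from their definitions; this is a small sketch, with the grid discretization as our own simplification:

```python
import numpy as np

def hellinger(p, q, dx):
    """Hellinger distance ||sqrt(p) - sqrt(q)||_{L2} / sqrt(2) for densities
    p, q sampled on a uniform grid with spacing dx."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx / 2.0)

def total_variation(p, q, dx):
    """Total variation distance ||p - q||_{L1} / 2 on the same grid."""
    return 0.5 * np.sum(np.abs(p - q)) * dx
```

Both metrics take values in [0, 1] for probability densities, with 0 for identical densities and 1 for densities with disjoint support, which makes them convenient complements to the (unbounded) KL divergence.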
Table 1: Comparison of evaluated performance metrics between the proposed similarity-based mixture-of-experts (SBMoE) model, the conditional Dirichlet process mixture model (C-DPMM) and the baseline Gaussian process (GP) model. For the synthetic datasets, the KL divergence, the Hellinger distance and the total variation distance were used, whereas the mean negative log-likelihood was used for the other datasets due to the true predictive distribution being unknown.


One-dimensional synthetic data
In this example, the observed data {(x_n, y_n)}_{n=1}^N consists of N = 2000 exchangeable copies of (x, y) = ((x_1, x_2), y) following the generative model with τ independent of x and ζ, where we use LN, Be and Γ to denote the log-normal, the Bernoulli and the gamma distribution (with shape-rate parameterization), respectively. As such, the predictive distributions are both skewed and possibly bimodal. The intent of choosing a relatively large data size was to demonstrate the capability of the proposed mixture-of-experts model to approach the complex predictive distributions with growing data-for smaller datasets, while the methods still work well, we found it hard to evaluate and compare the methods in any meaningful way. Using C = 32 experts for our proposed mixture-of-experts model (note that the expert locations are fixed with respect to x, thus the need for a relatively high number), approximate posteriors were computed using the proposed mean-field variational Bayes algorithm. The training of the mixture-of-experts, Gaussian process and conditional DPMM models took 553, 14 and 75 seconds on the aforementioned computational setup. A comparison of the estimated predictive distributions is shown in Figure 1 alongside a kernel density estimate of the true predictive density using 5000 samples from the true predictive distribution, a Gaussian kernel and Scott's rule (Scott 1992) for bandwidth selection. The evaluated performance metrics are summarized in Table 1. It is apparent that our mixture-of-experts model performed better than the conditional DPMM-which was, in turn, better than the Gaussian process model-in terms of performance metrics as well as visual inspection. Specifically, the flexibility of our model allows for a much better fit to the bimodal and skewed distributions even compared to the conditional DPMM, which was often not able to capture bimodal behavior despite also having mixture-Gaussian predictive distributions.
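The reference-density construction above (a Gaussian-kernel KDE with Scott's rule) is the default behavior of `scipy.stats.gaussian_kde`; a small sketch, where the skewed-bimodal samples below merely stand in for draws from the true predictive distribution:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in samples: a skewed, bimodal mix (NOT the paper's generative model,
# whose exact specification is given in the text's displayed equation).
samples = np.concatenate([rng.normal(-2.0, 0.5, 2500),
                          rng.lognormal(0.0, 0.4, 2500)])

kde = gaussian_kde(samples)          # Scott's rule is the default bandwidth
grid = np.linspace(-4.0, 4.0, 200)
density = kde(grid)                  # reference density for the comparison
```

The estimated models' predictive densities can then be compared against `density` on the same grid using the metrics from Table 1.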

Two-dimensional synthetic data
To illustrate the advantages of our model in estimating predictive distributions with inter-component dependencies, we now use a 32-dimensional input x = (x_i)_{i=1}^{32} and a two-dimensional output y = (y_1, y_2). Define the transformation ϕ(ξ) = (N(0.7i | ξ, 4))_{i=0}^{7}, and denote by F the discrete Fourier transform. The observed data consists of N = 2000 data points drawn from the generative model where τ is independent of η and ζ. In particular, the log-normal generator η is observed through its obfuscated counterpart x-the map ϕ creates spikes centered at its inputs η_1, η_2, represented as eight-dimensional vectors, whose real and imaginary Fourier components are concatenated into a 32-dimensional x. Given η, we then draw ζ as an inverse-Wishart matrix whose diagonal components are log-transformed and put into y, with the sign of the second component flipped with probability 0.5. Again, although all methods remained stable for smaller datasets, a relatively large data size was found necessary for the comparison to clearly showcase key qualities of the methods. Using C = 64 classes and running the mean-field variational Bayes algorithm, we obtain the results visualized in Figure 2, where estimated densities for our mixture-of-experts model, the conditional DPMM and the per-component Gaussian process model are shown in comparison to a kernel density estimate of the true predictive density, using 10000 samples from the true predictive distribution and again a Gaussian kernel and Scott's rule for bandwidth selection. The training of the mixture-of-experts, Gaussian process and conditional DPMM models took 1116, 311 and 245 seconds. The evaluated performance metrics are summarized in Table 1. Again, besides having better validation metrics, our model is clearly more successful than the Gaussian process model in capturing the characteristics of the bimodal shape of the true predictive distribution. More surprisingly, the conditional DPMM, while also being able to recognize
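The input construction can be sketched as follows. This is our reading of the description above: we take N(0.7i | ξ, 4) to mean a Gaussian density with variance 4 (scale 2), and concatenate the real and imaginary DFT components of the two spike profiles:

```python
import numpy as np
from scipy.stats import norm

def phi(xi):
    """Spike profile: Gaussian density N(0.7*i | xi, variance 4), i = 0..7."""
    i = np.arange(8)
    return norm.pdf(0.7 * i, loc=xi, scale=2.0)

def make_input(eta1, eta2):
    """32-dimensional input x: real and imaginary DFT components of the two
    eight-dimensional spike profiles, concatenated (our reading of the text)."""
    parts = []
    for eta in (eta1, eta2):
        f = np.fft.fft(phi(eta))
        parts.append(np.concatenate([f.real, f.imag]))
    return np.concatenate(parts)
```

Each 32-dimensional x is thus a deterministic function of the two-dimensional η, which is what makes the input space high-dimensional yet effectively sparse.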
bimodal distributions, yielded much worse performance metrics than both our model and the Gaussian process model. This may be explained by the conditional DPMM's generative modeling using mixture models, which presents difficulties in settings in which inputs are of relatively high dimension but sparsely distributed-note that each 32-dimensional x is uniquely determined by the associated two-dimensional η. In turn, this may lead to the underestimation of uncertainty we observe for the conditional DPMM, causing the considerably worse performance metrics. In contrast, the mixture-of-experts and Gaussian process models are both discriminative and similarity-based and therefore relatively insensitive to the complexities of the input space.

Radiation therapy data
We also demonstrate our method on a dataset of historically delivered radiation therapy treatment plans for postoperative prostate cancer patients from the Iridium Cancer Network in Antwerp, Belgium. Prediction of radiation dose statistics based on patient geometry is of particular interest in fields such as automated treatment planning and quality assurance (Ge and Wu 2019). The dataset consists of 94 treatment plans, for which features have been extracted from patient images and dose statistics have been computed, of which 84 were used as training data. For our purposes, attention was restricted to the bladder region-the features we used are signed distance transforms computed from the bladder region to itself, the prostate planning target volume, the seminal vesicles planning target volume and the rectum region, binned into a total of 44 features; the dose statistics we used are the dose-at-volume values at 40 and 20 percent in the bladder region, corresponding to the 0.6- and 0.8-level quantiles of the dose delivered to the region. Using C = 32 mixture classes, the training for the mixture-of-experts, Gaussian process and conditional DPMM models took 393, 2 and 1 seconds, respectively. The results of the comparison to the conditional DPMM and the Gaussian process model are presented in Table 1 and Figures 3 and 4.
In terms of mean negative log-likelihood over the test dataset, the mixture-of-experts model had comparable and only slightly better performance relative to the conditional DPMM and the Gaussian process model, respectively. More importantly, however, it managed to follow the S-shaped distribution of the data points in its predictions, whereas the Gaussian process model sometimes predicted far outside the range of reasonable values. From Figure 4, we also see that the learned distance metric d_X produced a varying number of near neighbors, implying an accordingly varying degree of uncertainty. It is especially interesting to note that this was possible using such a small training dataset with relatively high input dimension, which we regard as a sign of robustness against data scarcity.

California housing data
Here, we consider the California housing price dataset (Pace and Barry 1997), which comprises variables for 20640 housing districts in California, United States. In particular, for each district, we have the longitude-latitude pair of its geographical location, the median age, the total number of rooms, the total number of bedrooms, the population, the number of households, the median income, the median house value and the ocean proximity. After removing irregular data points and the categorical ocean proximity variable, and replacing the total numbers of rooms and bedrooms by the corresponding means per household, 18299 data points remain, with seven input and two output dimensions. Furthermore, subsampling with uniform probability, we use 3000 training and 100 test data points for the numerical experiments. For our purposes, we will try to predict the latitude-longitude pair of each out-of-sample district given the other seven features. This is a nontrivial task for several reasons: the signal-to-noise ratio is relatively small with the available input features, the map from features to latitude-longitude pair is likely rather complex, and the predictive distributions are in general multimodal as similar housing districts may occur in areas far away from each other. Using C = 32, training of the mixture-of-experts model took 194 seconds, compared to 196 seconds for the Gaussian process and 229 seconds for the conditional DPMM. Table 1 contains performance metrics for the different methods, and Figure 5 shows example predictions for six random out-of-sample test inputs. We see that the mixture-of-experts model and the conditional DPMM, in contrast to the Gaussian process, are able to produce multimodal predictive distributions following the geographical shape of the data, successfully identifying housing district clusters such as cities. Moreover, the mixture-of-experts model performs better in terms of mean negative log-likelihood than the other models.
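The preprocessing steps described above can be sketched in pandas; the column names follow the commonly distributed CSV release of the dataset and are an assumption here, as is the use of row dropping for the irregular data points:

```python
import pandas as pd

def preprocess(df):
    """Preprocessing sketch for the California housing data: drop incomplete
    rows and the categorical variable, and replace the district-level totals
    of rooms and bedrooms by per-household means."""
    df = df.dropna()                                  # irregular data points
    df = df.drop(columns=["ocean_proximity"])          # categorical variable
    df["rooms_per_household"] = df["total_rooms"] / df["households"]
    df["bedrooms_per_household"] = df["total_bedrooms"] / df["households"]
    return df.drop(columns=["total_rooms", "total_bedrooms"])
```

The remaining seven non-geographic columns then form the input, with the latitude-longitude pair as the two-dimensional output.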
In particular, as is the case for the two-dimensional synthetic dataset, the conditional DPMM tends to slightly underestimate the predictive uncertainties.

Africa soil data
In this last experiment, we consider the prediction of soil functional properties from infrared spectroscopy measurements as a means of rapid, low-cost analysis of soil samples. Especially in data-sparse regions such as Africa, such tools are important for planning sustainable agricultural intensification and natural resource management. The dataset (Vågen et al. 2010) contains 1157 data points, each having 3578 dimensions of diffuse-reflectance infrared spectroscopy measurements at different wavelengths, 16 features related to the geography of the location at which the sample was collected, and five output variables comprising the soil organic carbon, the pH value and the calcium, phosphorus and sand content-for more details, see Vågen et al. (2010). Using 1000 data points for training and reserving the rest for testing, the 3578-dimensional spectroscopy measurements are dimensionality-reduced by a multi-level discrete wavelet transform (Mallat 2008) using a Daubechies 4 wavelet into 74-dimensional representations-thus, the problem considered amounts to having dim X = 90 and dim Y = 5.
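The idea of multi-level wavelet reduction can be sketched as follows; for brevity we use the Haar filter as a stand-in for the Daubechies 4 wavelet used in the text, keeping only the low-pass (approximation) coefficients at each level:

```python
import numpy as np

def dwt_reduce(signal, levels):
    """Multi-level discrete wavelet reduction sketch (Haar filter as a
    stand-in for Daubechies 4): at each level, convolve with the low-pass
    filter and downsample by two, halving the representation length."""
    a = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if len(a) % 2:                       # pad to even length if needed
            a = np.append(a, a[-1])
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)  # Haar low-pass + downsample
    return a
```

In practice a library such as PyWavelets (`pywt.wavedec` with the `'db4'` wavelet) would be used for the actual Daubechies 4 decomposition; the sketch only illustrates how repeated low-pass filtering shrinks a 3578-dimensional spectrum to a compact representation.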
With C = 64 classes, training of the mixture-of-experts model, the Gaussian process and the conditional DPMM ran for 803, 1279 and 24 seconds, respectively. Table 1 shows the evaluated performance metrics, and Figure 6 shows the single and pairwise marginal predictive distributions for a randomly chosen test input. Again, the mixture-of-experts model performs better than the competitor models, with the conditional DPMM obtaining a particularly bad score. We can see from the example prediction in Figure 6 that our mixture-of-experts model is able to follow the data distribution well and predict the ground truth accurately, with varying degrees of uncertainty for the different output dimensions-in comparison, the Gaussian process is only able to output an ill-fitted Gaussian predictive distribution, and the predictions from the conditional DPMM are mostly far away from the ground truth. Indeed, it is reasonable to believe that the particularly bad performance of the conditional DPMM is due to the relatively high dimension of the data, where fitting a mixture model to dim X + dim Y = 95 dimensions and taking the conditional given a 90-dimensional input is likely to be extremely sensitive to calibration errors.

Discussion
Motivated by the disadvantages of commonly used parametric models when applied to data with complex input-output relationships, in this paper, we have built upon previous work on probabilistic k-nearest neighbors by Holmes and Adams (2002), Cucala et al. (2009) and Friel and Pettitt (2011), extending the ideas to a regression setting and using Gaussian mixtures as a flexible means of representing predictive distributions. We use a conditionally specified model in which full conditionals are defined and the joint likelihood is replaced by a pseudolikelihood in order to render further computations tractable. In particular, in contrast to Friel and Pettitt (2011), we also regard the precision matrix Λ in the distance metric d_X as a model parameter on which posterior inference may be performed. With a nonparametric approach, we avoid the need to explicitly learn the mapping between input and output as, for example, in conventional mixture-of-experts models such as those described in Jacobs et al. (1991), Bishop and Svensén (2003) and Xu et al.
(1995), all while maintaining a Bayesian approach. Also in contrast to, e.g., the probabilistic k-nearest neighbor models by Holmes and Adams (2002), our method relies on discriminative modeling by comparing similarities between inputs. The computational study demonstrates on several synthetic as well as real-world datasets the ability to model irregular, multivariate predictive densities, while being able to quantify uncertainties accurately for medium-sized to very small datasets. Specifically, we showed that our mixture-of-experts model outperforms the conditional DPMM and the Gaussian process baseline in certain settings-for example, those in which the input space is relatively high-dimensional yet sparse and thus unsuitable for generative modeling using mixture models. While generally being more computationally demanding than the conditional DPMM and the Gaussian process, the execution times for training the mixture-of-experts model on the given computational setup were certainly short for a high-dimensional Bayesian model. Based on the method we describe in this paper, there are ample opportunities for interesting future work. A drawback of the method is its lack of scalability to massive data, an issue common to all nonparametric models in which the training data is explicitly used in the predictive pipeline. In our case, the main difficulties stem from the need to compute the posterior probabilities ω^l_{c,nn′} at each iteration l, which requires storage in an array of CN² elements. One possible approach to addressing this is to use Bayesian coresets (Huggins et al.
2016), which are small weighted subsets of the data constructed so as to optimally summarize it; another is to take inspiration from the sparse pseudo-input Gaussian process literature (Quiñonero-Candela and Rasmussen 2005). A further drawback is the subtle but important assumption that the input space X is a vector space-while this is true for all datasets considered in our computational study, when handling e.g. image data, one would need some form of preprocessing by a feature-extracting transformation φ : I → X from the image space I to the vector space X. If, for example, φ is a neural network, one can optimize the network weights using the same optimization problem as in Proposition 4 by regarding the weights as hyperparameters. One may also investigate sparse representations of the scale matrix Λ^l for the case of dim X being too large to store the full Cholesky factorization of Λ^l. Lastly, to address the problem of choosing an appropriate number of experts, yet another direction of future work would be to explore the possibility of combining the present model with Dirichlet process mixtures, further conforming to the philosophy of Bayesian nonparametrics.
which is identified as the probability mass function of a (three-dimensional) categorical distribution-note, in particular, that this conclusion may be drawn regardless of the forms of the distributions q^l(Λ) and q^l({(µ_c, Σ_c)}_c). Computing the expectations according to Lemma 1, the expressions for log ω^{l+1}_{c,nn′} follow.

A.2 Proof of Proposition 2
Again, starting from the log-pseudojoint in (4), we apply the local variational approximations (6), (7), use the fact that E_{q^l(u_n, z_n)}[1{z_n = c} 1{u_n = n′}] = ω^l_{c,nn′}, collect relevant terms and exchange orders of summation. Note that this leads to the sought decomposition q^l({(µ_c, Σ_c)}_c) = ∏_c q^l(µ_c | Σ_c) q^l(Σ_c) for all l, with the stated forms following from the definitions of κ^l_c and µ^l_c. Regarding the positive definiteness of Σ^l_c in relation to the r^l_{nc}, suppose first that r^l_{nc} > 0 for all n. The claim then follows by Cauchy-Schwarz-here, y is the row-augmented matrix (µ_0, (y_n)_n), and the inner product and norm are defined as ⟨x, x′⟩ = κ_0 x_0 x′_0 + Σ_{n=1}^N r^l_{nc} x_n x′_n and ‖x‖² = ⟨x, x⟩, which are well-defined as all r^l_{nc} and κ_0 are positive. This shows the positive definiteness of K^l_c and thus also of Σ^l_c.

A.3 Proof of Proposition 3
By the same procedure as in Appendices A.1 and A.2, the desired result follows, where we have used the facts that Σ_{n′≠n} Σ_c ω^l_{c,nn′} = 1 and Σ_{n′≠n} t^l_{nn′} = 1.

Appendix B EM updates for local variational approximation
Following Section 3.2.1, we are to maximize the expected logarithm of the linearized pseudolikelihood with respect to the variational parameters s corresponding to the second transitions. The form of the objective as an integral over the optimization parameters motivates the use of an EM algorithm, of which only one iteration is needed in Algorithm 1. Note that the rightmost expression in (5) is invariant under uniform shifts of η, so we may assume without loss of generality that Σ_i e^{η_i} = 1. Letting ζ_i = e^{η_i} for all i, where ζ_i ≥ 0 and Σ_i ζ_i = 1, we can rewrite (5) as an equivalent approximation. Using this, the first EM update at iteration l in the mean-field variational Bayes algorithm may then be approximated as the minimizer over s with nonnegative entries such that Σ_c s_{nc} = 1 for all n-that is, we should set the update s^{l+1} to the s for which each s_n maximizes Σ_c s_{nc} (E_{q^l(µ_c, Σ_c)} log N(y_n | µ_c, Σ_c) − log s_{nc}).
However, this amounts to solving N separate constrained nonlinear optimization problems at each l, which would be too computationally expensive for most practical applications. We can circumvent this by removing the entropy term −Σ_c s_{nc} log s_{nc} above so that the optimization problem becomes linear-indeed, since the entropy term must lie in the interval [0, log C] whereas the other, linear term is unbounded, the expression may be regarded as approximately linear in s_n. Finally, in order to enforce the requirement of r^l_{nc} = Σ_{n′≠n} (ω^l_{c,nn′} + ω^l_{c,n′n} − s^l_{nc} Ω^l_{n′n}) > 0 in Proposition 2, we add the corresponding constraint to each M-step in our EM algorithm and threshold the resulting r^l_{nc} to be at least some small value. Thus, we recover the optimization problem (8) for optimizing the second-transition variational parameters. In particular, the stated identity follows directly from Lemma 1.
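The effect of dropping the entropy term can be illustrated with a toy sketch: the remaining objective is linear in each row s_n, so its simplex-constrained maximizer is a one-hot vector at the largest coefficient. The function name is ours, and the positivity constraint on r discussed above is deliberately omitted:

```python
import numpy as np

def linearized_s_update(A):
    """Linearized M-step sketch: given the coefficient matrix A with entries
    a_nc = E[log N(y_n | mu_c, Sigma_c)], each row s_n of the maximizer of
    the linear objective sum_c s_nc * a_nc over the probability simplex is
    the one-hot vector at argmax_c a_nc."""
    N, C = A.shape
    s = np.zeros_like(A)
    s[np.arange(N), A.argmax(axis=1)] = 1.0
    return s
```

In the actual algorithm, the additional constraint keeping each r^l_{nc} positive prevents this degenerate hard assignment from invalidating Proposition 2.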

Appendix C Evidence lower bound
We derive an explicit expression for ELBO(q^l), which may be evaluated at each iteration l to monitor the progress of the variational approximation. Using the pseudolikelihood (3), we have a decomposition into contributions from the full conditional, the prior and the variational posterior. For the contribution from the full conditional, we again use the local variational approximations (6) and (7), with computations analogous to those in Appendices A.1, A.2 and A.3. The contribution from the prior and that from the variational posterior are then computed analogously.

Appendix D Stochastic gate optimization D.1 Control variates for variance reduction
In Proposition 4 and in the gradient-based versions of Proposition 1 and the ELBO, expectations of log-sum-exp expressions over samples of A appear. As the computation of the expression for each sample of A is performance-critical, we devise control variates to reduce the variance compared to that of the standard Monte Carlo approximation. Since variance-minimizing weights cannot be obtained in closed form, we will subtract a linearized version of the log-sum-exp expression and adjust with its analytically tractable expectation. In particular, the log-sum-exp function may be linearized around a location (η_i)_i, with (ζ_i)_i = softmax((η_i)_i). In our case, introducing (ζ_nn′)_{n′≠n} as the softmax-transformed location of the linearization for each n, the linearized log-sum-exp expression is, up to a constant, a quadratic form with expectation −(η_0/2) Σ_{n′≠n} ζ_{nn′} (x_n − x_{n′})^T LL^T (x_n − x_{n′})-from this, the corresponding estimator with control variates follows. With L^l being the Cholesky factor of the scale matrix at coordinate ascent iteration l, for the optimization of L^{l+1}, we may set the location of the linearization to the expected values at iteration l.
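The generic structure of such an estimator can be sketched as follows: subtract per-sample evaluations of a surrogate g whose expectation is known analytically, then add that expectation back. The function and argument names are ours:

```python
import numpy as np

def cv_estimate(f_samples, g_samples, g_mean):
    """Control-variate Monte Carlo estimator of E[f]: average f - g over
    the samples and add back the known expectation E[g] = g_mean. The
    estimator is unbiased, and its variance is Var(f - g)/K, which is
    small when g (here, the linearized log-sum-exp) tracks f closely."""
    f = np.asarray(f_samples, dtype=float)
    g = np.asarray(g_samples, dtype=float)
    return np.mean(f - g) + g_mean
```

In the extreme case where the surrogate matches the target exactly, the estimator returns the analytic expectation with zero variance, regardless of the number of samples.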

D.2 Center matrix computation
In the optimization objective in Proposition 4, the center matrix needs to be computed and stored in memory. To avoid evaluating the double sum of outer vector products by looping, it is easy to check that an equivalent expression in terms of the row-wise input data matrix x = (x_n)_n and Ω^l = (Ω^l_{nn′})_{n,n′} produces the same matrix. This allows for the use of more efficient vectorized matrix operations.
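A vectorized form can be sketched by expanding the outer products: the double sum Σ_{n,n′} Ω_{nn′} (x_n − x_{n′})(x_n − x_{n′})^T splits into diagonal-weighted Gram terms (via the row and column sums of Ω) minus the cross terms. The function name is ours:

```python
import numpy as np

def center_matrix(X, Omega):
    """Vectorized evaluation of sum_{n,n'} Omega[n,n'] *
    (x_n - x_{n'})(x_n - x_{n'})^T, where X has the inputs x_n as rows.
    Expanding the outer product gives two Gram terms weighted by the row
    and column sums of Omega, minus the cross term and its transpose."""
    r = Omega.sum(axis=1)                  # row sums of Omega
    c = Omega.sum(axis=0)                  # column sums of Omega
    S = X.T @ Omega @ X                    # cross term sum Omega[n,n'] x_n x_{n'}^T
    return X.T @ (r[:, None] * X) + X.T @ (c[:, None] * X) - S - S.T
```

This reduces the O(N² dim X²) loop over outer products to a handful of dense matrix multiplications, which map well onto the GPU setup used for the experiments.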
The degree-of-freedom parameter η_0 of the Wishart-distributed Λ is set to dim X + δ_g, where δ_g is referred to as the excess degrees of freedom for the gate prior. In turn, the scale matrix Λ_0 is set to the inverse of the sample covariance of the input variable x, scaled by λ_g/η_0, so that Λ has expectation equal to the inverse of the sample covariance of x up to a factor λ_g. A larger δ_g will imply a larger η_0, which will amplify the logarithmic determinant and trace terms in the objective in Proposition 4 and thus entail a higher degree of regularization of the "likelihood" term. Also, a higher λ_g will amplify Λ_0^{-1} in relation to Σ_n Σ_{n′≠n} Ω^l_{nn′} (x_n − x_{n′})(x_n − x_{n′})^T and, eventually, make the former dominate over the latter, in which case the scaling λ_g will roughly determine the magnitude of the center matrix. As argmin_A (−log det A + tr(AB)) = B^{-1} for all positive definite B, the logarithmic determinant and trace terms will drive the scale matrix candidate LL^T toward the inverse of the center matrix. Thus, increasing λ_g will in general have the effect of reducing the number of near neighbors of each input x_n. Similarly, for the experts, the degree-of-freedom parameter ν_0 > dim Y − 1 is set to dim Y + δ_e, where δ_e is the excess degrees of freedom for the expert prior. The scale matrix Σ_0 is then set to (λ_e ν_0)/C times the sample covariance of the output variable y, where λ_e is a scaling factor, so that each inverse-Wishart-distributed Σ_c has expectation equal to the sample covariance up to a factor λ_e/C. Moreover, the prior mean µ_0 is set to the sample mean of y, and the related hyperparameter κ_0 > 0 is set to a small number to indicate a relatively weak belief that each µ_c will lie near µ_0.
Regarding the number C of experts: as they are anchored at different locations across the output distribution, C controls the fine-grainedness of the predictive distributions. Thus, in general, C should be set as high as the computational resources allow, which is on the order of 10² for a common computer setup-in the implementation, the default is C = 32. If C is on the large side of the actual model complexity, experts will conglomerate to produce a model which is effectively the same as one with a smaller C. Hence, reducing C will reduce the set of possible predictive distributions, whereas an exaggerated C will generally not affect the model outputs much as long as computational aspects are disregarded. An important point is that while the choice of C is indeed a parameter to specify manually, thus moving away from a purely nonparametric paradigm, the situation is similar to that described in Blei and Jordan (2006), where a finite mixture is used as a variational approximation to the infinite-mixture DPMM and the number of mixture classes needs to be explicitly specified.
Apart from the hyperparameters of the priors, we also need to initialize the variational parameters for the mean-field variational Bayes iterations. At iteration l = 0, we set η^0 = η_0 and Λ^0 = Λ_0 for the gate. For the expert means, a clustering algorithm with a fixed number C of clusters is used, assigning to each µ^0_c the mean of cluster c. For each expert c, we then set Σ^0_c, ν^0_c and κ^0_c to the sample covariance of y, ν_0 and a relatively high number, respectively, the latter to reflect that we do not want excessive movement in the expert means, in order to avoid premature conglomeration of the mixture components. The variational parameters s and t are set such that s_{nc} ∝ 1 and t_{nn′} ∝ 1 with the required normalizations.
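The expert-mean initialization described above (agglomerative clustering on the standardized outputs, as in Section "Computational study") might be sketched as follows; the function name is ours:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def init_expert_means(Y, C):
    """Initialize the C expert means as per-cluster means of the
    standardized outputs, using agglomerative clustering with a fixed
    number of clusters (scikit-learn's implementation, as cited)."""
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)       # standardize outputs
    labels = AgglomerativeClustering(n_clusters=C).fit_predict(Ys)
    return np.stack([Ys[labels == c].mean(axis=0) for c in range(C)])
```

Anchoring the initial means at cluster centers spreads the experts across the output distribution, which complements the high initial κ^0_c in discouraging premature conglomeration.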

Figure 1: Comparison between a kernel density estimate of the true predictive density p(y* | x*, {(x_n, y_n)}_n) and the estimated counterparts from the mixture-of-experts model, the conditional DPMM and the Gaussian process, shown for six out-of-sample inputs x* of the one-dimensional synthetic data. The gray bars below the graph represent the training data points {y_n}_n, and the blue bars represent samples from the true predictive distribution.

Figure 2: Comparison between a kernel density estimate of the true predictive density p(y* | x*, {(x_n, y_n)}_n) and the estimated counterparts from the mixture-of-experts model, the conditional DPMM and the Gaussian process, shown for six out-of-sample inputs x* of the two-dimensional synthetic data. The gray dots represent the training data points {y_n}_n, and the blue dots represent samples from the true predictive distribution.

Figure 3: Comparison between the estimated predictive densities p(y* | x*, {(x_n, y_n)}_n) from the mixture-of-experts model, the conditional DPMM and the Gaussian process, shown for six random out-of-sample inputs x* of the radiation therapy data. The gray dots represent the training data points {y_n}_n, and the blue diamond shows the ground truth y*. Axes represent fraction of the prescription dose.

Figure 4: Visualization of the computed distances d_X(x*, x_n) to each training input x_n, shown for the same out-of-sample inputs x* of the radiation therapy data as in Figure 3. A darker purple color corresponds to a smaller distance. Axes represent fraction of the prescription dose.

Figure 5: Predictive distributions of geographic location for the SBMoE, C-DPMM and GP models versus ground truth for the California housing data.

Figure 6: Predictive distributions for the SBMoE, C-DPMM and GP models versus ground truth for the Africa soil data.