Bayesian shrinkage in mixture-of-experts models: identifying robust determinants of class membership

A method for implicit variable selection in mixture-of-experts frameworks is proposed. We introduce a prior structure where information is taken from a set of independent covariates. Robust class membership predictors are identified using a normal gamma prior. The resulting model setup is used in a finite mixture of Bernoulli distributions to find homogenous clusters of women in Mozambique based on their information sources on HIV. Fully Bayesian inference is carried out via the implementation of a Gibbs sampler.


Introduction
Modeling heterogeneity in datasets is a common problem in applied statistics. The task is to find underlying clusters of similar observations that can be used to describe the data. A widespread and well-known method to accomplish this is finite mixture modeling, where the main idea is to model a single probability distribution as the weighted sum of a finite number of mixture densities. This technique can be used for model-based clustering as well as density estimation. Finite mixtures are widely used in different research fields. A rather common application in marketing research is discussed in Lenk and DeSarbo (2000), who employ mixture modeling techniques to find clusters of customers with similar behaviour. Earlier references discussing marketing applications are Allenby and Ginter (1995) and Rossi et al. (1996). Lubrano and Ndoye (2016) use mixtures to find homogenous groups in a study of the income distribution of the United Kingdom. The model family also extends naturally to time series analysis, as shown in Frühwirth-Schnatter and Kaufmann (2008). An applied example is the Markov mixture model that Frühwirth-Schnatter et al. (2012) use to model earnings dynamics in the Austrian labour market. For a comprehensive overview of mixture models and estimation strategies, see Frühwirth-Schnatter (2006).
The main contribution of this article lies, however, in a popular extension of the standard mixture framework. In the most basic Bayesian mixture models, prior class membership is modeled using the component weights, that is, the relative sizes of the mixture clusters. Essentially, this means that the highest prior membership probability is assigned to the largest group in the population. This assumption implicitly claims that each observation has the same prior probability of belonging to a specific group, neglecting other observable characteristics of the data point. To make use of additional information, it is also possible to model mixture parameters as a function of external covariates. Such a specification usually allows for a richer interpretation of the model output and might permit a more holistic use of datasets. This modeling technique is usually referred to as a mixture of experts (MOE). Although the name originates in the machine learning literature, mixture of experts models have a wide range of applications, similar to standard mixture models. Gormley and Murphy (2008) develop a MOE model for rank data and Gormley and Murphy (2010) use the framework to model network data. MOE models also apply to time series (Huerta et al., 2003; Frühwirth-Schnatter et al., 2012) and longitudinal data (Tang and Qu, 2016). Related models have been discussed under different labels for quite some time, for instance switching regression models (Quandt, 1972) or concomitant variable latent-class models (Dayton and Macready, 1988). For a comprehensive overview of mixture of experts models, refer to Gormley and Frühwirth-Schnatter (2018).
This article focuses on a specific problem arising when dealing with mixture of experts models: there is severe model uncertainty regarding the relevant covariates to include when modeling prior class membership (as pointed out by Anderson et al., 2016). Both estimated coefficients and class membership estimates might be sensitive to the particular choice of explanatory variables included in the model.
One way to resolve this issue is to rerun the model using cross validation as a crude sensitivity analysis. However, the process of choosing which variables to include remains arbitrary.
To overcome this problem, we propose the use of a continuous shrinkage prior in latent class mixture modeling to isolate robust determinants of class membership. More specifically, we specify a Normal-Gamma prior (Griffin and Brown, 2010) and use the Pólya-Gamma sampler from Polson et al. (2013) for computations. This results in more efficient shrinkage and improved clustering when compared to related methods like standard stochastic search variable selection priors (George and McCulloch, 1993; introduced to latent class models in Ghosh et al., 2011), especially when the number of possible predictors is large and/or the sample size is small. The basic idea of the model is thus related, for instance, to the general framework presented in Villani et al. (2012). We illustrate this approach through simulation studies and an empirical example using Demographic and Health Survey (DHS) data from Mozambique.
The remainder of this work is organized as follows: Section 2 introduces the modeling framework. In Section 3, a simulation study is conducted to evaluate the performance of the proposed prior setup. Section 4 illustrates the framework in an application to HIV information source data from Mozambique. Section 5 concludes.

Mixture of Experts Models
Let $y_i$ denote an observation of data point $i = 1, \dots, N$. This dependent variable can be univariate or multivariate, discrete or continuous, or of a more general structure such as time series or network data. Let $x_i$ be a set of $P$ covariates of $y_i$, indexed by $p = 1, \dots, P$. Assume that $K$ clusters, indexed by $k = 1, \dots, K$, exist in the population and follow known probability density functions $f_k(\cdot \mid \Xi_k)$ with component specific parameters $\Xi_k$. Denote the component weights as $\eta_k(x_i)$, where $\eta_k(x_i) \geq 0$ and $\sum_{k=1}^{K} \eta_k(x_i) = 1$. The resulting mixture density is

$$p(y_i \mid x_i) = \sum_{k=1}^{K} \eta_k(x_i) \, f_k(y_i \mid \Xi_k). \tag{1}$$

We assume the component weights $\eta_k(x_i)$ and thus the component specific parameters $\Xi_k(x_i)$ to be a function of the concomitant variables $x_i$. These covariates influence the distribution of $y_i$ indirectly via the individual prior class membership probabilities $\Pr(S_i = k \mid x_i) = \eta_k(x_i)$. $S_i$ is the latent class membership indicator of individual $i$, where $S_i = k$ if $y_i$ belongs to cluster $k$. The relationship is modeled via a multinomial logit link with

$$\eta_k(x_i) = \frac{\exp(x_i \beta_k)}{\sum_{l=1}^{K} \exp(x_i \beta_l)},$$

where we set $\beta_K = 0$ to achieve identification of the model. This directly results in the interpretation of the coefficients in terms of a change in the log odds relative to the baseline category $K$.
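As a minimal sketch of the multinomial logit link described above, the following Python function maps one covariate vector and $K - 1$ coefficient vectors to the $K$ prior class membership probabilities, with the baseline coefficient vector fixed at zero. The function name `class_weights` and the example values are illustrative assumptions and are not taken from the article's implementation.

```python
import math


def class_weights(x, betas):
    """Prior class membership probabilities eta_k(x) under a multinomial logit link.

    x     : list of P covariate values for one observation
    betas : list of K-1 coefficient lists (each of length P);
            the baseline class K has its coefficients fixed at zero.
    """
    # Linear predictors for classes 1..K-1, plus 0 for the baseline class K.
    scores = [sum(b_p * x_p for b_p, x_p in zip(b, x)) for b in betas] + [0.0]
    # Subtract the maximum before exponentiating to avoid numeric overflow.
    top = max(scores)
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]


# Hypothetical example with P = 2 covariates and K = 3 classes.
weights = class_weights([1.0, 0.5], [[0.8, 1.0], [0.3, 0.0]])
```

The log of `weights[0] / weights[2]` equals the linear predictor of class 1, which illustrates the log-odds interpretation relative to the baseline class.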

Prior Specification
There are several ways to model Bayesian multinomial logistic regression. We choose the method proposed by Polson et al. (2013) and combine it with a normal gamma prior on the regression coefficients $\beta_{k,p}$, where $\tau^2_{k,p}$ denotes the local shrinkage parameter of coefficient $p$ in regression $k$. As opposed to Griffin and Brown (2010), who apply this prior to a standard regression model, we have to deal with $K - 1$ separate sets of coefficients in the multinomial logit framework. Thus, we do not use a single global shrinkage parameter $\lambda$, but introduce one global shrinkage parameter $\lambda_k$ per equation. This adds flexibility and allows variable selection to be conducted separately for each group of the multinomial logit. This is sensible given that the variables relevant for describing class membership may well differ between classes. This idea has recently been implemented in Bayesian time series analysis (see for example Bitto and Frühwirth-Schnatter, 2018; Kastner, 2018; or Huber and Feldkircher, 2017). To complete the prior setup, we specify a hierarchical prior structure for $\lambda_k$ and $\tau^2_{k,p}$. The priors on the component parameters $\Xi_k(x_i)$ are application specific. Linearity in parameters is assumed. The choice of values for the hyperparameters $c_0$, $c_1$ and $\theta$ is discussed in appendix A.
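For orientation, a normal gamma hierarchy with one global shrinkage parameter per equation can be written as follows. This is a sketch in one common parametrization, in the spirit of Griffin and Brown (2010) and Bitto and Frühwirth-Schnatter (2018). The exact scaling of the gamma distributions is an assumption here and may differ from the one used in the article; $G(a, b)$ denotes a gamma distribution with shape $a$ and rate $b$.

```latex
\begin{aligned}
\beta_{k,p} \mid \tau^2_{k,p} &\sim N\left(0, \tau^2_{k,p}\right), \\
\tau^2_{k,p} \mid \lambda_k &\sim G\left(\theta, \theta \lambda_k / 2\right), \\
\lambda_k &\sim G\left(c_0, c_1\right),
\qquad k = 1, \dots, K - 1, \quad p = 1, \dots, P.
\end{aligned}
```

Under this assumed form, a small shape parameter $\theta$ concentrates the prior mass of $\tau^2_{k,p}$ near zero while keeping heavy tails, which is what produces strong shrinkage of unimportant coefficients.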

Posterior Simulation
We implement a Gibbs sampler to draw the parameters from their full conditional posterior distributions using Markov chain Monte Carlo (MCMC) methods (Robert and Casella, 2013). The latent class membership indicator $S_i$ is drawn from a multinomial distribution $M(1; p_{i,1}, \dots, p_{i,K})$ with success probabilities $(p_{i,1}, \dots, p_{i,K})$, where

$$p_{i,k} \propto \eta_k(x_i) \, f_k(y_i \mid \Xi_k)$$

and $f_k$ denotes the probability density function of component $k$ of the mixture distribution. A posteriori, the regression coefficients are normally distributed with $\beta_k \mid \cdot \sim N(m_k, V_k)$. The parameters $m_k$ and $V_k$ of this normal distribution follow from the Pólya-Gamma data augmentation scheme of Polson et al. (2013) and depend on the class indicators $1(S_i = k)$, where $1(\cdot)$ denotes the indicator function.
$\omega_{i,k}$ is a latent auxiliary variable that is conditionally Pólya-Gamma distributed. Finally, the posterior distributions of $\lambda_k$ and $\tau^2_{k,p}$ are both of well-known form and involve the Generalized Inverse Gaussian (GIG) distribution. Hörmann and Leydold (2014) provide an efficient adaptive rejection sampling algorithm that makes it possible to easily draw from the GIG. This algorithm is implemented in the R package GIGrvg (Leydold and Hörmann, 2015), which we use in our computations. This completes the simulation setup.
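The full conditionals can be sketched as below. This is a reconstruction under stated assumptions and not a verbatim copy of the article's equations: the Pólya-Gamma step follows the multinomial logit scheme of Polson et al. (2013) with a zero-mean normal prior, and the shrinkage steps assume the parametrization $\tau^2_{k,p} \mid \lambda_k \sim G(\theta, \theta\lambda_k/2)$ and $\lambda_k \sim G(c_0, c_1)$ with rate-parametrized gamma distributions. The symbols $\kappa_{i,k}$, $c_{i,k}$, $\Omega_k$ and $T_k$ are introduced here for illustration.

```latex
\begin{aligned}
\kappa_{i,k} &= 1(S_i = k) - \tfrac{1}{2}, \qquad
c_{i,k} = \log \sum_{l \neq k} \exp(x_i \beta_l), \\
\omega_{i,k} \mid \cdot &\sim PG\left(1, \; x_i \beta_k - c_{i,k}\right), \\
V_k &= \left( X' \Omega_k X + T_k^{-1} \right)^{-1}, \qquad
m_k = V_k X' \left( \kappa_k + \Omega_k c_k \right), \\
\lambda_k \mid \cdot &\sim G\left( c_0 + \theta P, \;
  c_1 + \tfrac{\theta}{2} \sum_{p=1}^{P} \tau^2_{k,p} \right), \\
\tau^2_{k,p} \mid \cdot &\sim GIG\left( \theta - \tfrac{1}{2}, \;
  \theta \lambda_k, \; \beta^2_{k,p} \right).
\end{aligned}
```

Here $\Omega_k = \mathrm{diag}(\omega_{1,k}, \dots, \omega_{N,k})$, $T_k = \mathrm{diag}(\tau^2_{k,1}, \dots, \tau^2_{k,P})$, and $GIG(p, a, b)$ has density proportional to $x^{p-1} \exp\{-(a x + b/x)/2\}$. The sign in front of $\Omega_k c_k$ follows from expanding $\kappa_{i,k}\psi_{i,k} - \omega_{i,k}\psi_{i,k}^2/2$ with $\psi_{i,k} = x_i\beta_k - c_{i,k}$; other sources define $c_{i,k}$ with the opposite sign.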

Model Selection
Selecting the number of mixture components remains a challenging issue. Proposed solutions include reversible jump MCMC algorithms (Green, 1995) and shrinkage on the component weights (Malsiner-Walli et al., 2016). A further commonly used approach is to estimate the marginal likelihoods of models with different numbers of components and use these to decide how many components are suitable (see Celeux et al., 2018 for a review).
Estimating the marginal likelihood is a non-trivial integration problem that involves a number of possible numerical and computational issues. Starting with the purely statistical problem, we need to compute the marginal likelihood given by

$$p(y \mid M_G) = \int p(y \mid \Theta, M_G) \, p(\Theta \mid M_G) \, d\Theta,$$

where $M_G$ denotes the model with $G$ components and $\Theta = (\Xi_1, \dots, \Xi_K, \beta_1, \dots, \beta_{K-1})$ denotes the set of model parameters. In the overwhelming majority of cases, this integral does not have a closed form solution. However, several methods may be employed to estimate its value.
We use full permutation bridge sampling to estimate the marginal likelihood for model selection purposes. Bridge sampling was first introduced by Meng and Wong (1996) and has been thoroughly described for Markov switching and mixture models by Frühwirth-Schnatter (2004), who concludes that the bridge sampling estimator is the preferable estimator for the marginal likelihood of this model class and superior to related approaches like importance sampling (Geweke, 1989) or the harmonic mean estimator (Newton and Raftery, 1994).
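Let $\Theta_G$ denote the set of model parameters of the model with $G$ components. First, we need to construct an importance density $q(\Theta_G)$ and generate $L$ i.i.d. draws from this density, denoted by $\tilde{\Theta}_G^{(l)}$, $l = 1, \dots, L$. This importance density should have the same domain as the posterior distribution and closely resemble it (Gronau et al., 2017). As shown by Frühwirth-Schnatter (2006), the bridge sampling estimator can then be derived as

$$\hat{p}_{BS}(y \mid M_G) = \frac{L^{-1} \sum_{l=1}^{L} \alpha(\tilde{\Theta}_G^{(l)}) \, p^{*}(\tilde{\Theta}_G^{(l)} \mid y)}{M^{-1} \sum_{m=1}^{M} \alpha(\Theta_G^{(m)}) \, q(\Theta_G^{(m)})},$$

where $\Theta_G^{(m)}$, $m = 1, \dots, M$, are the $M$ posterior draws from the Gibbs sampler output using $G$ components and $p^{*}(\cdot)$ denotes the non-normalized posterior distribution. The choice of $\alpha(\Theta_G)$ is arbitrary; however, Meng and Wong (1996) discuss an asymptotically optimal choice which minimizes the expected relative error of the estimator. It is given by

$$\alpha(\Theta_G) \propto \frac{1}{L \, q(\Theta_G) + M \, p(\Theta_G \mid y)}.$$

Because this choice depends on the normalized posterior $p(\Theta_G \mid y) = p^{*}(\Theta_G \mid y) / p(y \mid M_G)$, the estimate has to be computed iteratively. The bridge sampling estimate of the marginal likelihood $\hat{p}_{BS}$ can be obtained using the following algorithm:

1. Run the MCMC sampler and save $M$ posterior draws $\Theta_G^{(m)}$.
2. Construct the importance density $q(\Theta_G)$ and generate $L$ draws $\tilde{\Theta}_G^{(l)}$ from the importance density.
3. Choose a starting value for $\hat{p}_{BS,0}$.
4. Run the following recursive process until convergence is achieved:

$$\hat{p}_{BS,t} = \frac{L^{-1} \sum_{l=1}^{L} \dfrac{p^{*}(\tilde{\Theta}_G^{(l)} \mid y)}{L \, q(\tilde{\Theta}_G^{(l)}) + M \, p^{*}(\tilde{\Theta}_G^{(l)} \mid y) / \hat{p}_{BS,t-1}}}{M^{-1} \sum_{m=1}^{M} \dfrac{q(\Theta_G^{(m)})}{L \, q(\Theta_G^{(m)}) + M \, p^{*}(\Theta_G^{(m)} \mid y) / \hat{p}_{BS,t-1}}}.$$

Note that in order to evaluate the non-normalized posterior distribution, it is necessary to use the marginal prior densities of the parameters that are specified using a hierarchical prior setup. The marginal prior of $\beta_{k,p}$ is available in closed form and involves $K_x(\cdot)$, the modified Bessel function of the second kind with index $x$, and the gamma function $\Gamma(\cdot)$ (see for instance Bitto and Frühwirth-Schnatter, 2018).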
A thorough discussion of the bridge sampling technique is beyond the scope of this article. However, the literature has so far been rather sparse on the practical computation of bridge sampling estimates in the context of mixture models and especially mixture of experts models. An exception is the recent review by Gormley and Frühwirth-Schnatter (2018, Section 12.3.3), who give details on the procedure for mixture of experts models. In general, both the construction and the evaluation of the importance density for mixture of experts models are challenging due to the multimodal nature of the likelihood function. In addition, the large number of likelihood evaluations involved in the bridge sampling procedure is subject to computational issues like numeric over- and underflow. Details on the implementation of a bridge sampler for the proposed model are therefore provided in appendix B.
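The recursive bridge sampling scheme of Meng and Wong (1996) can be sketched on a toy problem as follows. This is an illustrative sketch and not the implementation of appendix B: the function name `bridge_log_marginal`, the one-dimensional target $p^{*}(x) = \exp(-x^2/2)$ and the normal importance density with standard deviation 1.5 are assumptions chosen so that the true normalizing constant $\sqrt{2\pi}$ is known. The log ratios are shifted by their median before exponentiating to reduce the risk of numeric over- and underflow.

```python
import math
import random
from statistics import median


def bridge_log_marginal(log_ratio_post, log_ratio_imp, tol=1e-10, max_iter=1000):
    """Iterative bridge sampling estimate of the log normalizing constant.

    log_ratio_post : log p*(theta) - log q(theta) at the M posterior draws
    log_ratio_imp  : log p*(theta) - log q(theta) at the L importance draws
    """
    M, L = len(log_ratio_post), len(log_ratio_imp)
    shift = median(log_ratio_post)  # stabilizing constant, added back at the end
    e_post = [math.exp(v - shift) for v in log_ratio_post]
    e_imp = [math.exp(v - shift) for v in log_ratio_imp]
    r = 1.0  # starting value for the (shifted) marginal likelihood
    for _ in range(max_iter):
        num = sum(e / (L * r + M * e) for e in e_imp) / L
        den = sum(1.0 / (L * r + M * e) for e in e_post) / M
        r_new = num / den
        converged = abs(r_new - r) <= tol * r_new
        r = r_new
        if converged:
            break
    return math.log(r) + shift


def log_p_star(x):
    # Non-normalized toy target: standard normal kernel.
    return -0.5 * x * x


SD_Q = 1.5  # assumed standard deviation of the importance density


def log_q(x):
    # Normalized normal importance density with standard deviation SD_Q.
    return -0.5 * math.log(2.0 * math.pi) - math.log(SD_Q) - 0.5 * (x / SD_Q) ** 2


rng = random.Random(1)
post_draws = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # stand-in for MCMC output
imp_draws = [rng.gauss(0.0, SD_Q) for _ in range(4000)]   # draws from q
log_ml = bridge_log_marginal(
    [log_p_star(x) - log_q(x) for x in post_draws],
    [log_p_star(x) - log_q(x) for x in imp_draws],
)
```

In this toy setting `log_ml` should be close to $\tfrac{1}{2}\log(2\pi)$. For a mixture of experts model, the posterior draws would come from the permuted Gibbs output and the importance density would have to cover all modes of the posterior.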

Label Switching and Identification
Parameter estimation in this model family poses various difficulties, especially in a Bayesian framework. Label switching is a known issue when estimating mixture models (Jasra et al., 2005). It is the result of the multimodal likelihood function being invariant to relabeling the components. This can be problematic when implementing an MCMC sampler, as switching labels during the sampling process might result in heavily distorted posterior distributions and biased point estimates. Early approaches deal with label switching by introducing simple restrictions on the mixture parameters such as $\eta_1 < \dots < \eta_K$ (see for instance Lenk and DeSarbo, 2000). However, identifying simple restrictions in high-dimensional models might be cumbersome or infeasible. In addition, if the restriction does not provide balanced label switching, estimates of the marginal likelihood of the model might be biased according to Frühwirth-Schnatter (2004). Therefore, we follow Frühwirth-Schnatter (2006) and identify the posterior draws using a postprocessing procedure. To force balanced label switching, random permutation sampling is introduced.
The algorithm employed is based on the idea of clustering the parameter draws using distance-based measures in the point process representation of the MCMC output. After $M$ saved unconstrained MCMC iterations, k-means clustering is applied to all $MK$ posterior draws within a suitable parameter subset. The idea is that draws belonging to the same mixture component will be sorted into the same group by the clustering algorithm. The permutation sequence that results from this k-means procedure can then be used to reorder the posterior draws and obtain unique identification for further parameter inference. More formally, we use the following two-block algorithm:

1. MCMC Sampling
(a) Simulate parameters $\Theta^{(t)}$ conditional on the classification sequence $S^{(t-1)}$.
(b) Classify each observation $y_i$ conditional on $\Theta^{(t)}$.
(c) Select one of the $K!$ possible permutations of the component labels randomly. Use the resulting labeling sequence $\rho_t(1), \dots, \rho_t(K)$ to relabel both the parameter draw $\Theta^{(t)}$ and the classification sequence $S^{(t)}$.

2. Identification
(a) Arrange the MCMC draws in a matrix with $MK$ rows and $r$ columns, where $r$ denotes the number of parameters deemed necessary to identify the model, for instance after visually inspecting the MCMC output.
(b) Cluster all $MK$ draws using k-means clustering. This provides a classification sequence $\rho^{(t)}$ containing information on cluster membership of each parameter draw.
(c) Check whether $\rho^{(t)}$ is a permutation of $(1, \dots, K)$. If this is not the case, remove the draw.
(d) All remaining draws can be identified through reordering using the classification sequences $\rho^{(t)}$, which guarantees unique labeling. Consequently, the identified draws can be used for parameter inference.
Step 2(c) is implemented to ensure that we only use draws where a unique labeling can be found. By removing draws where $\rho^{(t)}$ is not a permutation of $(1, \dots, K)$, we remove draws where clusters are overlapping in the point process representation and thus no unique labeling is achievable. The ratio of removed draws to the number of saved MCMC draws can be used as an indicator for how well the model is able to separate the mixture clusters. A high rate of non-permutations usually points in the direction of an overfitting model.
Note that this identification approach carries a substantial benefit during the estimation process regarding the multinomial logit step. By construction, the coefficients of a multinomial logistic regression will not change when the baseline group remains the same and the labeling of the other groups changes. Hence, it is sufficient to find one dimension of the parameter space through which it is possible to identify one group in every MCMC draw. This group can then be used as baseline.
The identification of all other groups is achieved through the algorithm described above in a fully automatic manner.
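Steps 2(c) and 2(d) of the identification block can be sketched as follows. The sketch assumes that the clustering of step 2(b) has already been carried out by some k-means routine and that its labels are coded as $0, \dots, K-1$; the function name `identify_draws` and the toy numbers are hypothetical.

```python
def identify_draws(labels, draws, K):
    """Keep and reorder MCMC draws whose cluster labels form a permutation.

    labels : labels[t][k] is the cluster label (0..K-1) assigned to the
             k-th component draw of iteration t by the clustering step
    draws  : draws[t][k] is the parameter draw of component k in iteration t
    Returns the reordered draws and the number of removed iterations.
    """
    identified = []
    removed = 0
    for rho, theta in zip(labels, draws):
        # Step 2(c): discard iterations where rho is not a permutation.
        if sorted(rho) != list(range(K)):
            removed += 1
            continue
        # Step 2(d): move each component draw to the position of its label.
        ordered = [None] * K
        for component, label in enumerate(rho):
            ordered[label] = theta[component]
        identified.append(ordered)
    return identified, removed


# Toy example with K = 3 components and one identifying parameter per component.
toy_labels = [[0, 1, 2], [1, 2, 0], [0, 0, 2]]
toy_draws = [[0.1, 0.5, 0.9], [0.52, 0.88, 0.12], [0.1, 0.15, 0.9]]
identified, removed = identify_draws(toy_labels, toy_draws, 3)
```

The share `removed / len(toy_labels)` corresponds to the rate of non-permutations discussed above, which can serve as an informal indicator of how well the clusters are separated.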

Simulation Study
To illustrate the performance of the proposed prior structure, we conduct a simulation study to compare our approach to other possible model setups. We replicate and extend the simulations presented in Ghosh et al. (2011), who propose the stochastic search variable selection prior (George and McCulloch, 1993) for latent class models. Their statistical framework is similar to ours both in terms of model structure and computational approach. Therefore, a simulation-based comparison of the two models seems advisable. The stochastic search variable selection prior relies on the idea of specifying a mixture of two normal densities as prior for each multinomial logit coefficient. Both normal densities are centered at 0. One has a large variance ("slab") while the other one has a small variance ("spike"). Using standard mixture modeling techniques, it is possible to estimate whether a particular coefficient will be drawn from the slab or from the spike component of the mixture.
Formally, we specify

$$\beta_{k,p} \mid \delta_{k,p} \sim (1 - \delta_{k,p}) \, N(0, \zeta_1^2) + \delta_{k,p} \, N(0, \zeta_2^2),$$

where $\zeta_1^2 \ll \zeta_2^2$ and $\delta_{k,p}$ is the binary inclusion indicator of covariate $p$ in group $k$. For details see George and McCulloch (1993). As in Ghosh et al. (2011), we set $\zeta_2^2 = 1$. Instead of using a point mass at zero as first mixture component, we specify a normal spike component with $\zeta_1^2 = 0.01$. The approach of Polson et al. (2013) applying no shrinkage (hereafter "Standard Prior") and the proposal of Ghosh et al. (2011) (SSVS) are then compared to the normal gamma prior structure (NG) as proposed in this article.
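To illustrate how the spike and slab components separate small from large coefficients, the following sketch computes the conditional probability that a coefficient is assigned to the slab. The spike and slab variances are the values stated above; the prior inclusion probability of 0.5 and the function names are assumptions made for this example, since the text does not specify them.

```python
import math


def log_normal_pdf(x, var):
    """Log density of N(0, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def inclusion_probability(beta, spike_var=0.01, slab_var=1.0, prior_incl=0.5):
    """P(delta = 1 | beta) for a two-component normal spike-and-slab prior."""
    log_odds = (
        math.log(prior_incl) - math.log(1.0 - prior_incl)
        + log_normal_pdf(beta, slab_var) - log_normal_pdf(beta, spike_var)
    )
    # Evaluate the logistic function in a numerically stable way.
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


p_at_zero = inclusion_probability(0.0)   # coefficient exactly at zero
p_at_two = inclusion_probability(2.0)    # clearly non-zero coefficient
```

With these settings a coefficient at zero is attributed to the slab with probability 1/11, because the spike density is ten times higher at the origin, whereas a coefficient of 2 is attributed to the slab with probability very close to one.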
Using the data generating process in Eq. (2), we simulate four groups with 750 observations each and 20 explanatory variables to recreate the framework used in Ghosh et al. (2011). The true parameter vectors are chosen to be sparse, thus creating the need for considerable shrinkage within the estimation of the multinomial logistic regression. The true coefficient values are $\beta_1 = (0.8, 1, 2, 0.5, 0, \dots)'$, $\beta_2 = (0.3, 0, 0, 0, -1, 1.7, -2, 0, \dots)'$ and $\beta_3 = (0.3, 1, -2, 0.8, 0.9, 0, \dots)'$. All explanatory variables are drawn from a standard normal distribution. Note that this setup implies that we need to deal with group specific relevant membership predictors. Thus, to obtain good estimates, group specific shrinkage is necessary. As this simulation uses a large number of observations, the likelihood is quite informative. Hence, we extend the original simulation study by two scenarios using 300 and 100 observations, respectively. This should enable us to evaluate the prior performance in an environment with comparatively uninformative data.
In line with Ghosh et al. (2011), we implement a Gibbs sampler using 25000 draws after a burn-in period of 5000 draws. The mean estimates of 25 simulation runs are then compared.
Note to Table 1: The estimates correspond to the average value across 25 runs. RMSEs are multiplied by a factor of 100.
Table 1 reports the root mean squared error (RMSE) with respect to the coefficients that are truly zero, the coefficients that are truly different from zero, all coefficients and the predicted probabilities resulting from the estimation. This enables us to separately examine how well the priors are able to shrink unimportant coefficients to 0, how precise the point estimates are and whether they are able to give useful estimates of the predicted probabilities. These predicted probabilities are of utmost importance in the mixture of experts framework, as they directly influence class membership and therefore all estimated model parameters. The results suggest that the first simulation using 3000 observations is not a very challenging environment. The likelihood is quite informative, resulting in precise estimates even for the standard setup without shrinkage. Figure 1 plots the true values against the posterior mean estimates of the respective models. The scatterplots suggest that all three models are able to recover the true coefficient values well. Nevertheless, the NG setup performs particularly well and even outperforms the SSVS setup from Ghosh et al. (2011) in terms of precision. However, it comes at the cost of a slightly prolonged computation time.
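The RMSE breakdown described above, split by truly zero and truly non-zero coefficients, can be sketched as follows. The function names and the toy numbers are hypothetical, and the sketch assumes that both subsets are non-empty. The scaling by 100 mirrors the note to Table 1.

```python
import math


def rmse(estimates, truth):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


def rmse_breakdown(estimates, truth, scale=100.0):
    """Scaled RMSE for truly zero, truly non-zero and all coefficients."""
    zero = [(e, t) for e, t in zip(estimates, truth) if t == 0.0]
    nonzero = [(e, t) for e, t in zip(estimates, truth) if t != 0.0]

    def part(pairs):
        est, tru = zip(*pairs)  # assumes the subset is non-empty
        return scale * rmse(est, tru)

    return {
        "zero": part(zero),
        "nonzero": part(nonzero),
        "all": scale * rmse(estimates, truth),
    }


# Toy example: two non-zero and two zero coefficients.
toy_truth = [0.8, 1.0, 0.0, 0.0]
toy_estimates = [0.7, 1.1, 0.0, 0.2]
toy_result = rmse_breakdown(toy_estimates, toy_truth)
```

Reporting the three numbers separately makes it possible to see whether a prior gains its overall accuracy from shrinking noise coefficients or from estimating the signal coefficients precisely.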
Using just 10% of the observations, estimation becomes more difficult as the data become less informative, as seen in Figure 2. The point estimates become considerably worse. The standard prior struggles to recover the true values, as the enlarged RMSEs indicate. The NG prior shows slight advantages in terms of shrinkage and in predicting cluster membership probabilities. However, the SSVS setup is able to provide more accurate point estimates and therefore has a slightly lower RMSE with respect to the true non-zero coefficients and regarding the overall coefficient RMSE.
Further reducing the number of observations to $N = 100$ leads to inflated coefficient estimates when applying no shrinkage, as depicted in Figure 3. This results in enlarged RMSEs. The performance of the SSVS and NG prior remains rather similar to the case with $N = 300$. The SSVS prior produces …
We want to stress that one of the main priorities in model-based clustering, and in particular in mixture of experts frameworks, is to obtain precise estimates of the class membership probabilities for each observation. It is therefore of great importance that a model is able to provide precise estimates of the prior class membership probabilities $\eta_k(x_i \mid \beta)$ as given in Eq. 5. In addition, when dealing with variable selection, the ability to shrink unimportant coefficients to zero has higher priority than the precision of the point estimates of the non-zero coefficients. Thus, we argue that in a mixture of experts framework with model uncertainty, the proposed NG prior setup offers advantages compared to the standard framework and the SSVS prior.

HIV information sources in Mozambique
Mozambique is a country in Southeastern Africa that is considered one of the poorest and most underdeveloped countries in the world, scoring low in both economic and human development rankings.
In the year 2008, Mozambique had the 8th highest HIV prevalence in the world with 1,600,000 people infected (11.6% of the population) of whom around 990,000 were women and children.
According to the Joint United Nations Programme on HIV/AIDS, there are around 590,000 HIV orphans living in Mozambique, 180,000 of whom are infected with the virus themselves, a large part due to mother-child transmission. Of the infected population between the ages of 15 and 19, 75% is female. Moreover, a large gender disparity regarding the level of information on the disease can be observed. While around half of the male adolescent population has comprehensive knowledge on HIV, only 27.4% of adolescent women have enough information to adjust their behaviour to protect themselves and their children according to the United Nations Children's Fund. This disparity is suspected to be largely due to socioeconomic and sociocultural reasons, with the main drivers being traditional gender roles and religious involvement (Agadjanian, 2005).
Consequently, it is crucial to isolate channels that can be used by the government and non-governmental organizations to disseminate vital information on HIV, especially to the female population. Informing females about HIV has proven not only to decrease the infection rate but also to increase the economic and social independence of women (Audet et al., 2010). Our empirical example contributes to this important issue by clustering women in Mozambique into groups that are relatively homogenous with respect to their information sources on HIV, similar to Dias (2010). In addition, we use a large dataset of potential geographic and socioeconomic explanatory variables and isolate the most important factors that determine membership in those information clusters. The results may be used to derive, for instance, information campaign strategies for the respective subgroups.

Bayesian inference for mixtures of Bernoulli distributions
We use a set of binary variables that indicate whether a particular woman uses a specific source to gather information on HIV or not. A convenient choice of mixture distribution is the Bernoulli distribution, which proves useful when clustering binary vectors (see for example the vast literature on market segmentation; Wedel and Kamakura, 2012).
Let $y_i = (y_{i,1}, \dots, y_{i,J})$ be a $J$-dimensional vector of 0s and 1s, with elements indexed by $j = 1, \dots, J$, that describes the HIV information sources used by woman $i$. Assume that this vector is the realization of a binary multivariate random variable $Y = (Y_1, \dots, Y_J)$. Now suppose there exist $K$ groups in the population that cause differences in the occurrence probabilities $\gamma_{k,j} = \Pr(Y_j = 1 \mid S_i = k)$ in $K$ different classes for $J$ different binary variables. We can rewrite Eq. 1, where $y_i$ follows the mixture distribution

$$p(y_i \mid x_i) = \sum_{k=1}^{K} \eta_k(x_i) \prod_{j=1}^{J} \gamma_{k,j}^{y_{i,j}} (1 - \gamma_{k,j})^{1 - y_{i,j}}.$$

The $K$ components correspond to the latent classes in the population. This model is widely used in various research fields, starting as early as Lazarsfeld (1959). For details, see Frühwirth-Schnatter (2006). We assume that all probabilities $\gamma_{k,j}$ are a priori independent and specify a beta prior for each of them. Conditional on the latent class indicators $S_i$, the posterior distribution of each $\gamma_{k,j}$ is then again a beta distribution by conjugacy.
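The conjugate update of the occurrence probabilities can be sketched as follows. The prior hyperparameters `a0` and `b0` (set to 1 here, a uniform prior), the zero-based class labels and the function names are assumptions for illustration and are not taken from the article.

```python
import math


def beta_posterior_params(Y, S, K, a0=1.0, b0=1.0):
    """Beta posterior parameters of gamma_{k,j} given class indicators S.

    Y : list of N binary vectors of length J
    S : list of N class labels coded 0..K-1
    Returns params[k][j] = (a, b) of the Beta(a, b) posterior.
    """
    J = len(Y[0])
    ones = [[0] * J for _ in range(K)]   # number of 1s per class and item
    size = [0] * K                       # number of observations per class
    for y_i, s_i in zip(Y, S):
        size[s_i] += 1
        for j, y_ij in enumerate(y_i):
            ones[s_i][j] += y_ij
    return [
        [(a0 + ones[k][j], b0 + size[k] - ones[k][j]) for j in range(J)]
        for k in range(K)
    ]


def bernoulli_log_density(y_i, gamma_k):
    """Log density of a binary vector under independent Bernoulli items."""
    return sum(
        math.log(g) if y else math.log(1.0 - g) for y, g in zip(y_i, gamma_k)
    )


# Toy example with N = 3 observations, J = 2 items and K = 2 classes.
toy_Y = [[1, 0], [1, 1], [0, 0]]
toy_S = [0, 0, 1]
toy_params = beta_posterior_params(toy_Y, toy_S, K=2)
```

Within a Gibbs sampler, each $\gamma_{k,j}$ could then be drawn with `random.betavariate(a, b)` from the standard library, and `bernoulli_log_density` provides the component densities needed in the classification step.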

Data Description
We

HIV Information Sources
Information Source: Type of HIV information sources the women use. Dummies for TV, Newspapers / Magazines, Posters, Clinic / Healthworker, Church, School, Community Meetings, Friends / Relatives and Working Place.

Results
We estimate the model with the NG shrinkage prior for different values of $K$ and compare the resulting models using the marginal likelihood estimates obtained via bridge sampling. [6] We choose the model that maximizes the marginal likelihood. The bridge sampling estimates of the log marginal likelihood for $K = 2, \dots, 6$ are provided in Figure 4. The model with $K = 5$ scores highest and is therefore discussed below. [7]

[6] The model has been implemented in R (R Development Core Team, 2008). Computational time for $K = 5$ is around 50 minutes for 5000 draws after a burn-in period of 1000 draws on an Intel i7 @ 2.4 GHz.
[7] This is of course not the only way to proceed here. Especially in a development context, other, more informal model selection criteria that take into account long term campaigning strategies or financial constraints may be employed. For example, the groups "Modern & Educated" and "TV" could be merged as they are rather similar. However, in this paper the statistical possibilities of the proposed model are emphasized and hence we make use of the purely statistical approach.

The estimates for $\gamma_{k,j}$ are presented in Table 3. At first glance we find that the radio as well as friends and relatives seem to be important information sources for all groups. For the purpose of further interpretation of the model results, we name the groups with respect to their most distinctive HIV information source. Around 10% of the population use modern information sources such as television, newspapers and posters. In addition, this group obtains a relatively high amount of information from schools. Thus, we label this group as "Modern & Educated". A similarly sized group relies mostly on TV, but its members are highly unlikely to obtain information in schools. In contrast, the smallest group (around 6%) shows a very high dependence on schools when it comes to information on HIV. The fourth group, which comprises around a quarter of the female population of Mozambique, relies both on churches and local community meetings for obtaining information on HIV. The largest group (51.4%) has an above average dependence on friends and relatives in terms of information on the disease. Figure 5 provides a plot of the point estimates of the logit coefficients.

[Figure 5: Point estimates of the multinomial logit coefficients. Recoverable panel title: "Modern & Educated". Baseline group: "Friends/Relatives".]
When interpreting the multinomial logit coefficients, one has to keep in mind that the effects are always interpreted with respect to a baseline group. For convenience in the estimation process, we choose the largest group as baseline group ("Friends/Relatives"). In terms of the other categorical variables, the "baseline woman" resides in Maputo City, has no religion, is married and is a member of the richest wealth group.
Strong effects of the wealth distribution on the probability of being a member of the "Modern & Educated" and "TV" groups are observable. These groups also share a high probability of having access to a flush toilet and electricity. In addition, it is relatively unlikely that a woman lives in the countryside and is a member of one of those groups. These findings are in line with what theory suggests in a poverty-plagued country like Mozambique.
Unmarried women with above average education are most likely to rely on schools as HIV information sources. This seems puzzling, as female education is a primary development issue in various African countries. However, one should keep in mind that this group is extremely small and comprises just above 6% of the female population.
Interestingly, the socioeconomic variables are not strongly correlated with the probability of being a member of the "Church/Community" group. We can see that only the geographic variables determine prior class membership for this group, implying that we can find spatially clustered religious communities throughout specific provinces. Moreover, women who gather their HIV information predominantly from the church or local community meetings seem to be mostly Catholics, Protestants or African Zionists as opposed to, for instance, Muslims or members of other religions.
These results are particularly relevant to policy makers. Important insights that can be derived are, for instance, that HIV information campaigns targeted at disseminating educational materials via churches are likely to be more effective in Zambézia than in Sofala. It might also be a good idea to target folders that are distributed in schools towards single women as opposed to married women. However, these are mere examples. A detailed discussion of the policy implications of the results is beyond the scope of this article.

Concluding Remarks
Finite mixture models are a commonly used tool for model-based clustering and density estimation. They can be extended to mixture of experts models, which allow information from several covariates to be used when clustering dependent variables of arbitrary form. We propose the use of continuous shrinkage priors to find robust predictors of class membership in this context. This enables us to simultaneously identify underlying groups in a population, cluster observations into these groups and find the important predictors of group membership. In particular, we suggest a combination of the normal gamma prior (Griffin and Brown, 2010) and the Pólya-Gamma sampler (Polson et al., 2013) for implicit variable selection in a multinomial logistic regression that is used to model prior class membership.
This setup addresses the issue of model uncertainty that arises in this context and reduces the sensitivity of the model with respect to the included variables. The proposed framework slightly outperforms related approaches and allows for more precise clustering in setups with a large number of predictor variables.
We illustrate the model in a real data application, where we apply it with a mixture of Bernoulli distributions to HIV information sources of women in Mozambique. Model selection is based on the bridge sampling estimate of the marginal likelihood. We find five clusters of women who are relatively homogeneous with respect to their HIV information sources. Generally speaking, we find that wealth plays an important role in the access to information on HIV. Moreover, there seem to be spatially clustered religious or local communities that are among the most prevalent information sources for about 25% of the women of Mozambique.
Further research could compare the performance of different shrinkage priors in this context in more detail, as in Frühwirth-Schnatter and Wagner (2011).
One promising candidate is, for example, the Dirichlet-Laplace prior of Bhattacharya et al. (2015).
It might also be possible to extend various other Bayesian variable selection methods to mixture of experts frameworks, for example Bayesian compression (Guhaniyogi and Dunson, 2015) or its extension using targeted random projections (Mukhopadhyay and Dunson, 2017). Another interesting problem is how to carry over to a mixture of experts framework the idea of shrinkage on the prior class membership weights (e.g. Malsiner-Walli et al., 2016) for model selection purposes.
Finally, the forecasting performance of the model was not evaluated in this article and is left for further research.

A Choice of Hyperparameters & Priors
To enable estimation, it is necessary to choose values for various (hyper-)parameters appearing in the model setup. We set c0 = c1 = 0.01 to obtain a rather uninformative prior distribution, as for instance in Huber and Feldkircher (2017). However, the only hyperparameter that in fact influences inference is θ. Values close to zero induce rather heavy shrinkage, whereas higher values correspond to significantly less shrinkage. As the motivation of the empirical example is to isolate robust determinants of class membership and not to find precise point estimates, we set θ to the comparatively small value of 0.05 and take the risk of overshrinking some parameters.
We set θ to 0.1 in the simulation study. A thorough discussion of the choice and influence of θ can be found in Bitto and Frühwirth-Schnatter (2018).
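The effect of θ on the amount of shrinkage can be illustrated by simulating from the prior. The following is a minimal sketch only: it assumes the hierarchy β_j | τ_j² ~ N(0, τ_j²) with τ_j² ~ Gamma(θ, θλ/2) (shape, rate), a parameterization commonly used for the normal gamma prior, and it fixes the global parameter λ at 1 instead of placing a prior on it. The modified prior used in this article may differ, and the function name `sample_normal_gamma` is purely illustrative.

```python
import numpy as np

def sample_normal_gamma(theta, lam=1.0, size=100_000, seed=0):
    """Draw coefficients from an assumed normal gamma hierarchy.

    tau2 ~ Gamma(shape=theta, rate=theta * lam / 2), beta | tau2 ~ N(0, tau2).
    lam is held fixed here (an assumption made for illustration only).
    """
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the gamma distribution by shape and scale = 1 / rate.
    tau2 = rng.gamma(shape=theta, scale=2.0 / (theta * lam), size=size)
    return rng.normal(0.0, np.sqrt(tau2))

# Share of prior draws that are practically zero, for several values of theta.
shares = {}
for theta in (0.05, 0.1, 1.0):
    beta = sample_normal_gamma(theta)
    shares[theta] = float(np.mean(np.abs(beta) < 0.01))
    print(f"theta = {theta}: share of |beta| < 0.01 is {shares[theta]:.3f}")
```

Under these assumptions, smaller values of θ concentrate more prior mass near zero while keeping heavy tails, which is the behavior described above.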

B Bridge sampling in Mixture of Experts Models
A first step in computing the bridge sampling estimate for the proposed model is to construct an importance density that approximates the modes of the posterior density. As the posterior density of a mixture model will have multiple modes, this problem turns out to be challenging. As one of the proposed model's benefits is that all posterior distributions are available in closed form, we can make use of the unsupervised importance density construction that has been suggested by Frühwirth-Schnatter (1995) and extended by Frühwirth-Schnatter (2004). The idea is to choose a random subsample of S posterior densities from the M available permuted MCMC draws and use them to automatically construct the importance density. As we use random permutation sampling, this importance density will be multimodal as well.
In practical terms, it is necessary to save the posterior distribution parameters of S randomly selected MCMC draws during the sampling process. Note that this implies that the S saved parameters of the posterior distributions are not part of the ex post identification procedure. Once a suitable number of importance density components S and of importance density draws L has been chosen, we can draw from the importance density, which is a uniform mixture of the S posterior densities. We implement this step as follows. For l = 1, ..., L:
1. Choose a random index out of the 1, ..., S saved posterior density parameters.
2. For k = 1, ..., K, generate one draw from the posterior densities with the parameters that have been randomly chosen in the previous step.
Iterate this procedure.
The obtained M MCMC draws and L importance density draws can be used in the recursive iteration scheme that has been described in Section 2.4.
To run the iterative process, several likelihoods have to be evaluated:
1. Evaluate the importance density draws in the prior densities.
2. Evaluate the importance density draws in the importance density.
3. Evaluate the complete data likelihood using the importance density draws.
4. Evaluate the MCMC draws in the prior densities.
5. Evaluate the MCMC draws in the importance density.
6. Evaluate the complete data likelihood using the MCMC draws.
For a detailed and more formal description of this procedure, see Frühwirth-Schnatter (2004) and Celeux et al. (2018).
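To show how the six evaluated quantities enter the recursion, the following sketch runs an iterative bridge sampling scheme on a toy problem with a known marginal likelihood. Several assumptions apply: the recursion is the standard iterative bridge sampling estimator written in logs, which may differ in detail from the scheme in Section 2.4; the toy model (normal data with known variance and a normal prior on the mean) has no latent allocations, so the ordinary likelihood replaces the complete data likelihood; and exact posterior draws stand in for MCMC output. All names are illustrative.

```python
import numpy as np

def log_sum_exp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def bridge_log_marginal(lpost_imp, lq_imp, lpost_mcmc, lq_mcmc, n_iter=20):
    """Iterative bridge sampling estimate of the log marginal likelihood.

    lpost_* : log prior + log likelihood at importance / MCMC draws
    lq_*    : log importance density at importance / MCMC draws
    """
    log_L, log_M = np.log(len(lq_imp)), np.log(len(lq_mcmc))
    # Start from the importance sampling estimate.
    log_p = log_sum_exp(lpost_imp - lq_imp) - log_L
    for _ in range(n_iter):
        num = lpost_imp - np.logaddexp(log_L + lq_imp, log_M + lpost_imp - log_p)
        den = lq_mcmc - np.logaddexp(log_L + lq_mcmc, log_M + lpost_mcmc - log_p)
        log_p = (log_sum_exp(num) - log_L) - (log_sum_exp(den) - log_M)
    return log_p

# Toy model: y_i ~ N(mu, 1), mu ~ N(0, 1), so the posterior is normal.
rng = np.random.default_rng(0)
n = 20
y = rng.normal(1.0, 1.0, size=n)
post_var = 1.0 / (n + 1.0)
post_mean = y.sum() / (n + 1.0)

M = L = 5000
mcmc_draws = rng.normal(post_mean, np.sqrt(post_var), size=M)  # stand-in for MCMC
q_var = 2.0 * post_var                                         # wider importance density
imp_draws = rng.normal(post_mean, np.sqrt(q_var), size=L)

def log_unnormalised_posterior(mu):
    log_lik = log_normal_pdf(y[None, :], mu[:, None], 1.0).sum(axis=1)
    return log_lik + log_normal_pdf(mu, 0.0, 1.0)

estimate = bridge_log_marginal(
    log_unnormalised_posterior(imp_draws), log_normal_pdf(imp_draws, post_mean, q_var),
    log_unnormalised_posterior(mcmc_draws), log_normal_pdf(mcmc_draws, post_mean, q_var),
)
# Closed-form log marginal likelihood of the toy model, for comparison.
exact = (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * np.log(n + 1.0)
         - 0.5 * (np.sum(y ** 2) - y.sum() ** 2 / (n + 1.0)))
print(estimate, exact)
```

In this sketch, evaluations 1 and 3 combine into `lpost_imp`, evaluation 2 gives `lq_imp`, evaluations 4 and 6 combine into `lpost_mcmc`, and evaluation 5 gives `lq_mcmc`.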

C Numerical Stability of Bridge Sampling Estimate
Depending on the sample size N, the number of MCMC draws M, the number of densities S chosen to construct the importance density and the number of importance density draws L, the vectors and matrices that result from evaluating the likelihoods can be large. Individual log-likelihood contributions may be small in absolute value (e.g. −0.1), but summing over a large number of them and exponentiating this sum is prone to numerical underflow. Therefore, we suggest a specific evaluation scheme that has proved numerically stable in our computations. It is based on the idea that we can rewrite the log of the bridge sampling estimate of the marginal likelihood as a double log sum of exponentials. Then we can exploit the following identity:
$$\log\Big(\sum_i e^{x_i}\Big) = \max_i(x_i) + \log\Big(\sum_i e^{x_i - \max_i(x_i)}\Big).$$
This LogSumExp function can be used to generate an exact and numerically stable estimate of the logarithm of the sum of exponential terms. To employ this function in the bridge sampling procedure, we rewrite the equation of the bridge sampling estimate as follows.
$\log p_{BS,t+1} = \ldots$ (outer LogSumExp),
where the evaluated log likelihoods and the LogSumExp function defined above can be used to generate estimates of the logarithm of the marginal likelihood that are reasonably robust to numeric under- and overflow.
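A small numerical example may help to see why the LogSumExp identity matters. The sketch below compares a naive evaluation with the max-shifted version for log terms of the magnitude that arises when many log-likelihood contributions are summed. The input values are arbitrary and chosen only to trigger underflow in double precision.

```python
import numpy as np

def log_sum_exp(x):
    """Compute log(sum_i exp(x_i)) by shifting with the maximum of x."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# Log terms far below the range where exp() is representable as a double.
x = np.array([-1000.0, -1001.0, -1002.0])

with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(x)))  # every exp() underflows to zero

stable = log_sum_exp(x)
print(naive, stable)
```

The naive version returns minus infinity because each exponential underflows, whereas the shifted version only exponentiates values that are at most zero and therefore stays finite.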

Figure 1: Posterior mean estimates vs. true values for different model setups (with dashed 45° line) for N = 3000.

... apply the proposed model to data compiled from the Demographic and Health Survey (DHS) for Mozambique from 2003. The DHS is a nationally representative household survey on a wide range of topics, including HIV information sources and various socioeconomic, geographic and health related variables. The dataset includes information on 11,922 women. Ten different information sources are used to cluster these women into groups and a set of around 40 external covariates enters the model to explain class membership. These variables cover socioeconomic characteristics like age and education, region of residence, relationship status and sexual behavior as well as poverty related measures and dwelling characteristics.

Figure 5: Posterior medians for multinomial logistic regression coefficients (Baseline:

... for simplicity and efficiency reasons. The Bayesian framework requires the specification of a prior on $\beta_k$. As we are interested in implicit variable selection (i.e. shrinking coefficients of unpromising explanatory variables to zero), we implement a modified version of the normal gamma prior, a global-local shrinkage prior introduced in Griffin and Brown (2010):

Table 1: Simulation Study Results

Table 3: Information source estimates for each cluster.
Note: Estimated values correspond to the posterior means of $\gamma_{k,j}$ and to the actual group sizes.

Table 6: Simulation Study Results for N = 100