Semiparametric finite mixture of regression models with Bayesian P-splines

Mixture models provide a useful tool to account for unobserved heterogeneity and are at the basis of many model-based clustering methods. To gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates. In this Paper, a semiparametric finite mixture of regression models is defined, with concomitant information assumed to influence both the component weights and the conditional means. In particular, linear predictors are replaced with smooth functions of the covariate considered by resorting to cubic splines. An estimation procedure within the Bayesian paradigm is suggested, where smoothness of the covariate effects is controlled by suitable choices for the prior distributions of the spline coefficients. A data augmentation scheme based on difference random utility models is exploited to describe the mixture weights as functions of the covariate. The performance of the proposed methodology is investigated via simulation experiments and two real-world datasets, one about baseball salaries and the other concerning nitrogen oxide in engine exhaust.

subpopulations, it may be reasonable to assume that the unknown parameters of such model may vary across these subpopulations. Finite Mixtures of Regression (FMR) models deal with this kind of data, whenever the information about subpopulation membership is missing (i.e., when subpopulation membership is a source of unobserved heterogeneity). Since their introduction (Goldfeld and Quandt 1973), mixtures of regression models have been extensively employed in many research fields (see, for example, Wedel and DeSarbo 1993;Wang et al. 1996;Turner 2000;Green and Richardson 2002;Ding 2006;Tashman and Frey 2009;Dyer et al. 2012;Van Horn et al. 2015;McDonald et al. 2016).
According to their basic formulation, FMR models are characterised by the socalled assignment independence: namely, it is assumed that subpopulation membership does not depend on the regressors. Finite Mixtures of Regression models with Concomitant covariates (FMRC), also known as mixtures of experts models (Jacobs et al. 1991), overcome this limitation by specifying not only the component conditional expected values but also the component weights as functions of two (sub)sets of regressors, which can be disjoint, coinciding, or overlapping. In particular, a multinomial logistic regression structure is commonly chosen to link the component weights to the regressors. Applications of FMRC models are described in the statistical, econometric and machine learning literature (see, for example, Weigend and Shi 2000;Lu 2006; Gormley and Murphy 2008;Villani et al. 2009;Lê Cao et al. 2010;Li et al. 2010Li et al. , 2011Frühwirth-Schnatter et al. 2012;Gormley and Frühwirth-Schnatter 2019;. It is worth mentioning that some of these applications consider multivariate regressors and/or multivariate dependent variables. Alternatively, Xu et al. (1994) and Ingrassia et al. (2012) show how assignment dependence in the conditional distribution of the dependent variable can be achieved by resorting to cluster-weighted models (Gershenfeld 1997). This latter approach, however, requires the specification of the joint distribution of both dependent variable and regressors (which is typically assumed to be a mixture, whose components are represented as the product between a component conditional distribution for the dependent variable and a component marginal distribution for the regressors).
In order to enhance the flexibility of FMR/FMRC models, recently several authors have focused their attention on providing semiparametric or nonparametric extensions of such models (see Xiang et al. 2019, for a recent review). In the context of models with Gaussian components, Young and Hunter (2010) and Huang and Yao (2012) have suggested FMRC models where the component weights are assumed to be smooth functions of a univariate covariate, while retaining a linear structure for the conditional expected values. This latter assumption has been relaxed by Huang et al. (2013), who have considered models with conditional expected values and conditional variances allowed to vary smoothly according to the value of a covariate. Furthermore, Xiang and Yao (2018) and Zhang and Zheng (2018) have proposed a semiparametric representation of the conditional expected values. It is worth noting that in all the papers just mentioned, estimation has been carried out through modified versions of the Expectation-Maximization (EM) algorithm.
To the best of the Authors' knowledge, none of these flexible models have been examined from a Bayesian perspective, despite the fact that Bayesian algorithms to estimate their parametric counterparts have been extensively studied in the literature (see, for example, Frühwirth-Schnatter 2006; Gormley and Frühwirth-Schnatter 2019). The aim of this Paper is to fill this gap by considering semiparametric FMRC models within the Bayesian framework. In particular, this Paper focuses on models where the log-odds of component weights and the conditional means are smooth functions of a univariate covariate. Following the approach detailed in Berrettini et al. (2021), Bayesian P-splines (Lang and Brezger 2004) are exploited to obtain a parsimonious representation of these smooth functions, and a new Gibbs sampler algorithm is developed to perform inference based on: (i) an adaption of a data augmentation scheme with a (partial) difference Random Utility Model (dRUM) representation; (ii) an approximation of the logistic distribution through a Gaussian mixture (Frühwirth-Schnatter et al. 2012). Differently from Berrettini et al. (2021), where mixtures of multinomials are considered and smoothness is only allowed on the component weights, whereas the other parameters are assumed constant, in this Paper: • models for the conditional distribution of continuous dependent variables are examined, • and the component means are also assumed to be a function of the covariate.
The remainder of the Paper is organised as follows. Model specification is provided in Sect. 2, while in Sect. 3 the associated Bayesian inference procedure is elicited; results from simulation studies are presented in Sect. 4, while the ones about real data applications are reported in Sect. 5. Finally, Sect. 6 is devoted to discussion and conclusions.

Model specification
Suppose {y i }, i = 1, . . . , n, is a random sample from a population clustered into G components, and that each observation i has an associated quantitative covariate x i . For simplicity, both y i and x i are assumed univariate throughout this Paper. Let c i ∈ {1, . . . , G} be the component indicator for the i-th unit having discrete distribution Pr(c i = g|x i ) = p g (x i ) > 0, for g = 1, . . . , G, such that G g=1 p g (x i ) = 1, for i = 1, . . . , n. In addition, suppose that, conditioning on c i and x i , y i follows a Gaussian distribution with mean μ c i (x i ) and variance σ 2 c i . It is further assumed that each μ g (·) is an unknown smooth function of the covariate x. Hence, given x i , the random variable y i follows a finite mixture of Gaussian components: where f N (·) denotes the Gaussian density function, and p 1 (·), . . . , p G (·) can be referred to as the component (or mixture) weights. Conditions for identifiability of Model (1), whose corresponding graphical representation is reported in Fig. 1, can be derived by Theorem 1 in Huang et al. (2013) by taking into account that each variance σ 2 g is assumed independent -and, thus, constant-with respect to covariate x: • p g (x) > 0 are continuous functions and μ g (x) are differentiable functions, for g = 1, . . . , G;  Jacobs et al. (1991) model the component weights p g (x i ) using a multinomial logistic regression model, expressing the log-odds of these probabilities, with respect to the reference one (e.g., the G-th), as linear functions of the covariate x. In this Paper, similarly to Berrettini et al. (2021), each of these G − 1 linear predictors is replaced with a smooth function of x, represented by a linear combination of m cubic B-spline bases B ρ (·) and coefficients γ gρ : By defining the n × m design matrix B, where the element in row i and column ρ is given by B ρ (x i ), and after some algebra, Eq.
(2) can be rewritten as: where γ g = (γ g1 , . . . , γ gm ) corresponds to the vector of unknown regression coefficients, where the exponential is applied elementwise. To guarantee identifiability, the vector of coefficients corresponding to the reference group G are all set equal to 0. Regarding the components' normal densities, each mean μ g (·) is also assumed to be an unknown smooth function of covariate x, represented through B-splines: with β g = (β g1 , . . . , β gm ) vector of unknown regression coefficients.

Bayesian inference
Adopting B-splines to represent a smooth function requires the specification of what is known as the number of knots (or, equivalently, the number of B-spline bases), which governs how the bases behave and the flexibility of the resulting function. In the Bayesian framework, Lang and Brezger (2004) suggest a large number of knots (between 20 and 40) to ensure enough flexibility; additionally, they show how to define priors for the regression parameters γ g1 , . . . , γ gm and β g1 , . . . , β gm in terms of a random walk: Using this representation is equivalent to inducing a penalisation, based on differences of adjacent B-spline coefficients, and leads to the definition of "penalised" B-splines, commonly abbreviated to "P-splines". Through this approach, the amount of smoothness is controlled by the variance parameters δ 2 g and τ 2 g : their presence protects against possible overfitting when a larger than needed number of knots is chosen. In particular, small values for δ 2 g and τ 2 g lead to approximately constant log-odds and conditional mean, respectively. Hyperpriors are assigned to the variances τ 2 g , δ 2 g and σ 2 g , selecting Inverse Gamma distributions IG(a, b), with a = 1 and a small value for b, for example b = 0.005, leading to almost diffuse priors. The priors in (5) and (6) can be equivalently written in the form of global smoothness priors: where the penalty matrix K is given by K = 1 1 , with 1 being the first order difference matrix ( Rue and Held 2005, Chapter 2, p. 52). Because K is rank deficient with rank(K) = m − 1 for a first-order random walk, the priors are improper. It is worth mentioning that in the literature these kind of priors are usually referred to as intrinsic Gaussian Markov random fields (Rue and Held 2005).
The multinomial model in Equation (2) can be conveniently represented as a binary formulation in the partial dRUM representation proposed by Frühwirth-Schnatter et al. (2012). Conditional on each λ g (x i ) = exp(η g (x i )), the random utilities are defined as where z gi are latent variables, D gi = 1(c i = g) are the allocation indicators and gi are i.i.d. errors following a Logistic distribution (g = 1, . . . , G, i = 1, . . . , n). Given λ 1 (x i ), . . . , λ G (x i ) and the latent indicator variables D 1i , . . . , D Gi , the latent variables z 1i , . . . , z Gi are distributed according to an Exponential distribution and can be easily sampled in a data augmented implementation. To avoid any Metropolis-Hastings step, Frühwirth-Schnatter and Frühwirth (2010) approximate, for each gi , the Logistic distribution by a finite scale mixture of H Gaussian distributions, with zero means and variances {s 2 1 , . . . , s 2 H } drawn with fixed probabilities {w 1 , . . . , w H }. The same authors obtained their finite scale mixture approximation by minimizing the Kullback-Leibler divergence between the densities, and recommend choosing H = 3 in larger applications, where computing time matters, and to work with H = 6 whenever possible. In a second step of data augmentation, the component indicators r gi (g = 1, . . . , G − 1, i = 1, . . . , n), each taking value h = 1, . . . , H , are introduced as yet another level of latent variables. Conditional on z g = (z g1 , . . . , z gn ) and r g = (r g1 , . . . , r gn ) , the binary logit regression model (7) reduces to a Gaussian regression model.

MCMC algorithm
Based on the representation of Sect. 3, a new MCMC algorithm is implemented, for a fixed G, by integrating the scheme proposed by Frühwirth-Schnatter et al. (2012) with the Bayesian P-spline approach by Lang and Brezger (2004), similarly to Berrettini et al. (2021). A sketch of the algorithm is comprised of the following steps: 1. Sample the regression coefficients' vector γ g conditional on z g and r g , g = 1, . . . , G − 1. Using the prior in Equation (5), the full conditional of γ g is given by a multivariate Gaussian density. Straightforward calculations (Brezger and Lang 2006) show that the precision matrix P g and the mean m g of γ g |· are given by respectively, where W g is a n × n diagonal matrix with nonzero elements equal to the randomly drawn variances (ω 1g = s 2 r g1 , . . . , ω ng = s 2 r gn ) for the g-th group, the i-th element of λ −g (x) is l =g λ l (x i ) and the logarithm is applied elementwise; 2. Sample the G − 1 variance parameters δ 2 g conditional on γ g : 3. For each unit i = 1, . . . , n, sample all the (partial) differences of utilities z 1i , . . . , z G−1,i simultaneously from: with U gi ∼ Unif(0, 1); 4. For g = 1, . . . , G − 1 and i = 1, . . . , n, sample the component indicators r gi conditional on z gi from: 5. Sample the regression coefficients' vector β g , g = 1, . . . , G from a multivariate Gaussian density with covariance matrix V g and mean ν g where the superscript (g) is applied throughout this section to any matrix or vector to indicate the rows of that matrix (or the elements of that vector) corresponding to the units allocated to the g-th group; 6. Sample the G variance parameters τ 2 g conditional on β g : 7. Sample the G variance parameters σ 2 g conditional on μ g (x) = Bβ g : 8. Classify each unit i according to Bayes' rule: draw D gi (g = 1, . . . , G, i = 1, . . . , n) from the following discrete probability distribution which combines the likelihood and the prior: It is worth mentioning that Steps 1 to 4 of this algorithm are similar to those proposed by Berrettini et al. (2021) to sample the parameters related to the component weights, while Steps 3 to 5 are specific to the models considered in this Paper, and are needed to sample the parameters associated with the component means.

Label switching, posterior inference and model selection
The MCMC algorithm described in the previous section can be prone to label switching (see Frühwirth-Schnatter 2006, Section 3.5 for a review). A possible solution to deal with this problem, which exploits k-means clustering (with G clusters) of the posterior draws to identify a unique labeling, has been proposed by Frühwirth-Schnatter et al. (2012). This solution is readily available to the semiparametric FMRC models introduced in this Paper.
Posterior inference is carried out after completing the prefixed number T of iterations. As usual, parameter's posterior mean can be computed considering averages of the last T − T 0 draws of the chains, with T 0 defining the burn-in phase. As far as the smooth functions are concerned, posterior quantities are obtained by exploiting their representation as linear combinations of spline bases and the corresponding regression coefficients' estimates. Pointwise percentiles (usually 2.5-97.5 or 5-95) computed over the last T − T 0 posterior draws can be used to quantify uncertainty associated to the estimated smooth functions.
The Maximum-A-Posteriori (MAP) rule is adopted to partition the observations into G groups, by allocating them to the G components. In particular, each unit i = 1, . . . , n is assigned to the componentĉ i such that Occasionally, the use of the MAP rule can lead to empty groups, when one or more components could have no units assigned to them. In such situations, it might be worth distinguishing between the number of components G and the number of nonempty components, denoted asG is the number of observations assigned to group g (g = 1, . . . , G).
A relevant issue related to mixture models is the choice of the number of components, which originated many efforts in the statistical literature. The proposed MCMC algorithm requires the value of G to be fixed in advance. Thus, the algorithm should be run for different values of G and the obtained results should be compared in order to select the optimal number of components. Several model selection criteria are available to perform these comparisons (see Celeux et al. 2019, for a recent review). Many of these criteria require the determination of the number of free parameters of each candidate model. However, the quantification of this number for the semiparametric FMRC models described in this Paper can be difficult due to the regularisation induced by the prior distributions on the spline coefficients. A solution to circumvent this problem is proposed by Raftery et al. (2007). They suggest the use of 2s 2 l as an estimate of this unknown quantity, where s 2 l is the sample variance of the log-likelihoods computed as D i denoting the vector of estimated parameters for the component unit i is allocated to, at iterations t = T 0 , . . . , T , after the burn-in. Using this estimate, they derive two model selection criteria, whose values depend only on the log-likelihoods from the posterior simulation, that are readily available: wherel is the sample mean of the sequence of log-likelihoods l (t) , for each iteration t = T 0 , . . . , T , after the burn-in. As pointed out by Raftery et al. (2007), the AICM is connected to the DIC criterion (Spiegelhalter et al. 2002). More specifically, it coincides with the DIC definition provided by Gelman et al. (2004, Sect. 6.7). Concerning the BICM, it can be related to an approximation of the log-marginal likelihood. Successful applications of the AICM in the mixture modelling context are described, for example, by Erosheva et al. (2007), Gormley and Murphy (2010), Gormley and Murphy (2011), and Mollica and Tardella (2017). The BICM is exploited, for example, by Ranciati et al. (2017),  and Redivo et al. (2020).

Simulation study
The performance of the proposed approach is investigated in a simulated environment, considering two scenarios that differ in terms of the true number of components and the distribution of the manifest variable. In both scenarios, the manifest variable y and the concomitant covariate x are assumed to be univariate, for simplicity. The quality of the estimates for the covariate effects on the conditional means is evaluated through a comparison between the true effects and their posterior estimates, after fitting each of the following mixture of regression models: • Semiparametric Finite Mixture of Regressions models with Concomitants (SFMRC), with flexible specification of both the mixture weights π g (x) and the conditional means μ g (x), g = 1, . . . , G; • Semiparametric Finite Mixture of Regression (SFMR) models, with constant mixture weights π g and flexible specification of the conditional means μ g (x), g = 1, . . . , G; • (parametric) FMRC, with linearity assumption for the effect of x on both the log-odds of the mixture weights log(π g (x)/π G (x)) = η g (x) and the conditional means μ g (x), g = 1, . . . , G; • (parametric) FMR, with constant mixture weights π g and linearity assumption for the effect of x on the conditional means μ g (x), g = 1, . . . , G.
Additionally, by adapting to the univariate Gaussian case the models discussed in Berrettini et al. (2021), two mixture models with concomitants are also considered: • Semiparametric Finite Mixture models with Concomitants (SFMC), with flexible specification of the mixture weights π g (x) and constant conditional means μ g , g = 1, . . . , G; • (parametric) FMC, with linearity assumption for the effect of x on the log-odds of the mixture weights log(π g (x)/π G (x)) = η g (x) and constant conditional means μ g , g = 1, . . . , G; For every class of models, G is initially set equal to the true number of components.
In particular, the pointwise means of the estimated μ g (x * ), denotedμ g (x * ), are plotted together with the pointwise 2.5 and 97.5 percentiles among all samples, where {x * i }, i = 1, . . . , n, are grid points taken evenly in the range of covariate x. To quantitatively assess the performance of the estimators of the unknown regression functions μ g (x), the same measure employed in Huang and Yao (2012), Huang et al. (2013) and Xiang and Yao (2018) is adopted, that is the square Root of the Average Squared Errors (RASE), computed as in practice, the RASE measures the (Euclidean) pointwise distance between the "true" curve and the estimated one. The same graphical and quantitative evaluations are carried out for the covariate effects on the mixture weights, this time by restricting the analysis to the class of semiparametric and parametric models with concomitants. Regarding the clustering performance, a comparison between true allocations and inferred ones is made in terms of Adjusted Rand Index (ARI) (Hubert and Arabie 1985) and soft ARI (sARI) (Flynt et al. 2019). For each method and each value of G, 4000 MCMC draws are simulated after a burn-in of as many draws. Both AICM and BICM are considered to select the optimal number of components, and the number of nonempty componentsG is computed according to Equation (17). For each of the competing classes of models, a proper MCMC algorithm has been implemented in R (R Core Team 2020). The R codes for the four algorithms are available on GitHub at the following link: github.com/MarcoBerrettini/sMoE.

First simulation experiment: G=2
A batch of 100 independent datasets is generated with n = 1000 from a two-component mixture of regression models with weights where x is the only covariate, sampled from a standard uniform distribution: x i ∼ Unif(0, 1), i = 1, . . . , 1000. The functional form of η 1 (x) = log π 1 (x) 1−π 2 (x) , coupled with the specific range of values for x i , leads to a nonmonotonic concave log-odds.
Conditional on x and the component indicators, each component density is a Gaussian distribution, with means μ 1 (x), μ 2 (x), and variances σ 2 1 , σ 2 2 given by: Figure 2 shows one of the 100 independent samples. Figure 3 highlights the limits of the parametric approach when a nonmonotonic function, symmetric about x = 0.5, has to be approximated. In particular, for fixed number of components G = 2, both FMC and FMRC tend to fit a constant function with an associated RASE η (and its standard deviation), averaged over the 100 simulations, equal to 2.402 (0.209) and 10.091 (31.020), respectively. On the other hand, SFMRC seems to catch the underlying trend even though some oversmoothing is present around the peak of the function. For this model, the average RASE η drops to 0.209, with standard deviation of 0.665. A peculiar behaviour emerges when examining the estimates for η 1 (x) obtained with SFMC, characterised by an average RASE η equal to 4.646 (and a standard deviation equal to 0.776). This might be caused by an unsuccessful attempt to counterbalance the evident model misspecification, related to the assumption of constant conditional means, by using the flexible mixture weights' specification.
Regarding the estimated conditional means, the SFMRC shows good performance in the left panel of Fig. 4, apart from some oversmoothing in the lower component μ 2 (x), for central values of x. In this area, the probability of observing units from component 2 reaches its minimum, as previously shown in Fig. 3. Around this region, most of the observations come from component 1, with only few observations from component 2. This disproportion, coupled with a certain degree of overlap of the two components, seems to have led the MCMC algorithm to assign erroneously some units from component 1 to component 2, with a consequent slight upward bias inμ 2 (x). This explains also the oversmoothing observed when estimating the effect of the covariate x on the log-odds η 1 (x) of the mixture weights. This issue becomes way more evident if constant weights are assumed without considering the effects of the concomitant covariate x, as for the SFMR; see the second plot in Fig. 4. Again, the main problem regards mostly the lower component, whose true mean is not fully included in the bands, even though they widen considerably in the overlap region.
No assumption of constant weights is made when fitting the FMRC, but, as previously shown in Fig. 3, this model estimates a constant effect of the covariate, making it practically equivalent to a FMR. This is evident in Fig. 5, where the conditional means estimated by the two models are compared. Since these two functions are generated to be quadratic and symmetric about x = 0.5, both parametric regression models fit horizontal lines, effectively collapsing to a simple mixture of Gaussians not involving the effect of the covariate for the conditional distribution of the dependent variable, thus leading to estimated conditional means very close to those obtained with FMC and SFMC (Fig. 6). It is worth mentioning that the fitted constant means obtained by these four class of models are not even centered around the true average group means. Table 1 summarises this comparison among the conditional mean functions estimated by the six competing models from a quantitative point of view, by displaying, for each combination of method and component g = 1, 2, the average RASE μ g and the standard deviation over 100 simulated datasets. Quality of the estimates are strictly related to the quality of the allocations, as Table 2 confirms. The SFMRC, in fact, outperforms its competitors in terms of AICM, BICM, ARI and sARI for fixed number of components G = 2, followed by the SFMR. Given the previous considerations about the results, both the parametric approaches and those assuming constant conditional means prove to be not satisfactory in this simulation setting. A comparison among the six competing models is performed also by examining the best models selected according to AICM and BICM when considering a number of components ranging from 1 to 4. Table 3 reports the distribution of the number of nonempty componentsG resulting from the selection based on AICM.G = 2 is always the best choice according to the SFMRC, while 26 times out of 100 the SFMR provides   Table 3 to those in Table 4, the tendency of the BICM to select models with fewer nonempty components is apparent. This seems to be consistent with what has been already reported in the literature (see, for example, Redivo et al. 2020). Only SFMRC models do not seem to suffer from this systematic underestimation.
Examining the best models (according to AICM) fitted with each method for each simulated dataset, rather than fixing the number of components, the conclusions does not seem to change. All AICM averaged values reported in Table 5 improve with respect to those in Table 2, also for the SFMRC, because sometimes having an extra component, even if it is emptied during the posterior allocation, slightly decreases AICM. For the same reason, both the average ARI and sARI appear to slightly improve for the SFMRC, while they worsen for models that tend to pick the wrong number of components; see Table 5. Similar conclusions can be drawn when considering the best models selected using BICM (data not shown).

Second simulation expertiment: G=3
A batch of 100 independent datasets is generated with n = 1000 from a threecomponent mixture of regression models, with log-odds of mixture weights η g (x) = log π g (x)/π 3 (x), g = 1, 2, defined as: where x is the only covariate, sampled from a uniform distribution: x i ∼ Unif(0, 1), i = 1, . . . , 1000. Conditional on x and the component indicators, y follows a univariate Gaussian distribution, with means μ 1 (x), μ 2 (x), μ 3 (x), and variances σ 2 1 , σ 2 2 , σ 2 3 ,  Figure 7 shows one of the 100 independently generated samples for this second scenario. Figure 8 shows that the SFMRC is able to catch almost perfectly the effects of the covariate x on both predictors η 1 and η 2 . On the contrary, due to nonlinearity, the linear approximation by the FMRC is worse, so that the true effects exceed the bands at the boundaries of the range of x. As expected, the average RASE scores for the SFMRC (together with the associated standard deviations) reported in Table 6 are lower than those of the parametric competitor. Table 6 contains also information about the performance of SFMC and FMC. In this second simulation experiment, it is evident that imposing constant conditional means has a dramatic impact on the ability of these two class of models in recollecting the effect of the covariates on the mixture weights: the average RASE associated with these two classes of mixture models with concomitants are almost ten times larger than those obtained with SFMRC.
Regarding the estimates of the conditional means, the SFMRC seems to outperform the competitors, despite some overlap present between Cluster 1 and Cluster 3 for low values of x, and between Cluster 2 and Cluster 3 for high values of x (see Fig. 9). The FMRC is unable to properly approximate the nonlinear trends, especially where there are fewer observations (i.e. in Cluster 1 for high values of x, and in Cluster 2 for low values of x). Nevertheless, Fig. 10 shows that, thanks to the good estimates of the mixture weights, the FMRC discriminates almost perfectly among groups in the aforementioned overlapping areas.
Results for SFMR are reported, in detail, in Fig. 11. The flexibility allowed for the estimates of the conditional means, combined with the impossibility to include the  effect of covariate x into the estimates of the mixture weights, results into overlapping estimated functions and wide bands. The performance of the FMR is just slightly worse with respect to the ones obtained by the FMRC. The main differences can be observed in the overlapping regions of Fig. 12, where the estimated conditional means intersect each other. As far as SFMC and FMC are concerned, these two class of models show a similar behaviour in the estimated (constant) conditional means (Figs. 13 and 14). Due to the specific settings considered in this second experiment, the impact of the mispecification error seems less severe. However, it is apparent that both models suffers from a large sampling variability in the estimated conditional means. This is particularly evident in the estimates for component 3. It is worth noting that component 3 is used as baseline to define the log-odds. Thus, the extremely large variability in the estimates for the conditional mean of this component can be connected with the previously mentioned poor performance of SFMC and FMC in estimating the logodds. All the conclusions drawn from a graphical point of view are confirmed by the quantitative results in terms of RASE μ reported in Table 7. Table 8 shows that, in terms of both AICM and BICM, the SFMRC is evidently better than its competitors with G = 3. However, this result does not correspond to an equal gap in the quality of the allocations, expressed in terms of both ARI and sARI. Indeed, either mixture of regression models with concomitants perform well, even though the semiparametric one slightly prevails.
Regarding model selection, for each competing method and each simulated dataset, the best model is considered among different mixture models with G = 1, . . . , 5.   Table 9 shows that, when the selection is based on AICM, the SFMRC is the only one which is able to consistently pick the correct number of nonempty components. This leads to more favorable results for the SFMRC, if ARI and sARI computed with reference to the best models selected by each method are compared; see Table 10. On the contrary, results worsen when BICM is considered as model selection criterion. As shown in Table 11, the tendency of BICM to underestimate the actual number of nonempty components is confirmed, and in this second simulation experiment also SFMRC models suffer from this. As a consequence, the values of ARI and sARI computed for the best models selected using BICM are negatively impacted (data not shown). Watnik (1998) provides a dataset consisting of information about players for the 1992 Major League Baseball season. In particular, their salaries are considered as the response, along with measures of the 337 players' previous year's performances. Notice that this dataset is already well known in the literature on FRMC (see, for example, Khalili and Chen 2007;Chamroukhi and Huynh 2018). For simplicity, the analysis described in this section focuses on one of the quantitative covariates, the number of runs, taken as a measure of a player's contribution to the team. More specifically, the effect of this variable on player salaries is studied, by fitting the six different mixture of regression models considered in Sect. 4 for a fixed number of components ranging from G = 1 to G = 4. As suggested by Watnik (1998), due to asymmetry, the response is preemptively transformed by taking the natural logarithm.

Baseball salaries data
In light of the tendency of BICM to underestimate the number of nonempty components highlighted in the simulation experiments, in this application the optimal value for G is selected according to AICM. The number of nonempty componentsG resulted to be equal to 2 for the two semiparametric mixture of regression models (SFMRC and SFMR) as well as the two mixture of Gaussians (SFMC and FMC), and equal to 1 for the two parametric mixture of regression models (FMRC and FMR). Among these six models, the SFMRC presents the smallest AICM (663.4), followed by the SFMR (733.7), the SFMC (781.1) and the FMC (817.3), while the remaining best parametric mixture of regression models, having G = 1, collapse to the same model, with the highest AICM (888.1).
As Fig. 15 shows, the main difference between the semiparametric mixture of regression models seems to be related to the allocation of players with a low number of runs. The SFMRC keeps the two clusters well separated, by assigning all of these units to the lower one, whereas the SFMR creates some overlap, such that the functions describing the conditional means,μ 1 (x) andμ 2 (x), almost intersect one another. Figure 15 shows, in both cases, the presence of a nonlinear effect of the number of runs on the log-salary for the upper cluster, while the bands does not exclude a linear effect for the lower cluster.
Fixing the number of components G = 2, the FMR allocates the players similarly to the SFMRC. In particular, only 9 units out of 337 (ARI = 0.894) differ in cluster allocation between the two models. Focusing on the parametric regression approaches, the main difference between the allocations seems to be related to few among the lowest paid players having a number of runs ranging between 30 and 90, which are assigned to the upper component by the FMRC. This probably induces variability in the estimated mean functions of the latter, which present wider bands, if compared to the ones estimated by FMRC; see Fig. 16. The two approaches with constant conditional means produce a sensibly different partition with respect to the other methods, and similar to each other, detecting a cluster of highly paid players (around 500.000 dollars or more); see Fig. 17. Here, the boundary (not drawn) separating the two groups is not perfectly horizontal due to the covariate effect on the log-odds of the weights; see Fig. 18. In particular, all four mixture models with concomitants agree about the presence of a decreasing trend in the effect of the number of runs on the log-odds of the mixture weight η 1 (x), but the semiparametric methods estimate nonlinear functions that cannot be approximated properly by a straight line. The ability to pick this underlying effect is also the main reason for the differences observed between the performances of the two semiparametric regression approaches.
The partition induced by the SFMRC identifies a cluster, the lower one (in green), which might be broadly interpreted as the cluster of "underrated" (or "underpaid", with respect to the others) baseball players. In fact, while it is obvious players with better performances get paid more, as corroborated by the increasing trends of both means, there seems to be a group of players whose salary is substantially lower than that of players achieving similar performances (in terms of number of runs), belonging to the upper group (in blue). Indeed, the two estimated mean functionsμ 1 (x) andμ 2 (x) in Fig. 15 appear almost parallel. A partial explanation of this result can be found in some additional pieces of information available in the dataset. In particular, there is a variable indicating the "free agency eligibility" of each player, i.e. if that player could have gone to a team of his choice in 1992. At the time - Watnik (1998) explains-only players with a certain amount of experience were eligible for free agency (134 out of 337) and, thus, able to market themselves to the highest bidder. On the contrary, if a player not "free agency eligible" wanted to play, he had to accept what his team was willing to pay him, or go with his team to an appointed "arbitrator", who would choose between the player's suggested salary and the team's one. However, "arbitration eligibility", which is included in the dataset as a variable as well, was for players (65 out of 337, in the dataset) who had some experience in the League, although not enough to be eligible for free agency. For interpretation purpose, the two above described categories, "free agency eligible" and "arbitration eligible" players are merged, and Table 12 compares the partition induced by SFMRC with the one obtained by distinguishing between (free agency or arbitration) eligible and noneligible players. The resulting ARI (0.626) is the highest observed among the six models. Indeed, it can be noticed that almost all the eligible players (193 out of 199) belong to the upper (blue) cluster, together with 29 players who apparently had been able to obtain an "adequate" salary without probing the market.
Rather than using the additional information on eligibility to validate the clustering results, without including it into the models (and thus, treating it as a potential source of unobserved heterogeneity), this binary variable could be explicitly included into the models as an additional (binary) covariate. This analysis is focused only on the two semiparametric mixture of regression models (SFMRC and SFMR). Not surprisingly, the inclusion of this additional covariate leads the AICM to reduce the optimal number of components to G = 1 for both of them. This is consistent with the fact that the two conditional means as estimated by the two-component SFMRC (without the binary covariate) are almost parallel, and can be reasonably approximated by a single curve plus a vertical shift depending on the value taken by the binary covariate. In addition, this simplification in the model resulted in an AICM value lower than the one associated with the two-component SFMRC. This second part of the application should allow the Reader to appreciate the efficacy of SFMRC models in detecting possible sources of unobserved heterogeneity. Appendix A contains details about how to extend both the specification of the model given in Sect. 2 and the MCMC algorithm provided in Sect. 3.1 in order to include a binary covariate.

Nitrogen oxide data
First introduced by Brinkman (1981), this dataset includes 88 observations about the concentration of nitric oxide in engine exhaust and the equivalence ratio, which represents a measure of the richness of the air-ethanol mix, for burning ethanol in a single-cylinder automobile test engine. Mixtures of regression models have been already fit to these data in Xiang and Yao (2018), where the equivalence ratio has been considered as the dependent variable y, and the concentration of nitric oxide is taken as concomitant covariate x. In this Paper, a similar analysis is performed using the proposed flexible Bayesian approach. Although the two-component structure seems quite clear in the scatterplot, six algorithms with different levels of flexibility have been run for 10000 iterations each (burn-in: 5000 draws) with G ranging from G = 1 to G = 4. According to AICM, all the regression models considered find two nonempty components, while the remaining methods find three. Figure 19 shows the covariate effects as estimated by the proposed semiparametric mixture of regressions with concomitant covariates. More specifically, in the left panel the estimated conditional means are reported, while in the right one the estimated log-odds η 1 (x) can be observed. Both plots are in line with the results by Xiang and Yao (2018) and indicate the lack of need for a flexible specification of any covariate effect in this case. In fact, the bands do not exclude linearity, although some mild nonlinearity seems to be present in the conditional means. This shows the efficacy of the penalisation induced by the selected priors in adjusting the proposed model to the required level of complexity. Coherently, AICM points at SFMR as the model to be preferred; see Table 13.

Conclusions
In this Paper, a general specification of mixture of regression models is proposed, allowing both component weights and conditional means to be nonlinear functions of a covariate. This general approach resort to spline functions for approximating the smooth effect of the concomitant variables. Parameter estimation is based on a Bayesian approach through MCMC machinery. In principle, the Reader might question whether a full parametric approach, e.g. by considering a monomial set of bases to represent the map between component probabilities and covariates, could prove to be flexible enough to catch nonlinearity. Unfortunately, this parametric representation would require some arbitrary choices, such as the maximum degree for monomial bases, or the definition of an automatic selection criterion. The approach advocated for in this Paper bypasses this issue by controlling flexibility through the variance parameters of the spline coefficients, following Lang and Brezger (2004).
Using simulation experiments, the proposed method has proved to be a useful tool for recovering the underlying covariate effects -especially if indeed not linearand, consequently, for estimating models with a better goodness of fit and leading to a more accurate allocation. The potential of the proposal has been illustrated also through applications to real data.
Although the results shown seem encouraging, the proposed model is characterised by some limitations and there are some issues that might deserve further investigation.
Firstly, there are limitations related to the assumption for both the manifest variable and the concomitant covariate to be univariate. The adaptation to the multivariate case would require particular attention to deal with the presence of component-specific covariance matrices and multiple regressors. Regarding the latter, each predictor could be expressed as a sum of smooth functions, as in the additive paradigm introduced by Hastie and Tibshirani (1990); hence, Bayesian P-splines could be used to approximate such nonlinear functions. However, conditions for identifiability should be revised, and, in particular, further constraints should be introduced to guarantee identifiability of the predictors.
In addition, the simulation experiments have highlighted the tendency of BICM to underestimate the actual number of components. As previously mentioned, this seems coherent with other results already reported in the literature (see, for example, Redivo et al. 2020). Since BICM has been introduced as an approximation of the (logarithm of) the marginal likelihood, it would be interesting to consider other estimators of this quantity. For example, Frühwirth-Schnatter (2019) has recently proposed two bridge sampling estimators of the marginal likelihood for FMRC models, and investigating their performance in the context of the semiparametric models proposed in this Paper could be the subject of future investigation.
Furthermore, adopting a probit representation (Geweke and Keane 2007) instead of the dRUM approach could provide another potential avenue to explore, in order to understand the benefit of the proposed modelling framework, especially when the ease of interpretation given by the logit formulation is not of relevance.
Alternative approaches to penalise the spline coefficients associated with the bases could be considered. For example, one could consider applying shrinkage to the variances τ 2 g and δ 2 g in a similar way to what Bitto and Frühwirth-Schnatter (2019) and Cadonna et al. (2020) suggest in the context of time varying parameter models.
As far as the computational implementation is concerned, as a consequence of the use of mixture of Gaussians to approximate the Logistic distribution, no Metropolis-Hastings steps are required in the proposed MCMC algorithm. Although this can be considered an advantage, the increase in the computational burden due to the introduction of an additional latent variable should not be ignored. Furthermore, the implemented MCMC algorithm relies on the specification of a fixed value for the number of components. If this quantity is unknown, it is necessary to estimate it by running the algorithm many times with different inputs, which might be time consuming, especially when the "true" value is large. One solution could be incorporating the choice of G within the algorithm itself. For instance, a reversible jump MCMC algorithm could be exploited (Richardson and Green 1997), by designing appropriate dimension-changing moves. Alternatively, the issue of choosing the optimal value for G could be circumvented by focusing the attention on the posterior distribution of the number of nonempty components, through the combination of a large value for G with appropriate prior distributions, as suggested in Malsiner-Walli et al. (2016). This latter strategy seems more coherent with the peculiar behaviour observed in the simulation studies, where the proposed MCMC algorithm occasionally converges to a solution that is characterised by empty components.
Funding Open access funding provided by Alma Mater Studiorum -Universitá di Bologna within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Including an additional binary covariate
Let d i ∈ {0, 1} be a binary covariate associated to unit i = 1, . . . , n. Then, Equations (2) and (4) can be respectively rewritten as follows: with ξ g and ζ g unknown regression coefficients. Assuming a diffuse Gaussian prior N (0, ψ 2 g ) on such parameters, with variance ψ 2 g set sufficiently high (e.g., 100), leads to the following modifications to be applied to points 1 and 5 of the Gibbs sampler introduced in Sect. 3.1: 1a. For g = 1, . . . , G − 1, sample the regression coefficients' vector γ g conditional on z g and r g from a multivariate Gaussian density with precision matrix P g and