Introduction

Regression analysis represents one of the most popular tool to investigate the effect of a set of regressors/covariates on a dependent variable. In this context, a (possibly linear) regression model is usually specified to describe the conditional expected value of the dependent variable given the values of the regressors. When data come from different subpopulations, it may be reasonable to assume that the unknown parameters of such model may vary across these subpopulations. Finite Mixtures of Regression (FMR) models deal with this kind of data, whenever the information about subpopulation membership is missing (i.e., when subpopulation membership is a source of unobserved heterogeneity). Since their introduction (Goldfeld and Quandt 1973), mixtures of regression models have been extensively employed in many research fields (see, for example, Wedel and DeSarbo 1993; Wang et al. 1996; Turner 2000; Green and Richardson 2002; Ding 2006; Tashman and Frey 2009; Dyer et al. 2012; Van Horn et al. 2015; McDonald et al. 2016).

According to their basic formulation, FMR models are characterised by the so-called assignment independence: namely, it is assumed that subpopulation membership does not depend on the regressors. Finite Mixtures of Regression models with Concomitant covariates (FMRC), also known as mixtures of experts models (Jacobs et al. 1991), overcome this limitation by specifying not only the component conditional expected values but also the component weights as functions of two (sub)sets of regressors, which can be disjoint, coinciding, or overlapping. In particular, a multinomial logistic regression structure is commonly chosen to link the component weights to the regressors. Applications of FMRC models are described in the statistical, econometric and machine learning literature (see, for example, Weigend and Shi 2000; Lu 2006; Gormley and Murphy 2008; Villani et al. 2009; Lê Cao et al. 2010; Li et al. 2010, 2011; Frühwirth-Schnatter et al. 2012; Gormley and Frühwirth-Schnatter 2019; Murphy and Murphy 2020). It is worth mentioning that some of these applications consider multivariate regressors and/or multivariate dependent variables. Alternatively, Xu et al. (1994) and Ingrassia et al. (2012) show how assignment dependence in the conditional distribution of the dependent variable can be achieved by resorting to cluster-weighted models (Gershenfeld 1997). This latter approach, however, requires the specification of the joint distribution of both dependent variable and regressors (which is typically assumed to be a mixture, whose components are represented as the product between a component conditional distribution for the dependent variable and a component marginal distribution for the regressors).

In order to enhance the flexibility of FMR/FMRC models, recently several authors have focused their attention on providing semiparametric or nonparametric extensions of such models (see Xiang et al. 2019, for a recent review). In the context of models with Gaussian components, Young and Hunter (2010) and Huang and Yao (2012) have suggested FMRC models where the component weights are assumed to be smooth functions of a univariate covariate, while retaining a linear structure for the conditional expected values. This latter assumption has been relaxed by Huang et al. (2013), who have considered models with conditional expected values and conditional variances allowed to vary smoothly according to the value of a covariate. Furthermore, Xiang and Yao (2018) and Zhang and Zheng (2018) have proposed a semiparametric representation of the conditional expected values. It is worth noting that in all the papers just mentioned, estimation has been carried out through modified versions of the Expectation-Maximization (EM) algorithm.

To the best of the Authors’ knowledge, none of these flexible models have been examined from a Bayesian perspective, despite the fact that Bayesian algorithms to estimate their parametric counterparts have been extensively studied in the literature (see, for example, Frühwirth-Schnatter 2006; Gormley and Frühwirth-Schnatter 2019). The aim of this Paper is to fill this gap by considering semiparametric FMRC models within the Bayesian framework. In particular, this Paper focuses on models where the log-odds of component weights and the conditional means are smooth functions of a univariate covariate. Following the approach detailed in Berrettini et al. (2021), Bayesian P-splines (Lang and Brezger 2004) are exploited to obtain a parsimonious representation of these smooth functions, and a new Gibbs sampler algorithm is developed to perform inference based on: (i) an adaption of a data augmentation scheme with a (partial) difference Random Utility Model (dRUM) representation; (ii) an approximation of the logistic distribution through a Gaussian mixture (Frühwirth-Schnatter et al. 2012). Differently from Berrettini et al. (2021), where mixtures of multinomials are considered and smoothness is only allowed on the component weights, whereas the other parameters are assumed constant, in this Paper:

  • models for the conditional distribution of continuous dependent variables are examined,

  • and the component means are also assumed to be a function of the covariate.

The remainder of the Paper is organised as follows. Model specification is provided in Sect. 2, while in Sect. 3 the associated Bayesian inference procedure is elicited; results from simulation studies are presented in Sect. 4, while the ones about real data applications are reported in Sect.  5. Finally, Sect. 6 is devoted to discussion and conclusions.

Model specification

Suppose \(\{y_i\}, i = 1,\dots , n,\) is a random sample from a population clustered into G components, and that each observation i has an associated quantitative covariate \(x_i\). For simplicity, both \(y_i\) and \(x_i\) are assumed univariate throughout this Paper. Let \(c_i\in \{1,\dots ,G\}\) be the component indicator for the i-th unit having discrete distribution \(\mathrm {Pr}(c_i=g|x_i)=p_g(x_i)>0\), for \(g=1,\dots ,G\), such that \(\sum _{g=1}^G p_g(x_i)=1\), for \(i=1,\dots ,n\). In addition, suppose that, conditioning on \(c_i\) and \(x_i\), \(y_i\) follows a Gaussian distribution with mean \(\mu _{c_i}(x_i)\) and variance \(\sigma _{c_i}^2\). It is further assumed that each \(\mu _g(\cdot )\) is an unknown smooth function of the covariate x. Hence, given \(x_i\), the random variable \(y_i\) follows a finite mixture of Gaussian components:

$$\begin{aligned} f(y_i|x_i)=\sum _{g=1}^G p_g(x_i)f_{\mathcal {N}}\left( \mu _g(x_i),\sigma _g^2\right) , \end{aligned}$$
(1)

where \(f_{{\mathcal {N}}}(\cdot )\) denotes the Gaussian density function, and \(p_1(\cdot ),\dots , p_G(\cdot )\) can be referred to as the component (or mixture) weights. Conditions for identifiability of Model (1), whose corresponding graphical representation is reported in Fig. 1, can be derived by Theorem 1 in Huang et al. (2013) by taking into account that each variance \(\sigma _g^2\) is assumed independent – and, thus, constant—with respect to covariate x:

  • \(p_g(x)>0\) are continuous functions and \(\mu _g(x)\) are differentiable functions, for \(g = 1,\dots ,G\);

  • for any \(h = j\), \(\sigma _h^2 \ne \sigma _j^2\), or, if there exists \(h = j\) such that \(\sigma _h^2 = \sigma _j^2\) , then \(||{\mu _h(x)-\mu _j(x)}||+||{\mu '_h(x)- \mu '_j(x)}|| \ne 0\) for any value of x;

  • the domain \({\mathcal {X}}\) of x is an interval in \({\mathbb {R}}\).

Fig. 1
figure 1

Graphical model representation of the FMRC model in Eq. (1); grey-colored circle represent observed quantities

Jacobs et al. (1991) model the component weights \(p_g(x_i)\) using a multinomial logistic regression model, expressing the log-odds of these probabilities, with respect to the reference one (e.g., the G-th), as linear functions of the covariate x. In this Paper, similarly to Berrettini et al. (2021), each of these \(G-1\) linear predictors is replaced with a smooth function of x, represented by a linear combination of m cubic B-spline bases \(B_\rho (\cdot )\) and coefficients \(\gamma _{g\rho }\):

$$\begin{aligned} \log \frac{p_g(x_i)}{p_G(x_i)}=\eta _{g}(x_i)=\sum _{\rho =1}^m B_{\rho }(x_i)\gamma _{g\rho }, \quad \text{ for } i=1,\dots ,n. \end{aligned}$$
(2)

By defining the \(n \times m\) design matrix \({\mathbf {B}}\), where the element in row i and column \(\rho \) is given by \(B_{\rho }(x_{i})\), and after some algebra, Eq. (2) can be rewritten as:

$$\begin{aligned} {{\varvec{p}}}_g({\mathbf {x}})=\frac{\exp ({\mathbf {B}}\varvec{\gamma }_{g})}{\sum _{g=1}^G\exp ({\mathbf {B}}\varvec{\gamma }_{g})}, \end{aligned}$$
(3)

where \(\varvec{\gamma }_{g}=(\gamma _{g1},\dots , \gamma _{gm})'\) corresponds to the vector of unknown regression coefficients, where the exponential is applied elementwise. To guarantee identifiability, the vector of coefficients corresponding to the reference group G are all set equal to 0.

Regarding the components’ normal densities, each mean \(\mu _g(\cdot )\) is also assumed to be an unknown smooth function of covariate x, represented through B-splines:

$$\begin{aligned} \varvec{\mu }_g(\mathbf {x})= \mathbf {B}\varvec{\beta }_g \end{aligned}$$
(4)

with \(\varvec{\beta }_g=(\beta _{g1},\dots , \beta _{gm})'\) vector of unknown regression coefficients.

Bayesian inference

Adopting B-splines to represent a smooth function requires the specification of what is known as the number of knots (or, equivalently, the number of B-spline bases), which governs how the bases behave and the flexibility of the resulting function. In the Bayesian framework, Lang and Brezger (2004) suggest a large number of knots (between 20 and 40) to ensure enough flexibility; additionally, they show how to define priors for the regression parameters \(\gamma _{g1},\dots ,\gamma _{gm}\) and \(\beta _{g1},\dots , \beta _{gm}\) in terms of a random walk:

$$\begin{aligned}&\gamma _{g\rho }=\gamma _{g,\rho -1}+w_{g\rho }, \quad w_{g\rho }\sim N(0,\delta _g^2);\end{aligned}$$
(5)
$$\begin{aligned}&\beta _{g\rho }=\beta _{g,\rho -1}+u_{g\rho }, \quad u_{g\rho }\sim N(0,\tau _g^2). \end{aligned}$$
(6)

Using this representation is equivalent to inducing a penalisation, based on differences of adjacent B-spline coefficients, and leads to the definition of “penalised” B-splines, commonly abbreviated to “P-splines”. Through this approach, the amount of smoothness is controlled by the variance parameters \(\delta _g^2\) and \(\tau _g^2\): their presence protects against possible overfitting when a larger than needed number of knots is chosen. In particular, small values for \(\delta _g^2\) and \(\tau _g^2\) lead to approximately constant log-odds and conditional mean, respectively. Hyperpriors are assigned to the variances \(\tau _g^2\), \(\delta _g^2\) and \(\sigma _g^2\), selecting Inverse Gamma distributions \(\text{ IG } (a,b)\), with \(a = 1\) and a small value for b, for example \(b=0.005\), leading to almost diffuse priors. The priors in (5) and (6) can be equivalently written in the form of global smoothness priors:

$$\begin{aligned} \varvec{\gamma }_{g}|\delta _{g}^2\propto \exp \left( -\frac{1}{2\tau _{g}^2}\varvec{\gamma }_{g}'{\mathbf {K}}\varvec{\gamma }_{g}\right) ,\quad \varvec{\beta }_{g}|\tau _{g}^2\propto \exp \left( -\frac{1}{2\tau _{g}^2}\varvec{\beta }_{g}'{\mathbf {K}}\varvec{\beta }_{g}\right) \end{aligned}$$

where the penalty matrix \({\mathbf {K}}\) is given by \({\mathbf {K}}=\varvec{\Delta }_1'\varvec{\Delta }_1\), with \(\varvec{\Delta }_1\) being the first order difference matrix ( Rue and Held 2005, Chapter 2, p. 52). Because \({\mathbf {K}}\) is rank deficient with \(\text{ rank }({\mathbf {K}}) = m-1\) for a first-order random walk, the priors are improper. It is worth mentioning that in the literature these kind of priors are usually referred to as intrinsic Gaussian Markov random fields (Rue and Held 2005).

The multinomial model in Equation (2) can be conveniently represented as a binary formulation in the partial dRUM representation proposed by Frühwirth-Schnatter et al. (2012). Conditional on each \(\lambda _g(x_i)=\exp (\eta _g(x_i))\), the random utilities are defined as

$$\begin{aligned} z_{gi}=\eta _g(x_i)-\log \left( \sum _{l\ne g}\lambda _l(x_i)\right) +\epsilon _{gi},\, \quad D_{gi}=\mathbbm {1}(z_{gi}>0); \end{aligned}$$
(7)

where \(z_{gi}\) are latent variables, \(D_{gi}=\mathbbm {1}(c_i=g)\) are the allocation indicators and \(\epsilon _{gi}\) are i.i.d. errors following a Logistic distribution (\(g=1,\ldots ,G, i=1,\dots ,n\)). Given \(\lambda _1(x_i),\dots ,\lambda _G(x_i)\) and the latent indicator variables \(D_{1i},\dots ,D_{Gi}\), the latent variables \(z_{1i},\dots , z_{Gi}\) are distributed according to an Exponential distribution and can be easily sampled in a data augmented implementation. To avoid any Metropolis-Hastings step, Frühwirth-Schnatter and Frühwirth (2010) approximate, for each \(\epsilon _{gi}\), the Logistic distribution by a finite scale mixture of H Gaussian distributions, with zero means and variances \(\{s^2_{1},\dots , s^2_{H}\}\) drawn with fixed probabilities \(\{w_1,\dots ,w_H\}\). The same authors obtained their finite scale mixture approximation by minimizing the Kullback-Leibler divergence between the densities, and recommend choosing \(H = 3\) in larger applications, where computing time matters, and to work with \(H = 6\) whenever possible. In a second step of data augmentation, the component indicators \(r_{gi}\) (\(g=1,\ldots ,G-1, i=1,\dots ,n\)), each taking value \(h=1,\dots , H\), are introduced as yet another level of latent variables. Conditional on \({\mathbf {z}}_g=(z_{g1}, \dots , z_{gn})'\) and \({\mathbf {r}}_g = (r_{g1}, \dots , r_{gn})'\), the binary logit regression model (7) reduces to a Gaussian regression model.

MCMC algorithm

Based on the representation of Sect. 3, a new MCMC algorithm is implemented, for a fixed G, by integrating the scheme proposed by Frühwirth-Schnatter et al. (2012) with the Bayesian P-spline approach by Lang and Brezger (2004), similarly to Berrettini et al. (2021). A sketch of the algorithm is comprised of the following steps:

  1. 1.

    Sample the regression coefficients’ vector \(\varvec{\gamma }_g\) conditional on \({\mathbf {z}}_{g}\) and \({\mathbf {r}}_g\), \(g=1,\dots , G-1\). Using the prior in Equation (5), the full conditional of \(\varvec{\gamma }_{g}\) is given by a multivariate Gaussian density. Straightforward calculations (Brezger and Lang 2006) show that the precision matrix \({\mathbf {P}}_{g}\) and the mean \({\mathbf {m}}_{g}\) of \(\varvec{\gamma }_{g}|\cdot \) are given by

    $$\begin{aligned} \begin{aligned}&{\mathbf {P}}_{g}={\mathbf {B}}'{\mathbf {W}}_g^{-1}{\mathbf {B}}+\frac{1}{\delta _{g}^2}{\mathbf {K}}, \\&{\mathbf {m}}_{g}={\mathbf {P}}_{g}^{-1}{\mathbf {B}}'{\mathbf {W}}_g^{-1}\left( {\mathbf {z}}_g+\log \varvec{\lambda }_{-g}({\mathbf {x}})\right) , \end{aligned} \end{aligned}$$
    (8)

    respectively, where \({\mathbf {W}}_g\) is a \(n\times n\) diagonal matrix with nonzero elements equal to the randomly drawn variances \((\omega _{1g}=s_{r_{g1}}^2,\dots , \omega _{ng}=s_{r_{gn}}^2)\) for the g-th group, the i-th element of \(\varvec{\lambda }_{-g}({\mathbf {x}})\) is \(\sum _{l \ne g}\lambda _{l}(x_i)\) and the logarithm is applied elementwise;

  2. 2.

    Sample the \(G-1\) variance parameters \(\delta _{g}^2\) conditional on \(\varvec{\gamma }_{g}\):

    $$\begin{aligned} \delta ^2_{g}|\varvec{\gamma }_{g} \sim IG\left( a+ \frac{\text{ rank }({\mathbf {K}})}{2}, b+ \frac{1}{2}\varvec{\gamma }_{g}'{\mathbf {K}}\varvec{\gamma }_{g}\right) ; \end{aligned}$$
    (9)
  3. 3.

    For each unit \(i=1,\dots , n\), sample all the (partial) differences of utilities \(z_{1i},\dots , z_{G-1,i}\) simultaneously from:

    $$\begin{aligned} z_{gi}=\log \left( \frac{\lambda _{g}(x_i)}{\sum _{l\ne g}\lambda _{l}(x_i)}U_{gi}+D_{gi}\right) -\log \left( 1-U_{gi}+\frac{\lambda _{g}(x_i)}{\sum _{l\ne g}\lambda _{l}(x_i)}D_{gi}\right) , \end{aligned}$$
    (10)

    with \(U_{gi}\sim \text {Unif}(0,1)\);

  4. 4.

    For \(g=1,\dots ,G-1\) and \(i=1,\dots ,n\), sample the component indicators \(r_{gi}\) conditional on \(z_{gi}\) from:

    $$\begin{aligned} \text{ Pr }(r_{gi}=h|z_{gi},\varvec{\gamma }_g)\propto \frac{w_h}{\sqrt{s^2_h}}\exp \left[ -\frac{1}{2}\left( \frac{z_{gi}-\eta _{g}(x_i)+\log \sum _{l\ne g}\lambda _{l}(x_i)}{\sqrt{s^2_h}}\right) ^2\right] ; \end{aligned}$$
    (11)
  5. 5.

    Sample the regression coefficients’ vector \(\varvec{\beta }_g,\, g=1,\dots , G\) from a multivariate Gaussian density with covariance matrix \({\mathbf {V}}_g\) and mean \(\varvec{\nu }_g\)

    $$\begin{aligned} {\mathbf {V}}_{g}=\left( \frac{1}{\sigma _g^2}{{\mathbf {B}}^{(g)}}'{\mathbf {B}}^{(g)}+\frac{1}{{\tau _{gj}^2}}{\mathbf {K}}_{j}\right) ^{-1}, \quad \varvec{\nu }_{g}={{\mathbf {V}}_g}{{\mathbf {B}}^{(g)}}'{\mathbf {y}}^{(g)}, \end{aligned}$$
    (12)

    where the superscript \(^{(g)}\) is applied throughout this section to any matrix or vector to indicate the rows of that matrix (or the elements of that vector) corresponding to the units allocated to the g-th group;

  6. 6.

    Sample the G variance parameters \(\tau _{g}^2\) conditional on \(\varvec{\beta }_{g}\):

    $$\begin{aligned} {\tau ^2_{g}}|\varvec{\beta }_{g} \sim IG\left( a+ \frac{\text{ rank }({\mathbf {K}})}{2}, b+ \frac{1}{2}\varvec{\beta }_{g}'{\mathbf {K}}\varvec{\beta }_{g}\right) \end{aligned}$$
    (13)
  7. 7.

    Sample the G variance parameters \({\sigma _{g}^2}\) conditional on \(\varvec{\mu }_g({\mathbf {x}})={\mathbf {B}}\varvec{\beta }_g\):

    $$\begin{aligned}&\sigma ^2_{g}|\varvec{\mu }_g({\mathbf {x}}),\sim IG\Big (a+ \frac{\sum _{i=1}^{n}D_{gi}}{2}, \nonumber \\&b+ \frac{1}{2}\left( {\mathbf {y}}^{(g)}-\varvec{\mu }_g^{(g)}({\mathbf {x}})\right) '\left( {\mathbf {y}}^{(g)}-\varvec{\mu }_g^{(g)}({\mathbf {x}})\right) \Big ). \end{aligned}$$
    (14)
  8. 8.

    Classify each unit i according to Bayes’ rule: draw \(D_{gi}\) (\(g=1,\ldots ,G\), \(i=1,\dots ,n\)) from the following discrete probability distribution which combines the likelihood and the prior:

    $$\begin{aligned} \text{ Pr }(D_{gi}=1|y_i,x_i,\varvec{\gamma }, \varvec{\beta }_g,\sigma _g^2) \propto \frac{\lambda _{g}(x_i)}{\sum _{g=1}^G\lambda _{g}(x_i)}f_{{\mathcal {N}}}(y_i|\mu _g(x_i), \sigma _g^2). \end{aligned}$$
    (15)

It is worth mentioning that Steps 1 to 4 of this algorithm are similar to those proposed by Berrettini et al. (2021) to sample the parameters related to the component weights, while Steps 3 to 5 are specific to the models considered in this Paper, and are needed to sample the parameters associated with the component means.

Label switching, posterior inference and model selection

The MCMC algorithm described in the previous section can be prone to label switching (see Frühwirth-Schnatter 2006, Section 3.5 for a review). A possible solution to deal with this problem, which exploits k-means clustering (with G clusters) of the posterior draws to identify a unique labeling, has been proposed by Frühwirth-Schnatter et al. (2012). This solution is readily available to the semiparametric FMRC models introduced in this Paper.

Posterior inference is carried out after completing the prefixed number T of iterations. As usual, parameter’s posterior mean can be computed considering averages of the last \(T-T_0\) draws of the chains, with \(T_0\) defining the burn-in phase. As far as the smooth functions are concerned, posterior quantities are obtained by exploiting their representation as linear combinations of spline bases and the corresponding regression coefficients’ estimates. Pointwise percentiles (usually 2.5–97.5 or 5–95) computed over the last \(T-T_0\) posterior draws can be used to quantify uncertainty associated to the estimated smooth functions.

The Maximum-A-Posteriori (MAP) rule is adopted to partition the observations into G groups, by allocating them to the G components. In particular, each unit \(i=1,\dots ,n\) is assigned to the component \({\hat{c}}_i\) such that

$$\begin{aligned} {\hat{c}}_i=\arg \max _g\left( \sum _{t=T_0}^T D_{1i}^{(t)},\dots , \sum _{t=T_0}^T D_{gi}^{(t)},\dots ,\sum _{t=T_0}^T D_{Gi}^{(t)}\right) . \end{aligned}$$
(16)

Occasionally, the use of the MAP rule can lead to empty groups, when one or more components could have no units assigned to them. In such situations, it might be worth distinguishing between the number of components G and the number of nonempty components, denoted as

$$\begin{aligned} {\tilde{G}} =\sum _{g=1}^G\mathbbm {1}\left( {\hat{n}}_g>0\right) , \end{aligned}$$
(17)

where \({\hat{n}}_g=\sum _{i=1}^n \mathbbm {1}({\hat{c}}_{i}=g)\) is the number of observations assigned to group g (\(g=1,\ldots ,G\)).

A relevant issue related to mixture models is the choice of the number of components, which originated many efforts in the statistical literature. The proposed MCMC algorithm requires the value of G to be fixed in advance. Thus, the algorithm should be run for different values of G and the obtained results should be compared in order to select the optimal number of components. Several model selection criteria are available to perform these comparisons (see Celeux et al. 2019, for a recent review). Many of these criteria require the determination of the number of free parameters of each candidate model. However, the quantification of this number for the semiparametric FMRC models described in this Paper can be difficult due to the regularisation induced by the prior distributions on the spline coefficients. A solution to circumvent this problem is proposed by Raftery et al. (2007). They suggest the use of \(2s_l^2\) as an estimate of this unknown quantity, where \(s_l^2\) is the sample variance of the log-likelihoods computed as \(l^{(t)}=\sum _{i=1}^n\log f({\mathbf {y}}_i|\varvec{\theta }_{{\mathbf {D}}_i}^{(t)})\), with \({\varvec{\theta }}_{{\mathbf {D}}_i}^{(t)}\) denoting the vector of estimated parameters for the component unit i is allocated to, at iterations \(t=T_0,\dots ,T\), after the burn-in. Using this estimate, they derive two model selection criteria, whose values depend only on the log-likelihoods from the posterior simulation, that are readily available:

$$\begin{aligned}&\text{ AICM }=-2({\bar{l}}-s_l^2), \end{aligned}$$
(18)
$$\begin{aligned}&\text{ BICM }=-2\left[ {\bar{l}}-s_l^2(\log (n)-1)\right] , \end{aligned}$$
(19)

where \({\bar{l}}\) is the sample mean of the sequence of log-likelihoods \(l^{(t)}\), for each iteration \(t=T_0,\dots ,T\), after the burn-in. As pointed out by Raftery et al. (2007), the AICM is connected to the DIC criterion (Spiegelhalter et al. 2002). More specifically, it coincides with the DIC definition provided by Gelman et al. (2004, Sect. 6.7). Concerning the BICM, it can be related to an approximation of the log-marginal likelihood. Successful applications of the AICM in the mixture modelling context are described, for example, by Erosheva et al. (2007), Gormley and Murphy (2010), Gormley and Murphy (2011), and Mollica and Tardella (2017). The BICM is exploited, for example, by Ranciati et al. (2017), Murphy et al. (2020) and Redivo et al. (2020).

Simulation study

The performance of the proposed approach is investigated in a simulated environment, considering two scenarios that differ in terms of the true number of components and the distribution of the manifest variable. In both scenarios, the manifest variable y and the concomitant covariate x are assumed to be univariate, for simplicity.

The quality of the estimates for the covariate effects on the conditional means is evaluated through a comparison between the true effects and their posterior estimates, after fitting each of the following mixture of regression models:

  • Semiparametric Finite Mixture of Regressions models with Concomitants (SFMRC), with flexible specification of both the mixture weights \(\pi _g(x)\) and the conditional means \(\mu _g(x)\), \(g=1,\dots ,G\);

  • Semiparametric Finite Mixture of Regression (SFMR) models, with constant mixture weights \(\pi _g\) and flexible specification of the conditional means \(\mu _g(x)\), \(g=1,\dots ,G\);

  • (parametric) FMRC, with linearity assumption for the effect of x on both the log-odds of the mixture weights \(\log (\pi _g(x)/\pi _G(x))=\eta _g(x)\) and the conditional means \(\mu _g(x)\), \(g=1,\dots ,G\);

  • (parametric) FMR, with constant mixture weights \(\pi _g\) and linearity assumption for the effect of x on the conditional means \(\mu _g(x)\), \(g=1,\dots ,G\).

Additionally, by adapting to the univariate Gaussian case the models discussed in Berrettini et al. (2021), two mixture models with concomitants are also considered:

  • Semiparametric Finite Mixture models with Concomitants (SFMC), with flexible specification of the mixture weights \(\pi _g(x)\) and constant conditional means \(\mu _g\), \(g=1,\dots ,G\);

  • (parametric) FMC, with linearity assumption for the effect of x on the log-odds of the mixture weights \(\log (\pi _g(x)/\pi _G(x))=\eta _g(x)\) and constant conditional means \(\mu _g\), \(g=1,\dots ,G\);

For every class of models, G is initially set equal to the true number of components. In particular, the pointwise means of the estimated \(\mu _g(x^*)\), denoted \({\hat{\mu }}_g(x^*)\), are plotted together with the pointwise 2.5 and 97.5 percentiles among all samples, where \(\{x_i^*\}, i=1,\dots , n,\) are grid points taken evenly in the range of covariate x. To quantitatively assess the performance of the estimators of the unknown regression functions \(\mu _g(x)\), the same measure employed in Huang and Yao (2012), Huang et al. (2013) and Xiang and Yao (2018) is adopted, that is the square Root of the Average Squared Errors (RASE), computed as

$$\begin{aligned} \text{ RASE}_{\mu _g}=\sqrt{\frac{1}{n}\sum _{i=1}^n\left( {\hat{\mu }}_g(x_i^*)-\mu _g(x_i^*)\right) ^2}, \quad g=1,\dots ,G; \end{aligned}$$
(20)

in practice, the RASE measures the (Euclidean) pointwise distance between the “true” curve and the estimated one. The same graphical and quantitative evaluations are carried out for the covariate effects on the mixture weights, this time by restricting the analysis to the class of semiparametric and parametric models with concomitants.

Regarding the clustering performance, a comparison between true allocations and inferred ones is made in terms of Adjusted Rand Index (ARI) (Hubert and Arabie 1985) and soft ARI (sARI) (Flynt et al. 2019). For each method and each value of G, 4000 MCMC draws are simulated after a burn-in of as many draws. Both AICM and BICM are considered to select the optimal number of components, and the number of nonempty components \({\tilde{G}}\) is computed according to Equation (17). For each of the competing classes of models, a proper MCMC algorithm has been implemented in R (R Core Team 2020). The R codes for the four algorithms are available on GitHub at the following link: github.com/MarcoBerrettini/sMoE.

First simulation experiment: G=2

A batch of 100 independent datasets is generated with \(n=1000\) from a two-component mixture of regression models with weights

$$\begin{aligned} \pi _1(x)=0.1+0.85\sin (\pi x),\quad \pi _2(x)=1-\pi _1(x), \end{aligned}$$

where x is the only covariate, sampled from a standard uniform distribution: \(x_i\sim \text {Unif}(0,1)\), \(i=1,\dots ,1000\). The functional form of \(\eta _1(x)=\log \frac{\pi _1(x)}{1-\pi _2(x)}\), coupled with the specific range of values for \(x_i\), leads to a nonmonotonic concave log-odds. Conditional on x and the component indicators, each component density is a Gaussian distribution, with means \(\mu _1(x), \mu _2(x)\), and variances \(\sigma _1^2, \sigma _2^2\) given by:

$$\begin{aligned} \begin{array}{lll} &{}\mu _1(x)=15(x-0.5)^2+1, &{}\quad \sigma _1^2=0.09;\\ &{}\mu _2(x)=5(x-0.5)^2, &{}\quad \sigma _2^2=0.0625. \end{array} \end{aligned}$$

Figure 2 shows one of the 100 independent samples. Figure 3 highlights the limits of the parametric approach when a nonmonotonic function, symmetric about \(x=0.5\), has to be approximated. In particular, for fixed number of components \(G=2\), both FMC and FMRC tend to fit a constant function with an associated \(\text{ RASE}_{\eta }\) (and its standard deviation), averaged over the 100 simulations, equal to 2.402 (0.209) and 10.091 (31.020), respectively. On the other hand, SFMRC seems to catch the underlying trend even though some oversmoothing is present around the peak of the function. For this model, the average \(\text{ RASE}_{\eta }\) drops to 0.209, with standard deviation of 0.665. A peculiar behaviour emerges when examining the estimates for \(\eta _1(x)\) obtained with SFMC, characterised by an average \(\text{ RASE}_{\eta }\) equal to 4.646 (and a standard deviation equal to 0.776). This might be caused by an unsuccessful attempt to counterbalance the evident model misspecification, related to the assumption of constant conditional means, by using the flexible mixture weights’ specification.

Fig. 2
figure 2

First simulation experiment: example of a simulated dataset

Fig. 3
figure 3

First simulation experiment: pointwise average and 2.5–97.5 percentiles of the log-odds of the mixture weight \(\eta _1\) estimated by the four mixture models with concomitants over 100 simulated datasets, for fixed \(G=2\)

Regarding the estimated conditional means, the SFMRC shows good performance in the left panel of Fig. 4, apart from some oversmoothing in the lower component \(\mu _2(x)\), for central values of x. In this area, the probability of observing units from component 2 reaches its minimum, as previously shown in Fig. 3. Around this region, most of the observations come from component 1, with only few observations from component 2. This disproportion, coupled with a certain degree of overlap of the two components, seems to have led the MCMC algorithm to assign erroneously some units from component 1 to component 2, with a consequent slight upward bias in \({\hat{\mu }}_2(x)\). This explains also the oversmoothing observed when estimating the effect of the covariate x on the log-odds \(\eta _1(x)\) of the mixture weights. This issue becomes way more evident if constant weights are assumed without considering the effects of the concomitant covariate x, as for the SFMR; see the second plot in Fig. 4. Again, the main problem regards mostly the lower component, whose true mean is not fully included in the bands, even though they widen considerably in the overlap region.

Fig. 4
figure 4

First simulation experiment: pointwise average and 2.5–97.5 percentiles of the conditional means estimated with both semiparametric regression approaches over 100 simulated datasets, for fixed \(G=2\)

No assumption of constant weights is made when fitting the FMRC, but, as previously shown in Fig. 3, this model estimates a constant effect of the covariate, making it practically equivalent to a FMR. This is evident in Fig. 5, where the conditional means estimated by the two models are compared. Since these two functions are generated to be quadratic and symmetric about \(x=0.5\), both parametric regression models fit horizontal lines, effectively collapsing to a simple mixture of Gaussians not involving the effect of the covariate for the conditional distribution of the dependent variable, thus leading to estimated conditional means very close to those obtained with FMC and SFMC (Fig. 6). It is worth mentioning that the fitted constant means obtained by these four class of models are not even centered around the true average group means. Table 1 summarises this comparison among the conditional mean functions estimated by the six competing models from a quantitative point of view, by displaying, for each combination of method and component \(g=1,2\), the average \(\text{ RASE}_{\mu _g}\) and the standard deviation over 100 simulated datasets. Quality of the estimates are strictly related to the quality of the allocations, as Table 2 confirms. The SFMRC, in fact, outperforms its competitors in terms of AICM, BICM, ARI and sARI for fixed number of components \(G=2\), followed by the SFMR. Given the previous considerations about the results, both the parametric approaches and those assuming constant conditional means prove to be not satisfactory in this simulation setting.

Fig. 5
figure 5

First simulation experiment: pointwise average and 2.5–97.5 percentiles of the conditional means estimated with both parametric regression approaches over 100 simulated datasets, for fixed \(G=2\)

Fig. 6
figure 6

First simulation experiment: pointwise average and 2.5–97.5 percentiles of the conditional means estimated with both mixture approaches with concomitants over 100 simulated datasets, for fixed \(G=2\)

Table 1 First simulation experiment: mean (and standard deviation) of the RASE scores computed on the estimated conditional means over 100 simulated datasets
Table 2 First simulation experiment: average AICM, BICM, ARI and sARI (number of times each model ranks first) over 100 simulated datasets, for fixed \(G=2\)

A comparison among the six competing models is performed also by examining the best models selected according to AICM and BICM when considering a number of components ranging from 1 to 4. Table 3 reports the distribution of the number of nonempty components \({\tilde{G}}\) resulting from the selection based on AICM. \({\tilde{G}}=2\) is always the best choice according to the SFMRC, while 26 times out of 100 the SFMR provides a better AICM with an additional component. It is interesting to note that parametric regression approaches tend to show better values for AICM when considering a single component. Conversely, models characterised only by the presence of concomitants appear to improve if a larger number of components is considered. This again might be related to the fact that increasing the number of components might compensate for their intrinsic mispecification.

Table 3 First simulation experiment: number of nonempty component selected for each method, according to AICM, over 100 simulated datasets
Table 4 First simulation experiment: number of nonempty component selected for each method, according to BICM, over 100 simulated datasets

By comparing the results reported in Table 3 to those in Table 4, the tendency of the BICM to select models with fewer nonempty components is apparent. This seems to be consistent with what has been already reported in the literature (see, for example, Redivo et al. 2020). Only SFMRC models do not seem to suffer from this systematic underestimation.

Examining the best models (according to AICM) fitted with each method for each simulated dataset, rather than fixing the number of components, the conclusions does not seem to change. All AICM averaged values reported in Table 5 improve with respect to those in Table 2, also for the SFMRC, because sometimes having an extra component, even if it is emptied during the posterior allocation, slightly decreases AICM. For the same reason, both the average ARI and sARI appear to slightly improve for the SFMRC, while they worsen for models that tend to pick the wrong number of components; see Table 5. Similar conclusions can be drawn when considering the best models selected using BICM (data not shown).

Table 5 First simulation experiment: average AICM, ARI and sARI (number of times each model ranks) over 100 simulated datasets, for optimal G according to AICM

Second simulation expertiment: G=3

A batch of 100 independent datasets is generated with \(n=1000\) from a three-component mixture of regression models, with log-odds of mixture weights \(\eta _{g}(x)=\log \pi _g(x)/\pi _3(x)\), \(g=1,2\), defined as:

$$\begin{aligned} \begin{aligned}&\eta _1(x)=3\, \frac{\exp (7.5 - 15x)}{1 + \exp (7.5 - 15x)} - 1.5,\\&\eta _2(x)=3\, \frac{\exp (15x-7.5)}{1 + \exp (15x-7.5)} - 1.5; \end{aligned} \end{aligned}$$

where x is the only covariate, sampled from a uniform distribution: \(x_i \sim \text {Unif} (0,1)\), \(i=1,\dots ,1000\). Conditional on x and the component indicators, y follows a univariate Gaussian distribution, with means \(\mu _1(x), \mu _2(x), \mu _3(x)\), and variances \(\sigma _1^2, \sigma _2^2,\sigma _3^2\), respectively, defined as follows:

$$\begin{aligned} \begin{aligned}&\mu _1(x)=0.5\sin (6x + 0.8) + \exp (-16 (3x + 0.15)^2) - 1.75,\quad \sigma _1^2=0.04;\\&\mu _2(x)=1.75-0.5\sin (6x + 0.8) + \exp (-16 (3x + 0.15)^2) - 1.75), \quad \sigma _2^2=0.04;\\&\mu _3(x)=-0.5\sin (2\pi x), \quad \sigma _3^2=0.25.\\ \end{aligned} \end{aligned}$$

Figure 7 shows one of the 100 independently generated samples for this second scenario.

Fig. 7
figure 7

Second simulation experiment: example of a simulated dataset

Figure 8 shows that the SFMRC is able to catch almost perfectly the effects of the covariate x on both predictors \(\eta _1\) and \(\eta _2\). On the contrary, due to nonlinearity, the linear approximation by the FMRC is worse, so that the true effects exceed the bands at the boundaries of the range of x. As expected, the average RASE scores for the SFMRC (together with the associated standard deviations) reported in Table 6 are lower than those of the parametric competitor. Table 6 contains also information about the performance of SFMC and FMC. In this second simulation experiment, it is evident that imposing constant conditional means has a dramatic impact on the ability of these two class of models in recollecting the effect of the covariates on the mixture weights: the average RASE associated with these two classes of mixture models with concomitants are almost ten times larger than those obtained with SFMRC.

Fig. 8
figure 8

Second simulation experiment: comparison between the log-odds of the mixture weights estimated by the semiparametric (left) and parametric (right) mixture of regressions with concomitants over 100 simulated datasets, for fixed \(G=3\)

Regarding the estimates of the conditional means, the SFMRC seems to outperform the competitors, despite some overlap present between Cluster 1 and Cluster 3 for low values of x, and between Cluster 2 and Cluster 3 for high values of x (see Fig. 9). The FMRC is unable to properly approximate the nonlinear trends, especially where there are fewer observations (i.e. in Cluster 1 for high values of x, and in Cluster 2 for low values of x). Nevertheless, Fig. 10 shows that, thanks to the good estimates of the mixture weights, the FMRC discriminates almost perfectly among groups in the aforementioned overlapping areas.

Table 6 Second simulation experiment: mean (and standard deviation) of the RASE scores computed on the estimated log-odds of the mixture weights over 100 simulated datasets, for fixed \(G=3\)
Fig. 9
figure 9

Second simulation experiment: conditional means estimated by the SFMRC over 100 simulated datasets, for fixed \(G=3\)

Fig. 10
figure 10

Second simulation experiment: conditional means estimated by the FMRC over 100 simulated datasets, for fixed \(G=3\)

Results for SFMR are reported, in detail, in Fig. 11. The flexibility allowed for the estimates of the conditional means, combined with the impossibility to include the effect of covariate x into the estimates of the mixture weights, results into overlapping estimated functions and wide bands. The performance of the FMR is just slightly worse with respect to the ones obtained by the FMRC. The main differences can be observed in the overlapping regions of Fig. 12, where the estimated conditional means intersect each other. As far as SFMC and FMC are concerned, these two class of models show a similar behaviour in the estimated (constant) conditional means (Figs. 13 and 14). Due to the specific settings considered in this second experiment, the impact of the mispecification error seems less severe. However, it is apparent that both models suffers from a large sampling variability in the estimated conditional means. This is particularly evident in the estimates for component 3. It is worth noting that component 3 is used as baseline to define the log-odds. Thus, the extremely large variability in the estimates for the conditional mean of this component can be connected with the previously mentioned poor performance of SFMC and FMC in estimating the log-odds. All the conclusions drawn from a graphical point of view are confirmed by the quantitative results in terms of \(\text{ RASE}_{\mu }\) reported in Table 7.

Fig. 11
figure 11

Second simulation experiment: conditional means estimated by the SFMR model over 100 simulated datasets, for fixed \(G=3\)

Fig. 12
figure 12

Second simulation experiment: conditional means estimated by the FMR over 100 simulated datasets, for fixed \(G=3\)

Fig. 13
figure 13

Second simulation experiment: conditional means estimated by the SFMC model over 100 simulated datasets, for fixed \(G=3\)

Fig. 14
figure 14

Second simulation experiment: conditional means estimated by the FMC model over 100 simulated datasets, for fixed \(G=3\)

Table 7 Second simulation experiment: mean (and standard deviation) of the RASE scores computed on the estimated conditional means over 100 simulated datasets

Table 8 shows that, in terms of both AICM and BICM, the SFMRC is evidently better than its competitors with \(G=3\). However, this result does not correspond to an equal gap in the quality of the allocations, expressed in terms of both ARI and sARI. Indeed, either mixture of regression models with concomitants perform well, even though the semiparametric one slightly prevails.

Table 8 Second simulation experiment: average AICM, BICM, ARI and sARI (number of times each method ranks first) over 100 simulated datasets, for fixed \(G=3\)

Regarding model selection, for each competing method and each simulated dataset, the best model is considered among different mixture models with \(G=1,\dots ,5\). Table 9 shows that, when the selection is based on AICM, the SFMRC is the only one which is able to consistently pick the correct number of nonempty components.

Table 9 Second simulation experiment: number of nonempty components selected for each method, according to AICM, over 100 simulated datasets

This leads to more favorable results for the SFMRC, if ARI and sARI computed with reference to the best models selected by each method are compared; see Table 10. On the contrary, results worsen when BICM is considered as model selection criterion. As shown in Table 11, the tendency of BICM to underestimate the actual number of nonempty components is confirmed, and in this second simulation experiment also SFMRC models suffer from this. As a consequence, the values of ARI and sARI computed for the best models selected using BICM are negatively impacted (data not shown).

Table 10 Second simulation experiment: average AICM, ARI and sARI (number of times each method ranks first) over 100 simulated datasets, for optimal G according to AICM
Table 11 Second simulation experiment: number of nonempty components selected for each method, according to BICM, over 100 simulated datasets

Real data applications

Baseball salaries data

Watnik (1998) provides a dataset consisting of information about players for the 1992 Major League Baseball season. In particular, their salaries are considered as the response, along with measures of the 337 players’ previous year’s performances. Notice that this dataset is already well known in the literature on FRMC (see, for example, Khalili and Chen 2007; Chamroukhi and Huynh 2018). For simplicity, the analysis described in this section focuses on one of the quantitative covariates, the number of runs, taken as a measure of a player’s contribution to the team. More specifically, the effect of this variable on player salaries is studied, by fitting the six different mixture of regression models considered in Sect. 4 for a fixed number of components ranging from \(G=1\) to \(G=4\). As suggested by Watnik (1998), due to asymmetry, the response is preemptively transformed by taking the natural logarithm.

In light of the tendency of BICM to underestimate the number of nonempty components highlighted in the simulation experiments, in this application the optimal value for G is selected according to AICM. The number of nonempty components \({\tilde{G}}\) resulted to be equal to 2 for the two semiparametric mixture of regression models (SFMRC and SFMR) as well as the two mixture of Gaussians (SFMC and FMC), and equal to 1 for the two parametric mixture of regression models (FMRC and FMR). Among these six models, the SFMRC presents the smallest AICM (663.4), followed by the SFMR (733.7), the SFMC (781.1) and the FMC (817.3), while the remaining best parametric mixture of regression models, having \(G=1\), collapse to the same model, with the highest AICM (888.1).

As Fig. 15 shows, the main difference between the semiparametric mixture of regression models seems to be related to the allocation of players with a low number of runs. The SFMRC keeps the two clusters well separated, by assigning all of these units to the lower one, whereas the SFMR creates some overlap, such that the functions describing the conditional means, \({\hat{\mu }}_1(x)\) and \({\hat{\mu }}_2(x)\), almost intersect one another. Figure 15 shows, in both cases, the presence of a nonlinear effect of the number of runs on the log-salary for the upper cluster, while the bands does not exclude a linear effect for the lower cluster.

Fig. 15
figure 15

Baseball salaries data: estimated posterior conditional means (and pointwise 95% posterior credible bands) obtained from the SFMRC (left panel) and the SFMR (right panel)

Fixing the number of components \(G=2\), the FMR allocates the players similarly to the SFMRC. In particular, only 9 units out of 337 (ARI = 0.894) differ in cluster allocation between the two models. Focusing on the parametric regression approaches, the main difference between the allocations seems to be related to few among the lowest paid players having a number of runs ranging between 30 and 90, which are assigned to the upper component by the FMRC. This probably induces variability in the estimated mean functions of the latter, which present wider bands, if compared to the ones estimated by FMRC; see Fig. 16.

Fig. 16
figure 16

Baseball salaries data: estimated posterior conditional means (and pointwise 95% posterior credible bands) obtained from the FMRC (left panel) and the FMR (right panel), for fixed \(G=2\)

The two approaches with constant conditional means produce a sensibly different partition with respect to the other methods, and similar to each other, detecting a cluster of highly paid players (around 500.000 dollars or more); see Fig. 17. Here, the boundary (not drawn) separating the two groups is not perfectly horizontal due to the covariate effect on the log-odds of the weights; see Fig. 18. In particular, all four mixture models with concomitants agree about the presence of a decreasing trend in the effect of the number of runs on the log-odds of the mixture weight \(\eta _1(x)\), but the semiparametric methods estimate nonlinear functions that cannot be approximated properly by a straight line. The ability to pick this underlying effect is also the main reason for the differences observed between the performances of the two semiparametric regression approaches.

Fig. 17
figure 17

Baseball salaries data: estimated posterior conditional means (and pointwise 95% posterior credible bands) obtained from the SFMC (left panel) and the FMC (right panel), for fixed \(G=2\)

Fig. 18
figure 18

Baseball salaries data: estimated posterior effects on the log-odds (and pointwise 95% posterior credible bands) obtained from the SFMRC (top left panel), the FMRC (top right panel), the SFMC (bottom left panel), the FMC (bottom right panel) for fixed \(G=2\)

The partition induced by the SFMRC identifies a cluster, the lower one (in green), which might be broadly interpreted as the cluster of “underrated” (or “underpaid”, with respect to the others) baseball players. In fact, while it is obvious players with better performances get paid more, as corroborated by the increasing trends of both means, there seems to be a group of players whose salary is substantially lower than that of players achieving similar performances (in terms of number of runs), belonging to the upper group (in blue). Indeed, the two estimated mean functions \({\hat{\mu }}_1(x)\) and \({\hat{\mu }}_2(x)\) in Fig. 15 appear almost parallel. A partial explanation of this result can be found in some additional pieces of information available in the dataset. In particular, there is a variable indicating the “free agency eligibility” of each player, i.e. if that player could have gone to a team of his choice in 1992. At the time -Watnik (1998) explains- only players with a certain amount of experience were eligible for free agency (134 out of 337) and, thus, able to market themselves to the highest bidder. On the contrary, if a player not “free agency eligible” wanted to play, he had to accept what his team was willing to pay him, or go with his team to an appointed “arbitrator”, who would choose between the player’s suggested salary and the team’s one. However, “arbitration eligibility”, which is included in the dataset as a variable as well, was for players (65 out of 337, in the dataset) who had some experience in the League, although not enough to be eligible for free agency. For interpretation purpose, the two above described categories, “free agency eligible” and “arbitration eligible” players are merged, and Table 12 compares the partition induced by SFMRC with the one obtained by distinguishing between (free agency or arbitration) eligible and noneligible players. The resulting ARI (0.626) is the highest observed among the six models. Indeed, it can be noticed that almost all the eligible players (193 out of 199) belong to the upper (blue) cluster, together with 29 players who apparently had been able to obtain an “adequate” salary without probing the market.

Table 12 Baseball salaries data: comparison between the resulting allocations of the SFMRC and (free agency or arbitration) eligibility

Rather than using the additional information on eligibility to validate the clustering results, without including it into the models (and thus, treating it as a potential source of unobserved heterogeneity), this binary variable could be explicitly included into the models as an additional (binary) covariate. This analysis is focused only on the two semiparametric mixture of regression models (SFMRC and SFMR). Not surprisingly, the inclusion of this additional covariate leads the AICM to reduce the optimal number of components to \(G=1\) for both of them. This is consistent with the fact that the two conditional means as estimated by the two-component SFMRC (without the binary covariate) are almost parallel, and can be reasonably approximated by a single curve plus a vertical shift depending on the value taken by the binary covariate. In addition, this simplification in the model resulted in an AICM value lower than the one associated with the two-component SFMRC. This second part of the application should allow the Reader to appreciate the efficacy of SFMRC models in detecting possible sources of unobserved heterogeneity. Appendix A contains details about how to extend both the specification of the model given in Sect. 2 and the MCMC algorithm provided in Sect. 3.1 in order to include a binary covariate.

Fig. 19
figure 19

Nitrogen oxide data: estimated posterior conditional means (left panel) and log-odds of mixture weights (right panel), with pointwise 95% posterior credible bands for the two components

Table 13 Nitrogen oxide data: AICM produced by the six models for optimal G

Nitrogen oxide data

First introduced by Brinkman (1981), this dataset includes 88 observations about the concentration of nitric oxide in engine exhaust and the equivalence ratio, which represents a measure of the richness of the air-ethanol mix, for burning ethanol in a single-cylinder automobile test engine. Mixtures of regression models have been already fit to these data in Xiang and Yao (2018), where the equivalence ratio has been considered as the dependent variable y, and the concentration of nitric oxide is taken as concomitant covariate x. In this Paper, a similar analysis is performed using the proposed flexible Bayesian approach. Although the two-component structure seems quite clear in the scatterplot, six algorithms with different levels of flexibility have been run for 10000 iterations each (burn-in: 5000 draws) with G ranging from \(G=1\) to \(G=4\). According to AICM, all the regression models considered find two nonempty components, while the remaining methods find three. Figure 19 shows the covariate effects as estimated by the proposed semiparametric mixture of regressions with concomitant covariates. More specifically, in the left panel the estimated conditional means are reported, while in the right one the estimated log-odds \(\eta _1(x)\) can be observed. Both plots are in line with the results by Xiang and Yao (2018) and indicate the lack of need for a flexible specification of any covariate effect in this case. In fact, the bands do not exclude linearity, although some mild nonlinearity seems to be present in the conditional means. This shows the efficacy of the penalisation induced by the selected priors in adjusting the proposed model to the required level of complexity. Coherently, AICM points at SFMR as the model to be preferred; see Table 13.

Conclusions

In this Paper, a general specification of mixture of regression models is proposed, allowing both component weights and conditional means to be nonlinear functions of a covariate. This general approach resort to spline functions for approximating the smooth effect of the concomitant variables. Parameter estimation is based on a Bayesian approach through MCMC machinery. In principle, the Reader might question whether a full parametric approach, e.g. by considering a monomial set of bases to represent the map between component probabilities and covariates, could prove to be flexible enough to catch nonlinearity. Unfortunately, this parametric representation would require some arbitrary choices, such as the maximum degree for monomial bases, or the definition of an automatic selection criterion. The approach advocated for in this Paper bypasses this issue by controlling flexibility through the variance parameters of the spline coefficients, following Lang and Brezger (2004).

Using simulation experiments, the proposed method has proved to be a useful tool for recovering the underlying covariate effects -especially if indeed not linear- and, consequently, for estimating models with a better goodness of fit and leading to a more accurate allocation. The potential of the proposal has been illustrated also through applications to real data.

Although the results shown seem encouraging, the proposed model is characterised by some limitations and there are some issues that might deserve further investigation.

Firstly, there are limitations related to the assumption for both the manifest variable and the concomitant covariate to be univariate. The adaptation to the multivariate case would require particular attention to deal with the presence of component-specific covariance matrices and multiple regressors. Regarding the latter, each predictor could be expressed as a sum of smooth functions, as in the additive paradigm introduced by Hastie and Tibshirani (1990); hence, Bayesian P-splines could be used to approximate such nonlinear functions. However, conditions for identifiability should be revised, and, in particular, further constraints should be introduced to guarantee identifiability of the predictors.

In addition, the simulation experiments have highlighted the tendency of BICM to underestimate the actual number of components. As previously mentioned, this seems coherent with other results already reported in the literature (see, for example, Redivo et al. 2020). Since BICM has been introduced as an approximation of the (logarithm of) the marginal likelihood, it would be interesting to consider other estimators of this quantity. For example, Frühwirth-Schnatter (2019) has recently proposed two bridge sampling estimators of the marginal likelihood for FMRC models, and investigating their performance in the context of the semiparametric models proposed in this Paper could be the subject of future investigation.

Furthermore, adopting a probit representation (Geweke and Keane 2007) instead of the dRUM approach could provide another potential avenue to explore, in order to understand the benefit of the proposed modelling framework, especially when the ease of interpretation given by the logit formulation is not of relevance.

Alternative approaches to penalise the spline coefficients associated with the bases could be considered. For example, one could consider applying shrinkage to the variances \(\tau _g^2\) and \(\delta _g^2\) in a similar way to what Bitto and Frühwirth-Schnatter (2019) and Cadonna et al. (2020) suggest in the context of time varying parameter models.

As far as the computational implementation is concerned, as a consequence of the use of mixture of Gaussians to approximate the Logistic distribution, no Metropolis-Hastings steps are required in the proposed MCMC algorithm. Although this can be considered an advantage, the increase in the computational burden due to the introduction of an additional latent variable should not be ignored. Furthermore, the implemented MCMC algorithm relies on the specification of a fixed value for the number of components. If this quantity is unknown, it is necessary to estimate it by running the algorithm many times with different inputs, which might be time consuming, especially when the “true” value is large. One solution could be incorporating the choice of G within the algorithm itself. For instance, a reversible jump MCMC algorithm could be exploited (Richardson and Green 1997), by designing appropriate dimension-changing moves. Alternatively, the issue of choosing the optimal value for G could be circumvented by focusing the attention on the posterior distribution of the number of nonempty components, through the combination of a large value for G with appropriate prior distributions, as suggested in Malsiner-Walli et al. (2016). This latter strategy seems more coherent with the peculiar behaviour observed in the simulation studies, where the proposed MCMC algorithm occasionally converges to a solution that is characterised by empty components.