Seemingly unrelated clusterwise linear regression for contaminated data

Clusterwise regression is an approach to regression analysis based on finite mixtures which is generally employed when sample observations come from a population composed of several unknown sub-populations. Whenever the response is continuous, Gaussian clusterwise linear regression models are usually employed. Such models have been recently robustified with respect to the possible presence of mild outliers in the sub-populations. However, in some fields of research, especially in the modelling of multivariate economic data or data from the social sciences, there may be prior information on the specific covariates to be considered in the linear term employed in the prediction of a certain response. As a consequence, covariates may not be the same for all responses. Thus, a novel class of multivariate Gaussian linear clusterwise regression models is proposed. This class provides an extension to mixture-based regression analysis for modelling multivariate and correlated responses in the presence of mild outliers that let the researcher free to use a different vector of covariates for each response. Details about the model identification and maximum likelihood estimation via an expectation-conditional maximisation algorithm are given. The performance of the new models is studied by simulation in comparison with other clusterwise linear regression models. A comparative evaluation of their effectiveness and usefulness is provided through the analysis of a real dataset.


Introduction
In multivariate regression analysis, when modelling the dependence of a random vector Y = (Y 1 , . . . , Y m , . . . , Y M ) of M responses on a given vector X = (X 1 , . . . , X p , . . . , X P ) of P predictors through a sample S = {(x 1 , y 1 ), . . . , (x I , y I )} drawn from a certain population, the following sources of complexity could affect the data and make the prediction of the responses a task difficult to perform.
(a) With multivariate longitudinal data, time-series data or repeated measures, the M responses contained in Y are typically correlated. Furthermore, in analyses of economic data or data from the social sciences, it is not unusual that prior information about the phenomenon under study enables the analyst to specify a system of M regression equations (one equation for each response) in which certain regressors contained in X are absent from certain regression equations. This is especially true for multivariate economic data referring to general theories (i.e., investment equations, production functions) or applications dealing with the explanation of a certain economic activity (i.e., demand of petrol, employment) in different geographical locations (see, e.g., Giles and Hampton 1984;White and Hewings 1982;Zellner 1962). Further examples can be found also in other fields, such as medicine, food quality, tourism economics, quality of life and health (see, e.g., Cadavez and Hennningsen 2012;Disegna and Osti 2016;Heidari et al. 2017;Keshavarzi et al. 2012Keshavarzi et al. , 2013. A parametric framework able to take into consideration both multivariate correlated responses and systems of regression equations with equation-dependent vectors of predictors (i.e., vectors which do not necessarily contain the same predictors for all the responses) is given by the so-called seemingly unrelated regression approach (see, e.g., Park 1993;Srivastava and Giles 1987). In particular, in this approach the random disturbances associated with the M regression equations are allowed to be correlated with each other; hence, the variance-covariance matrix of the resulting M-dimensional vector of the error terms will have a non-diagonal structure. (b) In general, real data can often be characterised by the presence of atypical observations. In parametric regression analysis, such observations negatively impact on both the estimation of the regression coefficients and the prediction of the responses based on the classical procedures. Such procedures have been widely recognized to be extremely sensitive to even seemingly minor or negligible deviations from some conventional assumptions (see, e.g., Tukey 1960). Thus, when the data are contaminated by such observations, it is crucial that robust methods are employed (see, e.g., Maronna et al. 2006). Departures from the Gaussian distribution of the error terms in the regression model caused by some mildly atypical observations can be managed by simply resorting to heavy-tailed models for the conditional distribution of Y|X = x. Those observations are also called small or mild outliers (see, e.g., Ritter 2015). Examples of robust methods against the presence of such outliers have been developed by Lange et al. (1989), Kibria and Haq (1999), Lachos et al. (2011); to this end, the multivariate t distribution or scale mixtures of Gaussian distributions have been exploited. Another model able to manage the possible presence of mild outliers in a dataset is the contaminated Gaussian distribution (see, e.g., Aitkin and Wilson 1980;Tukey 1960). This probabilistic model is defined as a mixture of two Gaussian distributions having the same expected mean values but different variances-covariances. Furthermore, the Gaussian distribution having the smallest mixing weight also has inflated variances-covariances and is employed to represent the mild outliers. Maximum likelihood (ML) estimation can be performed via an expectation-maximisation (EM) algorithm (see Aitkin and Wilson 1980;Dempster et al. 1977). Once such a model is fitted to the observed data, each sample observation can be classified as either typical or outlier using the maximum a posteriori probability (for further details see, e.g., Aitkin and Wilson 1980). With an approach based on the use of one of these distributions, robustness can be achieved without suppressing any observation from the sample S. (c) Sometimes the population from which the sample S comes from is composed of a certain number, say K , of sub-populations. Furthermore, when the information about the value of K and the specific sub-population each sample observation belongs to is not known, S is characterised by unobserved heterogeneity. If this source of heterogeneity affects the distribution of Y|X = x, then a mixture of K different regression models (one for each sub-population) will describe the distribution of Y|X = x in the population. This phenomenon can be experienced in many fields, such as economics, marketing, agriculture, education, human genomics, quantitative finance, social sciences and transport systems (see, e.g., Ding 2006;Dyer et al. 2012;Elhenawy et al. 2017;Fair and Jaffe 1972;Kamakura 1988;McDonald et al. 2016;Qin and Self 2006;Tashman and Frey 2009;Turner 2000;Van Horn et al. 2015). In this case, the sample S should be analysed in a regression framework able to detect both the number of sub-populations and their regression models. Methods for clusterwise regression analysis play a special role. They exploit clusterwise regression models, which are mixtures of K regression models (see, e.g., De Sarbo and Cron 1988;Depraetere and Vandebroek 2014;Frühwirth-Schnatter 2006;Hosmer 1974). In these models, the mixing weights can also be expressed as a function of some concomitant variables (Wedel 2002). With M continuous responses in vector Y, multivariate Gaussian clusterwise linear regression models are generally employed (see, e.g., Jones and McLachlan 1992). If the P predictors are random and the source of heterogeneity mentioned above affects the distribution of (X, Y), then Gaussian cluster-weighted models should be employed (see, e.g., Dang et al. 2017).
Recently, Mazza and Punzo (2020) have introduced methods to perform Gaussian clusterwise linear regression analysis which are robust with respect to heavy-tailed departures from Gaussianity due to the presence of mild outliers in the data. By relying on contaminated Gaussian clusterwise linear regression models, their methods are able to produce a simultaneous clustering of the sample observations and the detection of mild outliers in a multivariate regression context. In this way, they allow to manage the sources of complexity (b) and (c); they are also capable of explaining the correlation among responses. A limitation of an approach based on those models is that the same vector of regressors has to be employed for the prediction of all responses. Galimberti and Soffritti (2020) have developed models for Gaussian clusterwise linear regression which make use of seemingly unrelated regression equations. The methods based on these latter models are suitable for the analysis of data affected by complexities (a) and (c); however, they are not insensitive to the possible presence of mild outliers in the K sub-populations. Based on all these considerations, multivariate seemingly unrelated clusterwise linear regression models for data contaminated by mild outliers are introduced here. They are obtained from the models described in Mazza and Punzo (2020) by modifying the definition of the linear terms in the M regression equations so that a different vector of regressors can be employed for each dependent variable. With these new models, the three sources of complexities mentioned above are jointly taken into consideration when predicting the responses in a multivariate linear regression framework. Thus, a more flexible approach for the analysis of linear dependencies in multivariate data is provided.
The key contributions of this paper are: • the specification of a novel class of models able to jointly account for the sources of complexity (a), (b) and (c) mentioned above; • a comparison with some other linear clusterwise regression models; • the description of conditions for the identifiability of the novel models; • details about ML estimation via an expectation-conditional maximisation (ECM) algorithm (Meng and Rubin 1993); • a treatment of the initialisation and convergence of the ECM algorithm and the issue of model selection; • an investigation of the effectiveness of the new models, based on simulated datasets, in comparison with the models proposed by Galimberti and Soffritti (2020) and Mazza and Punzo (2020); • an application to a study of the effects of prices and promotional activities on sales for two U.S. brands of canned tuna.
The remainder of this paper is organised as follows. The novel models are introduced in Sect. 2.1. Section 2.2 shows how they relate to some clusterwise linear regression models. Identifiability is treated in Sect. 2.3. Section 2.4 and Appendix A provide details on the ECM algorithm. Issues of algorithm initialisation, convergence criterion and model selection are discussed in Sects. 2.5 and 2.6 . Section 3 contains a summary of the experimental results obtained from the analysis of simulated data. The study of the effects of prices and promotional activities on U.S. canned tuna sales is presented in Sect. 4. Finally, in Sect. 5, some concluding remarks and ideas for future research are illustrated.

Seemingly unrelated contaminated Gaussian linear clusterwise regression models
In order to introduce the new model, the following notation is required. Suppose that only P m of the P covariates contained in X are considered to be relevant for the prediction of the response Y m , where P m ≤ P. Thus, let X m = (X m 1 , X m 2 , . . . , X m Pm ) be the vector composed of such P m covariates, and let X * m = (1, X m ) . Furthermore, let β km = (β k,m 1 , β k,m 2 , . . . , β k,m Pm ) be the vector of the P m regression coefficients capturing the linear effect of such covariates on the response Y m in the kth subpopulation, and β * km = (β 0k,m , β km ) . Then, the vector containing all linear effects on the M responses in the kth sub-population can be obtained by stacking the M regression coefficient vectors specific for the kth sub-population one underneath the other; it can be denoted as β where 0 P m +1 denotes the (P m + 1)-dimensional null vector. The random vector Y follows a seemingly unrelated contaminated Gaussian linear clusterwise regression model of order K if the conditional probability density function (p.d.f.) of Y|X = x has the form where π k is the mixing weight of the kth sub-population, with π k > 0 for k = 1, . . . , K , and K k=1 π k = 1; h (y; θ k ) is the contaminated Gaussian p.d.f. of Y|X = x in the kth sub-population, defined as follows: and φ M (·; μ, ) denotes the p.d.f. of an M-dimensional Gaussian distribution with expected mean vector μ and positive definite covariance matrix . The term μ k (x; β * k ) in Eq. (2) is the conditional expected value of Y|X = x in the kth sub-population; it 123 is defined as follows: wherex * denotes the realisation ofX * obtained when X = x. Thus,x * β * k coincides with an M-dimensional vector whose mth element is a linear combination of the realisations of the P m regressors selected for the prediction of Y m with weights given by the elements of vector β * km . Terms α k ∈ (0, 1) and η k > 1 are the weight of the typical observations in the kth sub-population and the factor contaminating the conditional variances and covariances of Y|X = x for the mild outliers in the kth sub-population, respectively. In robust statistics, it is generally assumed that at least half of the observations are typical; thus, it is also possible to consider α k ∈ [0.5, 1). As a consequence of the constraint η k > 1, η k represents an inflation parameter for the elements of k . θ k = (β * k , k , α k , η k ) is the parameter vector of model (2). The parameter vector of model (1) is given by ψ = (ψ 1 , . . . , ψ k , . . . , ψ K ), where ψ k = (π k , θ k ); the number of free parameters in this vector is equal to In summary, the conditional p.d.f. f (y|x; ψ) in Eq. (1) can be interpreted as a weighted average (namely, a mixture) of K Gaussian regression models with weights π k , k = 1, . . . , K . The kth component of this mixture represents a multivariate seemingly unrelated contaminated Gaussian linear regression model with intercepts and regression coefficients β * k , symmetric and positive definite covariance matrix k , proportion of typical points α k and inflation parameter η k . Thanks to the nondiagonal structure of the variance-covariance matrices k , k = 1, . . . , K , the proposed model is able to account for correlated random disturbances within each of the K sub-populations associated with the mixture (1). Since the contaminated Gaussian distribution (2) is a mixture of two Gaussian linear regression models which are both associated with the kth component of the mixture in Eq. (1), the model defined by this latter equation can also be considered as a mixture of 2K seemingly unrelated Gaussian clusterwise linear regression models, whose components can be grouped into K pairs, each of which contains two Gaussian components having the same expected values and proportional covariance matrices.

Comparisons with other linear clusterwise regression models
When specific conditions are met, some special linear regression models can be obtained from model (1).
• If M > 1 and X m = X ∀m (the same vector of predictors is considered for all responses), the following equality holds:x * = I M ⊗ x * , where I M is the identity matrix of order M and ⊗ denotes the Kronecker product operator (see, e.g., Magnus and Neudecker 1988). Equation (3) can be rewritten as where B k = β * k1 · · · β * km · · · β * k M . Thus, Eq. (1) reduces to the mixture of multivariate contaminated Gaussian regression models introduced by Mazza and Punzo (2020).
• If M > 1, α k → 1 and η k → 1 ∀k (there is no contamination in the data), the resulting model coincides with the mixture of multivariate seemingly unrelated linear regressions described in Galimberti and Soffritti (2020). • If α k → 1, η k → 1 ∀k and X m = X ∀m (there is no contamination in the data and the same vector of predictors is considered for all responses), Eq. (1) reduces to a mixture of either univariate Gaussian linear regression models (see, e.g., De Sarbo and Cron 1988;De Veaux 1989;Quandt and Ramsey 1978) or multivariate Gaussian linear regression models (see Jones and McLachlan 1992). • If α k → 1, η k → 1 ∀k, X m = X ∀m and β * k = β * ∀k (there is no contamination in the data, the same vector of predictors is considered for all responses and their effects are the same across all the sub-populations), the resulting model coincides with a linear regression model with error terms distributed according to a mixture of K either univariate Gaussian distributions (Bartolucci and Scaccia 2005) or multivariate Gaussian distributions (Soffritti and Galimberti 2011).
there is no contamination in the data and the effects of the predictors are the same across all the sub-populations), a multivariate seemingly unrelated linear regression model whose error terms are assumed to follow a Gaussian mixture model is obtained (Galimberti et al. 2016).
Seemingly unrelated regression models represent multivariate regression models in which prior information about the absence of certain covariates for the prediction of certain responses is explicitly taken into consideration (Srivastava and Giles 1987). Thus, Eq. (1) can also be seen as a mixture of multivariate contaminated Gaussian regression models in which some regression coefficients are constrained to be a priori equal to zero. To the best of the authors' knowledge, the inclusion of such constraints in these latter models has not been addressed yet. Models obtained from Eq. (1) by embedding different constraints on the regression coefficients could also be employed in any practical application in which the relevant regressors for each response cannot be established from a priori information and, thus, the choice of the regressors to be used for the M responses is questionable. As it will be illustrated in Sect. 4, in such situations strategies based on a joint use of models (1) and variable selection techniques could be devised and employed.

Identifiability
A preliminary requirement for the consistency and other asymptotic properties of the ML estimator is represented by identifiability of the model parameters. Thus, before detailing ML estimation of ψ, a discussion about identifiability of model (1) is provided here. Consider the class of models F = {F K , K = 1, . . . , K max }, where F K = { f (y|x; ψ), ψ ∈ }, f (y|x; ψ) is the p.d.f. of Y|X = x under the seemingly unrelated contaminated Gaussian linear clusterwise regression model of order K defined in (1) and K max denotes the maximum order specified by the researcher for that model. This class is said to be identifiable if, for any two models M,M ∈ F with parameters ψ = (ψ 1 , . . . , ψ k , . . . , ψ K ) andψ = (ψ 1 , . . . ,ψ k , . . . ,ψK ), respectively, Several types of non-identifiability can affect the model class F. A first type is due to invariance to relabeling the components of the mixture (also known as labelswitching). Non-identifiability can also be caused by potential overfitting associated with empty components or equal components (see, e.g., Frühwirth-Schnatter 2006). Imposing suitable constraints on the parameter space can prevent such sources of non-identifiability for F. Another type of non-identifiability affecting this class is specifically associated with the use of finite mixtures in linear regression analysis with fixed covariates, which requires an additional constraint on the number of components of the mixture (1) (see Hennig 2000). Non-identifiability due to empty components is avoided by requiring the positivity of all the mixing weights π k . Conditions specifically devised for ensuring identifiability of mixtures of contaminated Gaussian regression models are provided in Mazza and Punzo (2020). These results have been exploited in Theorem 1 to show that model (1) is identifiable if the parameters (β * k , k ), k = 1, . . . , K , are pairwise distinct and the order K is exceeded by the number of distinct (P m − 1)-dimensional hyperplanes required to cover the covariates employed for the prediction of Y m , for m = 1, . . . , M. In order to state Theorem 1, the following notation is also required: · F is the element-wise matrix 2-norm (also known as the Frobenious norm); H P m −1 = {x m ∈ R P m : λ x m = c, λ ∈ R P m , λ = 0} is a (P m − 1)dimensional hyperplane; J m is the minimum number of such hyperplanes required to cover the covariates x m ; H P m −1 is the space of (P m − 1)-dimensional hyperplanes of R P m .
Conditions (C1) and (C2) are obtained from Mazza and Punzo (2020) after suitable modifications of similar conditions required for the identifiability of their mixtures of contaminated Gaussian regression models. In particular, condition (C2) results from a simple substitution of the vector β * k of model (1) for the matrix B k introduced in Eq. (4) containing the intercepts and regression coefficients in the kth component of the regression mixture model developed by Mazza and Punzo (2020). The modifications involved in the definition of the condition (C1) derive from the fact that each Y m ∈ Y may have its own covariates and, thus, M different restrictions on K have to be required, each one involving a (possibly) different minimum number of lowdimensional hyperplanes to cover those covariates. As a consequence, the proof of Theorem 1 can be obtained by exploiting the same arguments illustrated in Mazza and Punzo (2020) for the proof of their theorem about identifiability of mixtures of contaminated Gaussian regression models.

Maximum likelihood estimation
The ML estimation of the parameters ψ is carried out here for a fixed value of K . Given a sample S of I independent observations drawn from model (1), the model log-likelihood is equal to (ψ) = I i=1 ln K k=1 π k h y i ; θ k . Following Mazza and Punzo (2020), ML estimatesψ can be computed by means of an ECM algorithm, which represents a variant of the EM algorithm usually employed for the computation of ML estimates from incomplete data. In the considered situation, the missing information is twofold. On the one hand, there is a classical source of incompleteness of any mixture model associated with the component memberships of the I sample observations. On the other hand, it is not known whether such observations are outliers with reference to any component or not. These two sources can be described by two different types of K -dimensional vectors. For the ith sample observation, they are given by z i and u i , respectively: observation is typical in the kth component and u ik = 0 if it is an outlier, for k = 1, . . . , K . Then, the set of complete data would be S c = {(x 1 , y 1 , z 1 , u 1 ), . . . , (x I , y I , z I , u I )}, and the complete-data likelihood function is equal to

123
Thus, up to an additive constant, the complete-data log-likelihood function employed in the ECM algorithm for the computation of the parameter estimates can be expressed as follows: is the squared Mahalanobis distance between y i and μ k ( The hth iteration of the E-step in the ECM algorithm consists in calculating the conditional expectation of l c (ψ) on the basis of the current estimate ψ (h) of the model parameters ψ; up to an additive constant, this expected value can be expressed as follows: ik are the posterior probabilities (evaluated using ψ (h) ) that the ith observation is generated from the kth component of the mixture (1) and that the ith observation is a typical point of such a component, respectively: with Z i = (Z i1 , . . . , Z i K ) denoting a K -dimensional multinomial random vector with probabilities π = (π 1 , . . . , π K ) , and U ik |Z ik = 1 having a Bernoulli distribution with success probability of α k . As far as the conditional maximisation is concerned, the update of ψ (h) is carried out by considering the following two parameter sub-vectors: γ = (π, β * , , α) and η = ( Such updates coincide with the ones reported in Mazza and Punzo (2020) for their model. Based on Eq. (9), it is possible to highlight that the update η (h+1) k will be larger when the kth component is highly contaminated by the presence of outliers (i.e., when it is characterised by many observations with a small value ofû (h) ik and a large value of the squared Mahalanobis distance from μ k (x i ; β * (h+1) k )). As far as the remaining parameters are concerned, their updates are (details are reported in the Appendix): It is worth noting that the matrix I i=1ẑ (10) has to be nonsingular; otherwise, the update β * (h+1) k cannot be computed. Equation (10) also highlights that this update can be considered as a generalised least squares estimate with weights 123 depending onŵ (h) ik ; this latter term also affects the update (h+1) k in (11), which represents a weighted sum of squared residuals. Using such weights leads to a reduction in the effects of the outliers on the estimation of β * (h+1) k ; thus, this approach provides robust estimates of β * (h+1) k , for k = 1, . . . K . Furthermore, based on (12), sample observations with the highest posterior estimated probabilities of being generated from the kth component and of representing typical points in the kth component will have the largest impact on the updates of both the regression coefficients and covariances within that component.
Once the convergence is reached and the ML estimatesψ are computed, by exploiting Eq. (6) the ECM algorithm provides estimates of the posterior probabilities Such estimated probabilities can be employed to partition the I sample observations into K clusters, by assigning each observation to the component showing the highest posterior probability; for the ith observation: Furthermore, Eq. (7) allows to compute the estimated posterior probabilities =û ik , and an intra-cluster distinction between typical observations and mild outliers can be defined: the ith observation will be classified as an outlier of the hth cluster, where h is the label of the component for which MAP(ẑ ik ) = 1, if u ih < 0.5. From the ML estimatesψ and Eq. (5) it is also possible to compute the estimated squared Mahalanobis distancesd 2 . . , I , k = 1, . . . , K , which can be employed as multivariate measures of the outlyingness of the I sample observations with respect to the K clusters detected by the model. From the definition of the squared Mahalanobis distance given in Eq. (5) and the expressions forû (h) ik andŵ (h) ik reported in Eqs. (7) and (12), respectively, it is possible to express bothû ik andŵ ik as decreasing functions ofd 2 ik (see Mazza and Punzo 2020, for the explicit expressions). Thus, atypical observations could also be detected and studied by considering the values ofd 2

Technical details about the ECM algorithm
A crucial point of any EM-based algorithm is the choice of the starting values for the model parameters (i.e., ψ (0) ). Multiple executions of the algorithm in association with multiple random initialisations or approaches based on non-random choices of either ψ (0) or the missing information can provide a solution (see, e.g., Biernacki et al. 2003;Karlis and Xekalaki 2003). As far as the ECM algorithm described above is concerned, the initialisation technique illustrated in Mazza and Punzo (2020) could be modified so as to be employed also for model (1). This task would require setting the initial values z (0) ik , i = 1, . . . , I , k = 1, . . . , K , equal to the posterior probabilities from the EM algorithm for the estimation of the seemingly unrelated Gaussian clusterwise linear regression models, which are nested in model (1) when α k → 1 − and η k → 1 + , k = 1, . . . , K ; furthermore,û (0) ik = 0.999, i = 1, . . . , I , k = 1, . . . , K . Another strategy for the initialisation of ψ which exploits the relationship between model (1) and seemingly unrelated Gaussian clusterwise linear regression models (see Sect. 2.2) could be composed of the following three steps. Firstly, a Gaussian mixture model with K components is fitted to the sample residuals of a seemingly unrelated linear regression model (Srivastava and Giles 1987); this allows to obtain the starting values π (0) k and (0) k . Secondly, the starting values β * (0) k are obtained from the fitting of K different seemingly unrelated linear regression models, one for each cluster of the partition associated with the Gaussian mixture model considered in the previous step. Thirdly, α (0) k and η (0) k , k = 1 . . . , K , are set equal to 0.999 and 1.001, respectively. Models involved in the first two steps can be estimated through the packages mclust (Scrucca et al. 2017) and systemfit (Henningsen and Hamann 2007) in the R environment (R Core Team 2021). In the analyses of Sects. 3 and 4 , the ECM algorithm has been initialised using this latter strategy. Furthermore, since (1 − α k ) in model (1) can be considered as the proportion of outliers in the kth sub-population, when this model is employed for outlier detection, a reasonable requirement is that in each cluster the number of typical observations cannot be smaller than the number of outliers, that is α k ∈ [0.5, 1) ∀k. To guarantee this result, constraints on the estimation of α k , k = 1, . . . , K , have been included in the ECM algorithm; namely, Eq. (8) has been modified as follows: α In order to avoid premature stops of the ECM algorithm associated with the use of lack of progress stopping criteria, such as the one based on the difference between the log-likelihood values at two consecutive steps, a convergence criterion based on the Aitken acceleration (Aitken 1926) has been adopted. It consists in stopping the algorithm when | Aitken accelerated estimate of the log-likelihood limit, and (ψ (h) ) is the incomplete log-likelihood evaluated at ψ (h) (see, e.g., McNicholas 2010). Furthermore, a criterion based on a maximum number of iterations for the ECM algorithm has been employed. In the analyses of Sects. 3 and 4 , the maximum number of iterations and have been set equal to 500 and 10 −6 , respectively. Furthermore, in order to circumvent the possible issue of unbounded likelihood associated with a degenerate model, the ECM algorithm has been developed by embedding some constraints on the eigenvalues of (h) k for k = 1, . . . , K . Namely, for all estimated covariance matrices, the ratio between the smallest and the largest eigenvalues is required to be not lower than 10 −10 .

Determining the value of K
As illustrated in Sect. 2.4, the ML estimation of ψ based on the ECM algorithm is carried out for a given number of mixture components. When this number is not known and has to be determined from the data S, it is common practice to employ model 123 selection criteria able to take account of different aspects which are considered relevant when evaluating the adequacy of a model (see, e.g., Depraetere and Vandebroek 2014;Frühwirth-Schnatter 2006). For example, the Bayesian Information Criterion (Schwarz 1978) provides a trade-off between the fit and the model complexity; it can be computed as follows: Model selection criteria that also consider the uncertainty of the estimated partition of the sample observations could be employed. An example is represented by the integrated completed likelihood (Biernacki et al. 2000), which can be computed according to different ways of measuring the uncertainty of the estimated partition (see, e.g., Andrews and McNicholas 2011;Baek and McLachlan 2011): These latter criteria penalize complex models more severely than B I C because of the presence of an additional penalty, which represents the estimated mean entropy. Thus, when using these criteria in comparison with the B I C, one cluster should be less likely split into two different components. I C L 1 and I C L 2 differ on whether a soft (i.e.,ẑ ik ) or hard (i.e., MAP(ẑ ik )) clustering is considered in the estimation of the mean entropy. Higher values of these criteria indicate better-fit models; as it will be illustrated in Sect. 4, B I C, I C L 1 and I C L 2 can also be employed to select the predictors to be considered in the linear terms employed in the prediction of the M responses in model (1).

Settings
This section focuses on the investigation of the effectiveness of models (1) (mixtures of contaminated seemingly unrelated Gaussian regressions, hereafter denoted as MCSG) in comparison with other approaches using simulated datasets. This task has been carried out in a multivariate setting with M = 4 responses, P = 4 covariates and datasets comprising K = 3 groups of observations. The additional models considered in the comparison are those described by Mazza and Punzo (2020) and Galimberti and Soffritti (2020). From now on, these latter models have been denoted as MCG (mixtures of contaminated Gaussian regressions) and MSG (mixtures of seemingly unrelated Gaussian regressions), respectively. The simulated datasets have been generated using three different data generation processes: (a) MSG; (b) MCSG with α k = 0.9 ∀k, η 1 = 40, η 2 = η 3 = 20; (c) mixtures of regression models with seemingly unrelated t-distributed errors (MSt), with ν 1 = ν 2 = ν 3 = 4 degrees of freedom.
In all the regression models employed to generate the datasets, the response Y m has been assumed to depend on X m , for m = 1, 2, 3, 4; thus, P m = 1 ∀m. With each process, the following parameters have been employed: π 1 = 0.3, π 2 = 0.5, π 3 = 0.2, It is worth noting that the second and third components only differ in the intercepts of the four regression equations. Covariate values have been generated by a uniform distribution over the interval (−5, 5). As concerns , two alternatives have been considered in order to produce two different degrees of separation between groups of observations: = 9 (higher degree), = 6.5 (lower degree). Figure 1 shows the scatterplots of the variables Y 1 and X 1 for a sample of size I = 1000 generated using the MSG (upper panel), MCSG (central panel) and MSt (lower panel) processes with = 9 (on the left) and = 6.5 (on the right). Due to the values of the regression coefficients employed to model the linear dependencies of Y m and X m across the three components, the scatterplots of Y m and X m for m = 2, 3, 4 are similar. Under each data generating process, 100 random samples of size I have been simulated for each . As far as the sample size is concerned, the following values have been examined: I = 500, 1000. Thus, the degree of separation and the sample size can be considered as experimental factors. This yields a total of 600 generated datasets for each I . The whole analysis has been run on an IBM x3750 M4 server with 4 Intel Xeon E5-4620 processors with 8 cores and 128GB RAM.

Results
A first analysis has been carried out where the MSG, MCG and MCSG models of order K = 3 have been fitted to each dataset. It is worth noting that the MCG models have been specified and estimated by assuming that each of the four responses depends on all covariates. Thus, using such models leads to non-parsimonious specifications for all the models that have generated the simulated datasets, as 12 regression coefficients for each component have been estimated although in fact they are equal to zero. The average execution times (over the 100 datasets with I = 500) for the MCSG models have ranged between 2.499 and 55.020 s, depending on the process and the specific value of employed to generate the datasets. Concerning the other two models, the minimum and maximum average execution times have resulted to be equal to 1.722 and 24.580 s with MSG models, 7.765 and 58.520 s with MCG models. It is worth noting that, since  the implementation of the ECM algorithm has not been carried out with the goal of being efficient from a computational point of view, these CPU times should be regarded as merely illustrative and can be reduced using more efficient implementations. In the first analysis, the performances of the three competing models have been evaluated with respect to the following aspects: (i) the estimation of the proportions of typical observations and the degrees of contamination (proper estimation of α k and η k ); (ii) the ability to recover the true values of the unknown parameters (parameter recovery); (iii) the ability to recover the true partition of the sample observations (classification recovery). When evaluating properties of the parameter estimators using simulation studies under mixture models, there may be label switching issues. Several labeling methods have been proposed. For the models examined here, as in Bai et al. (2012), A second analysis has been carried out so as to obtain an evaluation of the three approaches without exploiting the knowledge of the true number of components. Thus, in addition to the models already examined in the first analysis, also models of order K = 1, 2, 4, 5 have been fitted to each dataset. All the obtained results have been employed to collect information on the following aspects: (iv) the capability to reach the best trade-off between the fit and model complexity; (v) the ability of B I C, I C L 1 and I C L 2 to detect the true value of K (comparison among information criteria).

Estimation of˛k and Á k
The aspect (i) has been studied for the fitted MCG and MCSG models with K = 3. Under the first two data generation processes, the averages of the estimated proportions of good points (α k ) and the estimated inflation parameters (η k ) are close to their true values under both MCG and MCSG models, regardless of the level of separation and the sample size (see the upper part of Tables 1 and 2). However, it is worth noting that slightly lower standard deviations of such estimates have been registered under the first process, thus giving an indication of a higher stability of the obtained estimates; furthermore, the estimation of η 1 , η 2 and η 3 under the second process appears to be characterised by a certain instability, which results to reduce as the sample size I 123 increases using both MCG and MCSG models. As far as the results from the analyses of the datasets generated using the third process are concerned (lower part of Tables 1 and 2), the estimated values of α k and η k , k = 1, 2, 3, are far from 1, regardless of the values of and I . Thus, the departure from a four-dimensional Gaussian distribution for the errors of the regression model has been detected within each of the three mixture components of both MCG and MCSG models for both sample sizes. The standard deviations ofη k , k = 1, 2, 3 are high, and this result holds true particularly with MCG models and I = 1000.

Parameter recovery
The evaluation of the aspect (ii) has been focused on the regression coefficients β km and has been carried out by computing the following quantities: Bias β km =  whereβ (r ) km is the ML estimate of β km obtained from the r th dataset (r = 1, . . . , 100) using models of order K = 3. With I = 500 and under the first data generating process (Table 3), MSG and MCSG models show the same performance in terms of recovering the true values of the regression coefficients with both degrees of separation. The good performance of MCSG models is consistent with the proper estimation of α k and η k associated with these models under the first process (see the previous aspect). On the contrary, the inclusion of irrelevant predictors in the four regression equations (MCG models) leads to a slight increase in the RMSEs. With contaminated datasets of size I = 500, as expected, the lowest (absolute) biases and RMSEs are obtained using the MCSG model (see Table 4); there also seems to be a tendency for MCG models to perform slightly better than MSG models for the majority of the regression coefficients. When the datasets are generated with I = 500 and according to the third process, the highest accuracy in the estimation of the regression coefficients is obtained using MCSG models (see Table 5). It is also worth noting that, in spite of their ability to detect a departure from the Gaussian distribution within each component, MCG models show the lowest accuracy. Similar results have been obtained with I = 1000 (see Tables 6, 7 and 8).

Classification recovery
To obtain information on the aspect (iii), the partitions of the sample units associated with the models of order K = 3 under each competing model class have been compared with the true partition; the agreement with this latter partition has been measured by resorting to the adjusted Rand index (AR I ) (Hubert and Arabie 1985). When the datasets are generated using the first process and the highest level of separation (see the upper part of Tables 9 and 10), an almost perfect classification recovery (AR I = 0.999) is obtained by each of the three models regardless of the sample size. When the level of separation is low ( = 6.5), a slight decrease in the ability to recover the true partition of the sample observations is registered for all models and, in particular, for the MCG ones when I = 500 (AR I = 0.937). When there are outliers in the data and = 9, the best performance is obtained using either MCG models or MCSG models with both sample sizes (AR I = 0.91); these latter models slightly outperform MCG models when = 6.5. As far as MSG models are concerned, due to their inability to manage the presence of mild outliers in the data, the classification recovery appears to be markedly lower, especially with the lowest level of separation (AR I = 0.723 with I = 500, AR I = 0.716 with I = 1000). Under the third process and the highest level of separation, good performances are obtained by all models with both sample sizes (AR I > 0.93). When the level of separation is reduced, a general decrease in the capability to reconstruct the true partition is registered; MCSG models appear to be less affected by this tendency, regardless of the sample size.

Trade-off between fit and complexity
In order to study the aspect (iv), for each dataset and each model class, the models of orderK I C have been selected, where I C denotes an information criterion (I C ∈ 123 {B I C, I C L 1 , I C L 2 }) andK I C = arg max I C(K ) for K ∈ {1, 2, 3, 4, 5}. Then, the average values of the 100 resulting values of B I C(K B I C ), I C L 1 (K I C L 1 ) and I C L 2 (K I C L 2 ) have been computed within the three model classes. As expected, when datasets of I = 500 observations are generated without outliers (first process), the best trade-off between the fit and model complexity is reached by MSG models, regardless of the level of separation and the criterion employed to select the best model (see the upper part of Table 11). With these datasets, MCSG models slightly outperform MCG models. When there are outliers in the data (second process) or the error terms of the K regression models have tails heavier than the Gaussian ones (third process), MCSG shows the best performance in terms of capability to reach the best trade-off between fit and complexity, regardless of the level of separation and the criterion employed to select the best model (see the lower part of Table 11). Interestingly, when the outliers are generated using a MCSG model (second process), MSG models slightly outperform MCG models, regardless of the value of . Similar conclusions can be drawn also from the results obtained when I = 1000 (see Table 12).

Comparison among information criteria
As far as the aspect (v) is concerned, the attention has been focused on the number of times each value of K has been selected by each examined criterion. With datasets generated using the first process and the highest level of separation, all the examined information criteria always recognize the presence of three clusters, regardless of the fitted model and the sample size (see the upper part of Tables 13 and 14 ). If the level of separation is reduced ( = 6.5), the B I C still tends to correctly identify the presence of three clusters regardless of the fitted model only with the largest sample size. If I = 500, the same tendency is slightly weaker with MSG and MCSG models; the order of the models employed to generate the datasets is always underestimated by the B I C when MCG models are employed. I C L 1 and I C L 2 show a clear preference for K = 3 components only when models embedding the information on the relevant regressors (e.g., MSG and MCSG) are employed and the sample size is I = 1000. Otherwise, they generally underestimate the true number of clusters. Under the second process, when MSG models are fitted to the data, all the examined information criteria show a clear tendency to select K = 4 components an additional component accommodating     outliers is typically selected), regardless of the level of separation and the sample size (see also Mazza and Punzo 2020). On the contrary, with both MCG and MCSG models, the three criteria almost always correctly identify three components, regardless of the sample size, provided that the degree of separation is high. When = 6.5, the same result is obtained by the B I C in association with MCG and MCSG models and by I C L 1 in association with MCSG models only with the largest sample size; otherwise, due to both a low separation between two clusters and a low sample size, the examined criteria generally underestimate the true value of K . This behaviour is particularly evident when the selection of K is based on I C L 2 . A possible explanation for this is that the penalty employed by I C L 2 (a function of the uncertainty of the estimated posterior probabilitiesẑ ik ) is the most severe and is also expected to be particularly large whenever the analysed dataset contains true clusters which are not well separated. When the datasets are generated using the third process and the smallest sample size, the obtained results show that, if = 9, the three criteria generally detect the true value of K (see the lower part of Table 13). This tendency appears to be stronger when MCG and MCSG models are employed. These results hold true also with I = 1000 except when MSG models are fitted to the data and K is selected using either the B I C or the I C L 1 ; in these latter situations the true K is overestimated. On the contrary, when the degree of separation is low, models of order K = 2 are generally selected from each examined model class according to I C L 1 and I C L 2 , regardless of the sample size. Also this result could be due to the role played by the penalties employed by these two latter criteria in the presence of true clusters which are not well separated. As far as the B I C is concerned, it allows to detect the true number of components only when MCSG models are fitted to samples of size I = 1000. It also shows a tendency to underestimate the true K both with MCSG models fitted to smaller samples and with MCG models regardless of the sample size. Finally, a slight preference with MSG models of order K = 2 and K = 4 emerges in association with samples of size I = 500 and I = 1000, respectively.

Results from the analysis of canned tuna sales
The practical usefulness and effectiveness of the proposed models have been evaluated through the analysis of a dataset containing the volume of weekly sales (Move) for seven of the top 10 U.S. brands in the canned tuna product category for I = 338 weeks between September 1989 and May 1997 (Chevalier et al. 2003). Measures of the display activity (Nsale) and the log price (Lprice) of each brand in each week are also available. This dataset is included in the R package bayesm (Rossi 2012). The analysis here considers two products: Star Kist 6 oz. (SK) and Bumble Bee Solid 6.12 oz. (BBS). In order to study the dependence of canned tuna sales on prices and promotional activites for these two brands, the analysis has been carried out starting from the following vectors of variables: Y = (Y 1 = Lmove SK, Y 2 = Lmove BBS), X = (X 1 = Nsale SK, X 2 = Lprice SK, X 3 = Nsale BBS, X 4 = Lprice BBS), where Lmove denotes the logarithm of Move; thus, M = 2 and P = 4. Previous studies focused on other brands are illustrated in Galimberti et al. (2016) and Galimberti and Soffritti (2020). The analysis has been carried out through MSG, MCG and MCSG models. The additional class comprising mixtures of linear Gaussian regression models (Jones and McLachlan 1992) has been included in the comparison; the notation employed for this model class is MRM. Models from each of these four classes have been estimated for K ∈ {1, 2, 3, 4}. Furthermore, since prices and promotional activities for one product could have an impact on the sales of the other product, models from MSG and MCSG classes have been specified and fitted by considering all possible sub-vectors of X as vectors X m , m = 1, 2, for each K . Thus, the analysis has also included an exhaustive search of the relevant regressors for both Lmove SK and Lmove BBS. For each K , 2 P·M = 256 different mixtures of regression models have been estimated either with contamination or without contamination; the overall number of estimated models is 2048. It is worth noting that none of the models employed in this analysis explicitly accounts for serial dependencies that may characterise this dataset. Figure 2 shows the values of B I C, I C L 1 and I C L 2 for the fitted MCSG, MSG, MCG and MRM models which maximise each of these model selection criteria by K . An analysis based on a single linear regression model without contamination (MSG and MRM models with K = 1) is clearly inadequate according to all criteria. The best trade-off among the fit, the model complexity and the uncertainty of the estimated partition of the weeks is reached by models of order K = 2 for each of the four examined model classes. If model selection is only based on the fit and the model complexity, the best MCSG and MCG models still have K = 2 components, while MSG and MRM models of order K = 3 should be preferred.  Table 15 reports more detailed information about the six models which best fit the analysed dataset according to the three model selection criteria over the five examined values of K within each model class. All the examined criteria select a seemingly unrelated contaminated Gaussian linear clusterwise regression model of order K = 2 as the overall best model for studying the effect of prices and promotional activities on sales for the two brands. In this model, the log unit sales of SK canned tuna are regressed on the log prices and the promotional activities of the same brand; as far as the regressors for the BBS log unit sales are concerned, the selected regressors are the log prices of both brands and the promotional activities of BBS. From the parameter estimates (see Table 16) it emerges that the analysed dataset is characterised both by heterogeneity over time and by the presence of atypical observations. This latter feature seems to characterise the two clusters of weeks detected by the model almost in the same way (the estimated weights of the typical observations areα 1 = 0.827 andα 2 = 0.829); however, the strength of the contaminating effect on the conditional variances and covariances of Y|X = x results to be stronger in the first cluster, where the estimated inflation parameter for the elements of 1 is larger (η 1 = 13.44). Heterogeneity over time appears to emerge both in some effects of the selected regressors and in the conditional expected variances and covariances of log sales for the typical observations. From the estimates of the regression equation for Lmove SK it emerges that sales of SK canned tuna are negatively affected by prices and positively affected by promotional activities of the same brand within both clusters detected by the model. However, the estimated effects of these two variables in the first cluster result to be stronger than those in the second cluster. Similar results have been obtained with reference to the regression equation for Lmove BBS, from which it also emerges that the log prices of SK canned tuna positively affect the log unit sales of the other brand, especially in the first cluster of weeks. As far as the estimated conditional variances and covariances are concerned, typical weeks in the first cluster appear to be characterised by values of Lmove SK which are more homogeneous than those of Lmove BBC; the opposite holds true for the typical weeks belonging to the second cluster. Heterogeneity over time appears to emerge also in the correlation between log sales of SK and BBS products, which is slightly positive (0.191) within the largest cluster of weeks, while a mild negative correlation (− 0.299) between Lmove SK and Lmove BBC is estimated in the weeks belonging to the first cluster. The first cluster determined according to the highest estimated posterior probabilities of the selected model is composed of 20 weeks; 17 of these weeks are consecutive (from week no. 58 to week no. 74) and correspond to a period (from mid-October 1990 to mid-February 1991) characterised by a worldwide boycott campaign encouraging consumers not to buy Bumble Bee tuna because Bumble Bee was found to be buying yellow-fin tuna caught by dolphin-unsafe techniques (Baird and Quastel 2011). The selected model seems to suggest that such events may be one of the sources of the unobserved heterogeneity detected by the analysis. The fact that the estimated effects of all the selected regressors on the log prices of both products are stronger in the first cluster of weeks and weaker in the second cluster could be associated with those events. According to the rule for the intra-class distinction between typical observations and mild outliers illustrated in Sect. 2.4, some weeks have been classified as mild outliers within both clusters. As far as the first cluster is concerned, this has happened for week no. 60 (immediately after Halloween 1990) and week no. 73 (2 weeks immediately before Presidents day 1999). For these weeks, the estimated squared Mahalanobis distancesd 2 i1 , equal to 36.68 and 37.82, respectively, appear to be extremely higher than those of the other 18 weeks of the same cluster, which are comprised between 0.05 and 7.05. From the estimated sample residuals y i −μ 1 (x i ;β * 1 ) for the 20 weeks belonging to the first cluster (see the scatterplot on the left side of Fig. 3) it emerges that week no. 60 noticeably deviates from the other weeks because log unit sales of SK tuna are slightly lower than the predicted value, while an opposite result characterises the log unit sales of BBS tuna. On the contrary, the selected model identifies week no. 73 as a mild outlier mainly because of a large overestimation of the sales of BBS tuna. Among the 318 weeks of the second cluster, 35 have resulted to be mild outliers, most of which are associated with holidays and special events that took place between September 1989 and mid-October 1990 or between mid-February and May 1997. The scatterplot with the estimated sample residuals y i −μ 2 (x i ;β * 2 ) for all the weeks of the second cluster (see the right side of Fig. 3) shows that, for the majority of the 35 mild outlying weeks, the reason for the outlyingness detected by the model has been an overestimation or an underestimation of the sales for either brands. The values of the estimated distancesd 2 i2 for the weeks that have been classified as typical are between 0.003 and 7.993; the minimum and maximum of the same distances for the outlying weeks are 8.20 and 114.95, respectively.

Conclusions
A new family of seemingly unrelated clusterwise linear regression models for possibly contaminated data has been introduced. Such models can account for heterogeneous regression data with mild outliers and multivariate correlated responses, each one depending on its own vector of covariates. This latter feature represents the main novelty of the models proposed here in reference with the ones described in Mazza and Punzo (2020). The new family encompasses several other types of Gaussian mixturebased linear regression models previously proposed in the literature. It also provides a more flexible framework for modelling data in applications where sample observations could be atypical and different covariates are expected to be relevant in the prediction of different responses, based on some prior information to be conveyed in the analysis. The new family could be made more flexible by exploiting the approach illustrated in Celeux and Govaert (1995), which allows to introduce constraints on the elements of the covariance matrices k , k = 1, . . . , K , so that models with a lower number of variances and covariances of Y|X = x in the K sub-populations are obtained. Monte Carlo studies have shown that the choice of the number of components and the reconstruction of the true classification of the sample observations can be negatively affected by the inclusion of irrelevant regressors in a clusterwise linear regression model, especially with overlapping clusters of observations. Whenever the choice of the regressors to be considered in the specification of the linear predictor of each response is questionable, models introduced here can be employed in conjunction with techniques for variable selection (e.g., genetic algorithms, stepwise strategies) in a multivariate regression setting in order to detect the relevant predictors for each regression equation. Since the ECM algorithm for the ML estimation of the model parameters does not automatically produce any estimate of the covariance matrix of the ML estimator, additional computations are necessary to obtain an assessment of the sample variability of model parameter estimates. This task could be carried out by means of some approaches commonly employed under finite mixture models (see, e.g., McLachlan and Peel 2000). We are currently developing an extension of the methods proposed herein to some mixtures of Gaussian linear regression models with random covariates . Another avenue of future research is represented by the study of seemingly unrelated clusterwise regression models explicitly accounting for contaminated data and space/time-dependent observations.
Funding Open access funding provided by Alma Mater Studiorum -Università di Bologna within the CRUI-CARE Agreement. This study has been funded by the University of Bologna, Italy.

123
Data availability The datasets generated and analysed during the current study are available from the corresponding author upon request. The real dataset on canned tuna sales which has been analysed on Sect. 4 is freely downloadable from the R package bayesm (Rossi 2012).

Code Availability
The R code developed by the authors for the implementation of the ECM algorithm illustrated in Sects. 2.4-2.6 is available from the corresponding author upon request.

Declarations
Competing interest The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. =